Broken iframes and HTML::TreeBuilder

We had a situation last week where someone had entered a broken <iframe> tag in a job description and our cleanup code didn’t properly remove it. This caused the text after the <iframe> to render as escaped HTML.

We needed to prefilter the HTML and just remove the <iframe>s. The hardest part was figuring out what HTML::TreeBuilder was emitting and what I needed to do with it. It was obvious that the solution would have to be recursive, since HTML is recursive (there could be nested or multiple unclosed iframes!). Several attempts failed until I finally dumped the data structure in the debugger and spotted that HTML::TreeBuilder was adding “implicit” nodes. These essentially help it do bookkeeping, but don’t contain anything that has to be re-examined to do the cleanup properly. Worse, the first node contains all the text for the current level, so recursing on it was leading me off into infinite depths: I kept looking for iframes in the content of the leftmost node, finding them, and uselessly recursing again on the same HTML.
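To see the implicit nodes without stepping through a debugger, something like the following sketch may help. The input fragment is a made-up example, not the job description from the incident. The methods used (new_from_content, descendants, tag, implicit, dump) are standard HTML::TreeBuilder and HTML::Element calls.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: a fragment containing an unclosed iframe.
my $html = '<p>Before</p><iframe src="http://example.com/"><p>After</p>';
my $tree = HTML::TreeBuilder->new_from_content($html);

# Walk the root and all of its descendant elements, noting which ones
# TreeBuilder invented to wrap the fragment.
my @implicit_tags;
for my $node ($tree, $tree->descendants) {
    my $origin = $node->implicit ? 'implicit' : 'from input';
    push @implicit_tags, $node->tag if $node->implicit;
    printf "%-8s %s\n", $node->tag, $origin;
}

# dump() prints the whole tree, text included, to STDOUT.
$tree->dump;
```

Because the input is a bare fragment, the wrapping html, head and body elements are the ones reported as implicit.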

The other interesting twist is that once I dropped the implicit nodes with a grep, I still needed to handle the non-implicit nodes in two different ways. If a node contains no iframe tags, it is returned untouched via as_HTML. If it contains one or more, I use the content method to take the node apart and process the pieces; the recursion un-nests the iframes and lets us clean up individual subtrees, while any non-iframe pieces still come back untouched via as_HTML.
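The two-way decision can be sketched in isolation like this. The markup and the "keep"/"dismantle" labels are illustrative assumptions; only look_down and as_HTML come from the library.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup: one clean subtree, one with an iframe inside.
my $tree = HTML::TreeBuilder->new_from_content(
    '<div><p>plain</p></div><div><iframe src="x"></iframe></div>'
);

my @verdicts;
for my $div ($tree->look_down(_tag => 'div')) {
    # In boolean context look_down returns the first match or undef.
    # It checks the node itself as well as everything beneath it.
    if ($div->look_down(_tag => 'iframe')) {
        push @verdicts, 'dismantle';   # needs to be taken apart via content
    } else {
        push @verdicts, 'keep';        # safe to emit via as_HTML unchanged
        print $div->as_HTML, "\n";
    }
}
print "@verdicts\n";
```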

Lastly, content returns an array reference whose items can be plain strings, so I needed to check for that case and recurse on every item in the array to be sure I’d filtered everything properly. The initial checks handle the trivial “no input, so no output” case and the “not a reference” case, which covers the starting string.
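As a rough illustration of why the reference check matters, this sketch lists the content of a hypothetical paragraph. It uses content_list, the list-returning counterpart of content, and ref to tell element objects from plain strings.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical paragraph mixing text and a child element.
my $tree = HTML::TreeBuilder->new_from_content('<p>Hello <b>bold</b> world</p>');
my $p    = $tree->look_down(_tag => 'p');

my @kinds;
for my $item ($p->content_list) {
    if (ref $item) {
        # Child elements are HTML::Element objects.
        push @kinds, 'element';
        printf "element: <%s>\n", $item->tag;
    } else {
        # Text segments are ordinary Perl strings, not objects.
        push @kinds, 'text';
        print "text: '$item'\n";
    }
}
```

Calling a method such as as_HTML on one of the plain strings would fail, which is why the filter has to treat strings and nodes separately.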

We do end up doing multiple invocations of HTML::TreeBuilder on the text as we recurse, but we don’t recurse at all unless there’s an iframe, and it’s unusual to have more than one.

Here’s the code:

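use HTML::TreeBuilder;

sub _filter_iframe_content {
  my($input) = @_;
  # No input, so no output. Test for definedness and length rather than
  # truth so that a text segment consisting of "0" is not dropped.
  return '' unless defined $input && length $input;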

  my $root;
  # We've received a string. Build the tree.
  if (!ref $input) {
    # Build a tree to process recursively.
    $root = HTML::TreeBuilder->new_from_content($input);
    # There are no iframe tags, so we're done with this segment of the HTML.
    return $input unless $root->look_down(_tag=>'iframe');
  } elsif (ref $input eq 'ARRAY') {
    # We got multiple strings from a content call; handle each one in order, and
    # return them, concatenated, to finish them up.
    return join '', map { _filter_iframe_content($_) } @$input;
  } else {
    # The input was a node, so make that the root of the (sub)tree we're processing.
    $root = $input;
  }

  # The 'implicit' nodes contain the wrapping HTML created by
  # TreeBuilder. Discard that.
  my @descendants = grep { ! $_->implicit } $root->descendants;

  # If there is not an iframe below the content of the node, return
  # it as HTML. Else recurse on the content to filter it.
  my @results;
  for my $node (@descendants) {
    # Is there an iframe in here? look_down checks the node itself and
    # everything below it, so there is no need to re-parse its HTML.
    if ($node->look_down(_tag=>'iframe')) {
      # Yes. Recurse on the node, taking it apart.
      push @results, _filter_iframe_content($node->content);
    } else {
      # No, just return the whole thing as HTML, and we're done with this subtree.
      push @results, $node->as_HTML;
    }
  }
  return join '', @results;
}
