What would I use to remove escaped html from large sets of data

Question

Our database is filled with articles retrieved from RSS feeds. I was unsure of what data I would be getting and how much filtering was already set up (WP-O-Matic WordPress plugin using the SimplePie library). The plugin does some basic encoding before insertion using WordPress's built-in post insert function, which also does some filtering. Between the RSS feed's encoding, the plugin's encoding using PHP, WordPress's encoding, and SQL escaping, I'm not sure where to start.

The data is usually at the end of the field after the content I want to keep. It is all on one line, but separated out for readability:

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img>

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img>

&lt;img src=&quot;http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk&quot;

Notice how some of the images are escaped and some aren't. I believe this is because the last part was cut off so that it was unrecognizable as an html tag, which then caused it to be html encoded while the actual img tags were left alone.

Another record has only this in one of the fields, which means the RSS feed gave me nothing for the item (filtered out now, but I have a bunch of records like this):

&lt;img src=&quot;http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg&quot; alt=&quot;post_img&quot; width=&quot;80&quot;

All extracted samples are on one line, but broken up for readability. Otherwise, they are copied exactly from the database from the command line mysql client.

Question: What is the best way to work with the above escaped html (or portion of an html tag), so I can then remove it without affecting the content?

I want to remove it because the images at the end of the field usually have nothing to do with the content. In the case of the feedburner ones, feedburner adds those to every single article in a feed. Other times, they're broken links surrounding broken images. The point is not the valid html img tags, which can be removed easily. It's the mangled tags which, if decoded, will not be valid html and so will not be parsable with your standard html parsers.

[EDIT] If it was just a matter of pulling the html I wanted out and doing a strip_tags and reinserting the data, I wouldn't be asking this question.

The portion that I have a problem with is that what used to be an img tag was html encoded and the end cut off. If it's decoded it will not be an html tag, so I cannot parse it the usual way.

With all the &lt;img src=&quot; crap, I can't get my head around searching for it other than SELECT ID, post_content FROM table WHERE post_content LIKE '%&lt;img%' which at least gets me those posts. But when I get the data, I need a way to find it and remove it while keeping the rest of the content.

[/EDIT]

[EDIT 2]

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs">&lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot;

The part I want to keep:

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.

To reiterate: It's not about removing the valid html img tags. That's easy. I need to be able to find specifically the &lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot; if it's part of a pattern like img tag, img tag, mangled img tag, or anchor, img, anchor, img, img, mangled img, and so on, but not remove <img if it is indeed part of the article. Out of the few dozen samples I've reviewed, it's been pretty consistent that this mangled img tag is at the end of the field.

The other one is the single mangled image tag. It's consistently a mangled flickr img tag, but as above, I can't just search for <img as it could be a valid part of the content.

The problem lies in that I can't simply decode it and parse it as HTML, because it will not be valid html. [/EDIT 2]

You're kidding right? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Ether, Apr 13 '10 at 17:14
Using regular expressions to parse arbitrary HTML is usually a bad idea but I'm not so sure that's what you're doing. What are you looking for as the end result? Edit your question above and show what you want to end up with for each example you've included. — benrifkah, Apr 13 '10 at 18:53
My concern is the escaped html, which will then *not* become an html tag, because the closing part is gone. So, no, this is not a question about parsing HTML — Elizabeth Buckwalter, Apr 13 '10 at 19:19
@Ether, please read the question thoroughly. I am not asking about html parsing. — Elizabeth Buckwalter, Apr 13 '10 at 19:26
@Elizabeth => so the only part you want to preserve is the link url? — Eric Strom, Apr 13 '10 at 20:40
@Elizabeth From reading between the lines I now think that you want to strip out everything that you have in your examples and only keep some content that you haven't shown. If this is the case you will need to post an example of the content that you want to keep. Any response that you receive here may not work without knowledge of what you want to keep. Further, if the other content contains HTML then your question does indeed involve parsing HTML because you need code to search the entire content to find/strip this broken HTML. — benrifkah, Apr 13 '10 at 21:03
@benrifkah, added. This question is surprisingly difficult to ask. — Elizabeth Buckwalter, Apr 13 '10 at 21:53
I'll work on testing these answers tomorrow. I have some other tasks on this project that are taking priority at the moment. Thanks! — Elizabeth Buckwalter, Apr 14 '10 at 00:41
I haven't had time to work on this, but I've accepted the answer that is closest to my problem. — Elizabeth Buckwalter, Apr 19 '10 at 16:44

Dave Sherohman · Answer 1 · 2010-04-14T09:10:09.807

3

The best way is to:

Install HTML::Entities from CPAN and use that to unescape the URIs.
Install HTML::Parser from CPAN and use that to parse and remove the URIs after they're unescaped.

Regexes are not a suitable tool for this task.

edited Apr 14 '10 at 09:10

answered Apr 13 '10 at 19:02

Dave Sherohman



I don't think that URI unescaping is what she needs. URI unescaping is for changing "%5D" into "]" and other things. What may be useful is the decode_entities function from HTML::Entities to turn the "&lt;" into "<" and so on. – benrifkah Apr 13 '10 at 19:08
The data won't become properly formatted html once they are decoded. – Elizabeth Buckwalter Apr 13 '10 at 19:25
So are you looking to turn what you have into valid HTML? If so then your question title is a little misleading. It asks how to remove HTML. Please clarify. – benrifkah Apr 13 '10 at 19:48
@benrifkah: Whoops, you are correct. I meant HTML::Entities, not URI::Escape. Answer edited to fix this. – Dave Sherohman Apr 14 '10 at 09:10

score 2 · Answer 2 · answered Apr 13 '10 at 20:28

I wouldn't strip it out. It's far from unrecoverable junk.

First apply HTML::Entities::decode_entities conditionally (use the occurrence of &lt; as the first character as a heuristic), then let HTML::Tidy::libXML->clean(…, 'UTF-8', 1) reconstruct the mark-up as intended. clean returns a whole document, but it's trivial to extract just the needed img element.
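The conditional-decoding heuristic described above might look roughly like the sketch below. To keep it free of CPAN dependencies, decode_basic is a hypothetical stand-in for HTML::Entities::decode_entities that only knows the four entities seen in the samples, and decode_if_escaped is likewise a made-up helper name. The HTML::Tidy::libXML step is not shown.

```perl
use strict;
use warnings;

# Stand-in for HTML::Entities::decode_entities (an assumption made to keep
# this sketch dependency-free); it covers only the entities in the samples.
my %ENTITY = (lt => '<', gt => '>', quot => '"', amp => '&');

sub decode_basic {
    my ($text) = @_;
    # Single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    $text =~ s/&(lt|gt|quot|amp);/$ENTITY{$1}/g;
    return $text;
}

# Hypothetical helper: decode a field only when it starts with an escaped "<".
sub decode_if_escaped {
    my ($field) = @_;
    return $field =~ /\A&lt;/ ? decode_basic($field) : $field;
}

print decode_if_escaped('&lt;img src=&quot;a.jpg&quot; width=&quot;80&quot;'), "\n";
```

Fields that already begin with a real tag pass through untouched, so entities inside legitimate article text are not decoded by accident.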

Eric Strom · Accepted Answer · 2010-04-13T22:11:40.210

2

Question updated...

To extract the data you want, you could use this approach:

use strict;
use warnings;
use HTML::Entities qw/decode_entities/;

# Assumption: $raw holds one post_content field, read here from STDIN
my $raw = do { local $/; <STDIN> };
my $decoded = decode_entities($raw);

if ($decoded =~ s{ (<img .+? (?:>.+?</img>|/>)) } {}x) {  # grab the image
    my $img = $1;
    $decoded =~ s{<.+?>}      {}xg;  # strip complete tags
    $decoded =~ s{< [^>]+? $} {}x;   # strip trailing noise

    print $img . $decoded;
}

Using a regex to parse HTML is generally frowned upon. In this case, however, it is more about stripping out segments that match a pattern. After testing the regexes on a larger set of data, you should have an idea of what might need to be tweaked.

Hope this helps.
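As a variation on the "strip trailing noise" idea, the truncated tag can possibly be removed while it is still entity-escaped, which leaves real img tags in the article alone. This is only a sketch: strip_truncated_escaped_img is a hypothetical helper name, and it assumes the mangled fragment always starts with &lt;img and runs to the end of the field without reaching an escaped &gt;.

```perl
use strict;
use warnings;

# Hypothetical helper: drop an escaped "<img ..." fragment that runs to the
# end of the field without ever reaching an escaped closing bracket (&gt;).
sub strip_truncated_escaped_img {
    my ($field) = @_;
    # (?:(?!&gt;).)* consumes characters but refuses to step over "&gt;",
    # so a complete escaped tag earlier in the field cannot match up to \z.
    $field =~ s{&lt;img\b(?:(?!&gt;).)*\z}{}s;
    return $field;
}

my $sample = 'Text <img src="a.jpg" />more &lt;img src=&quot;http://example.com/z&quot;';
print strip_truncated_escaped_img($sample), "\n";
```

Because the pattern only matches the escaped form, an unescaped <img that belongs to the content is never touched, and a field that consists solely of a mangled tag comes back empty.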

edited Apr 13 '10 at 22:11

answered Apr 13 '10 at 20:51

Eric Strom


I haven't had time to test, but from what I can see this is the direction I'd like to go. Since the last img tag is malformed, I believe this will help find it. And capturing might not be a bad idea. Thanks! – Elizabeth Buckwalter Apr 19 '10 at 16:47

score 0 · Answer 4 · edited Apr 14 '10 at 16:02


How about a stupid simple Perl find and replace on the var containing your data...

foreach my $line (@lines) {
    $line =~ s/&lt;/</gi;   # escaped "<" back to a literal "<"
    $line =~ s/&gt;/>/gi;   # escaped ">" back to a literal ">"
}

edited Apr 14 '10 at 16:02

Sinan Ünür


answered Apr 13 '10 at 18:44

onethreefour


damn encoding on this page screwed up my post! ;) I'll try again, but it prolly wont work ;) $line =~ s/<//gi; – onethreefour Apr 13 '10 at 18:46
See what I mean? And there are no lines. It's all on one line. – Elizabeth Buckwalter Apr 13 '10 at 19:20

score 0 · Answer 5 · answered Apr 14 '10 at 00:17

Your best bet will be to recollect all of the articles that are in the database so that they aren't truncated and corrupted. If this is not an option then...

Based on your examples above it looks like you're stripping out everything that follows the text content of each article. In your example the text content is followed by a DIV tag and a bunch of IMG tags that may or may not have been truncated and/or converted into HTML entities.

If all of your records are similar you can strip out everything after the Text content by removing the final div tag and everything that follows it using perl like this:

my $article = magic_to_get_an_article();
$article =~ s/<div>.*//s;
magic_to_store_article($article);

If your records include anything more complex than this you're better off using an HTML parsing module and reading the documentation carefully to find out how it handles invalid HTML.
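A slightly more defensive variant of the substitution above could cut only the final div block, and only when that block looks like a FeedBurner footer. This is a sketch under assumptions: strip_feed_footer is a hypothetical helper, and the feeds.feedburner.com/~ff/ check is taken from the sample record, so other feeds may need a different marker.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the last "<div>...end of field" block, but only
# if that block contains a FeedBurner "~ff" link like the ones in the sample.
sub strip_feed_footer {
    my ($article) = @_;
    # (?:(?!<div>).)* cannot cross another <div>, so the only way to reach \z
    # is to start matching at the final <div> in the field.
    if ($article =~ m{<div>((?:(?!<div>).)*)\z}s) {
        my ($start, $tail) = ($-[0], $1);   # save before the next match resets them
        return substr($article, 0, $start)
            if $tail =~ m{feeds\.feedburner\.com/~ff/};
    }
    return $article;   # nothing that looks like a feed footer
}

my $sample = 'Story text.<div> <a href="http://feeds.feedburner.com/~ff/x?a=1">'
           . '&lt;img src=&quot;http://feeds.feedburner.com/~ff/x?d=1&quot;';
print strip_feed_footer($sample), "\n";
```

Articles whose own markup uses div elements keep them, because only the last block is considered and it is left in place when it contains no feed link.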

Sinan Ünür · Answer 6 · 2010-04-14T16:26:26.320

Given the sample input and output you give at the end of your post, the following will get you the desired output:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new( \*DATA );

if ( my $tag = $parser->get_tag('img') ) {
    print $tag->as_is;
    print $parser->get_text('div');
}

__DATA__
<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs">&lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot;

Output:

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.

However, I am puzzled as to the size and scope of each chunk you are supposed to process.
