Need help with regular expressions in PHP

Question

I am trying to index some content from a series of .html files that share the same format.

So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...

And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).

So then I have something like this:

$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);

Let's look at $regex for a sec. Ignore its spaces; I only put them there because otherwise some characters disappear. I specify that the match will start with ">. Then I capture the number inside the []. Then I single out the </a>. So far so good.

At the end, I do a (.)*?(<). This is the turning point. If I leave the last bit, (<), like that, the text is cut off as soon as an underline or italics tag is found. However, if I put (<a|</p), the resulting array ends up empty. I've tried changing that to only (<a), but it seems that two characters mess up the whole thing.

What can I do? I've been struggling with this all day.

score 1 · Answer 1 · edited May 23 '17 at 12:19

PHP Tidy is your friend. Don't use regexes.

answered Nov 10 '10 at 19:15

Vivin Paliath

score 1 · Answer 2 · answered Nov 10 '10 at 23:48

Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine given your example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample where this doesn't work for you.

Though I agree that a regex isn't a parser, it sounds like what you're looking for is part of a regularly behaved string - which is exactly what regexes are good at.
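To see how a pattern along these lines might be applied, here is a minimal sketch. The sample string is made up for illustration, and the pattern is a lazy variant of the one above (with \d+ for the number and an explicit <\/a>), which is an assumption on my part rather than something tested against the real pages.

```php
<?php
// Hypothetical sample line; the real pages may differ.
$docString = '<p><a href="meh">[18]</a> blah <i>blah</i> blah '
           . '<a href="next">[19]</a> more <u>text</u></p>';

// Lazy variant (an assumption, not the exact pattern from the answer):
//   ">\[(\d+)\]<\/a>   the bracketed number, then the closing anchor tag
//   (.*?)              the text, inline tags included, as short as possible
//   <(?:a\b|\/p)       stop at the next <a or </p without capturing it
$regex = '/">\[(\d+)\]<\/a>(.*?)<(?:a\b|\/p)/s';

$entries = array();
if (preg_match_all($regex, $docString, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is the number, $m[2] the text that follows it
        $entries[] = array('number' => $m[1], 'text' => trim($m[2]));
    }
}
print_r($entries);
```

Because / is the pattern delimiter here, the slash in </p has to be written as \/. An unescaped slash ends the pattern early, which may be one reason the (<a|</p) attempt came back empty.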

JAL · Answer 3 · 2010-11-10T19:19:26.837

As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.

I suggest using an XML/HTML parser such as PHP's DOMDocument.

Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then read each one's content from its nodeValue property.

It might look like

// Create a DOMDocument object
$html = new DOMDocument();

// Load the URL's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");

// Make an array to hold the text
$anchors = array();

// Loop through the a tags and store each one's text content (markup stripped)
foreach ($html->getElementsByTagName('a') as $link) {
    $anchors[] = $link->nodeValue;
}
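Note that nodeValue returns plain text only, so tags such as <i> and <u> are lost. If they need to be kept, one option is to walk the nodes that follow each numbered anchor and serialize them with saveHTML. The following is a rough sketch. It assumes the number sits inside the a tag and the text follows it as siblings up to the next a tag, and the sample markup is invented for illustration.

```php
<?php
// Hypothetical sample markup; the real pages may differ.
$sample = '<p><a href="meh">[18]</a> blah <i>blah</i> blah '
        . '<a href="next">[19]</a> more <u>text</u></p>';

libxml_use_internal_errors(true);   // keep sloppy real-world HTML from raising warnings
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$entries = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    // Only anchors whose text looks like "[18]" are of interest.
    if (!preg_match('/^\[(\d+)\]$/', trim($link->nodeValue), $m)) {
        continue;
    }
    // Collect the siblings after the anchor, markup included,
    // until the next <a> or the end of the parent element.
    $text = '';
    for ($node = $link->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'a') {
            break;
        }
        $text .= $doc->saveHTML($node);
    }
    // Keyed by the bracketed number (PHP stores numeric string keys as integers).
    $entries[$m[1]] = trim($text);
}
print_r($entries);
```

For a live page, loadHTMLFile with the URL would replace the loadHTML call on the sample string.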

One alternative to this style of XML/HTML parser is phpQuery. The documentation on its page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.
