I am trying to index some content from a series of .html files that share the same format.

So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...

The idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).

So then I have something like this:

$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);

Let's look at $regex for a sec. Ignore its spaces; I only put them there because otherwise some characters disappear. I specify that it will start with ">. Then I capture the number inside the []. Then I single out the </a>. So far so good.

At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<), like that, the text will be interrupted when an underline or italics tag is found. However, if I put (<a|</p), the resulting array ends up empty. I've tried changing that to only (<a), but it seems that two characters mess up the whole thing.

What can I do? I've been struggling with this all day.

Gumbo
navand

3 Answers

PHP Tidy is your friend. Don't use regexes.

Community
Vivin Paliath

Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine given your example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample where this doesn't work for you.

Though I agree that a regex isn't a parser, it sounds like what you're looking for is part of a regularly behaved string - which is exactly what regexes are good at.
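
To make the suggestion above concrete, here is a minimal sketch rather than a definitive fix. The sample string is made up, and it assumes each entry looks like `">[number]</a>text` and ends at the next `<a` or `</p`, as the question describes. One possible reason the `(<a|</p)` variant returned nothing: with `/` as the delimiter, the unescaped `/` in `</p` ends the pattern early and PHP rejects it. Using `~` as the delimiter sidesteps that.

```php
<?php
// Hypothetical sample in the format described in the question.
$docString = '<p><a href="meh">[18]</a> blah <i>blah</i> blah '
           . '<a href="x">[19]</a> more <u>text</u></p>';

// "~" as delimiter, so "/" needs no escaping.
// (\d+)  captures the number inside the brackets.
// (.*?)  lazily captures the text, including inline tags such as <i> or <u>.
// (?=...) stops at the next "<a" tag or "</p" without consuming it.
$regex = '~">\[(\d+)\]</a>(.*?)(?=<a\b|</p)~s';

preg_match_all($regex, $docString, $matches, PREG_SET_ORDER);

// Build a number => text map from the capture groups.
$entries = [];
foreach ($matches as $m) {
    $entries[(int) $m[1]] = trim($m[2]);
}

print_r($entries);
```

The lazy quantifier plus lookahead keeps several entries on one line separate, whereas a greedy `(.*)` would run on to the last `<a` or `</p` on that line.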

Iiridayn

As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.

I suggest using an XML parser such as PHP's DOMDocument.

Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then read each one's content from its nodeValue property.

It might look like this:

// Create a DOMDocument object
$html = new DOMDocument();

// Load the URL's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");

// Make an array to hold the text
$anchors = array();

// Loop through the a tags and store their text content in the array
foreach ($html->getElementsByTagName('a') as $link) {
    $anchors[] = $link->nodeValue;
}
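
Note that nodeValue returns plain text only, so inline tags such as <i> or <u> are dropped. If they need to be kept, one option is to serialize the nodes that follow each numbered anchor with saveHTML. The following is a rough sketch under assumptions: the sample markup is made up, and it assumes the text for each entry sits in the sibling nodes between one `[n]` anchor and the next.

```php
<?php
// Hypothetical sample in the format described in the question.
$source = '<p><a href="meh">[18]</a> blah <i>blah</i> blah '
        . '<a href="x">[19]</a> more <u>text</u></p>';

// Collect parser warnings instead of printing them for sloppy markup.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();

$entries = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    // Only anchors whose text is a bracketed number, e.g. "[18]".
    if (!preg_match('/^\[(\d+)\]$/', trim($link->nodeValue), $m)) {
        continue;
    }

    // Serialize the following siblings, markup included, up to the next anchor.
    $html = '';
    for ($node = $link->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'a') {
            break;
        }
        $html .= $dom->saveHTML($node);
    }

    $entries[(int) $m[1]] = trim($html);
}

print_r($entries);
```

The sibling walk stops at the end of the enclosing element, which here plays the role of the `</p` boundary from the question.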

One alternative to this style of XML/HTML parser is phpQuery. The documentation on its project page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.

JAL