2

I have a PHP program that, at some point, needs to analyze a large amount of HTML+JavaScript text to extract information. The parsing needs to happen in two steps:

  1. Separate all "HTML groups" to parse.
  2. Parse each HTML group to get the needed information.

In the 1st parse it needs to find:

<div id="myHome"

And start capturing after that tag. Then stop capturing before

<span id="nReaders"

And capture the number that comes after this tag and stop.

The 2nd parse takes capture group 1 from the 1st parse (group 0 holds the whole match and group 2 holds the number) and then find .

I already have code that does this and it works. Is there a way to improve it and make it cheaper for the machine to process?

preg_match_all('%<div id="myHome"[^>]>(.*?)<span id="nReaders[^>]>([0-9]+)<"%msi', $data, $results, PREG_SET_ORDER);
foreach ($results as $result) {
    preg_match_all('%<div class="myplacement".*?[.]php[?]((?:next|before))=([0-9]+).*?<tbody.*?<td[^>]>.*?[0-9]+"%msi', $result[1], $mydata, PREG_SET_ORDER);
    // takes care of the data and finishes the program
}

Note: I need this for a freeware program, so it must be as general as possible and, if possible, not use PHP extensions.

ADD: I omitted some parts here because I didn't expect answers like these. I also need to parse text inside one of the tags in the document. It may be the 6th, 7th or 8th tag, but I know it comes after a certain tag. The parser I've checked (thanks, profitphp) does work for finding the script tag. What now? There is more than one tag with the same class, and I want them all, but only those that also have one of a list of classes. Where can I find instructions, demos and limitations of DOM parsers (like the one at http://simplehtmldom.sourceforge.net/)? I need something that will work on, at least, a large number of free servers. Another thing: how do I parse this part: "php?=([0-9]+)" with those HTML parsers?

brunoais
  • The problem sounds better suited to an [html parser](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php). See the [answers here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) which explain why. – moinudin Dec 22 '10 at 19:44
  • As a general rule, [don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – lonesomeday Dec 22 '10 at 19:45
  • uh, the daily "how abuse regex for html parsing" thread. – cbrandolino Dec 22 '10 at 19:47

4 Answers

3

If you're concerned about efficiency (and indeed accuracy), don't attempt to parse HTML using regex.

You should use a parser, such as PHP's DOM.
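As a rough illustration of the first step from the question, here is a minimal sketch using `DOMDocument` and `DOMXPath`. The sample markup is invented for the example (only the `myHome` and `nReaders` ids come from the question), so the real page structure may need different queries. The DOM extension is usually enabled by default in PHP builds, but that is worth checking on the target hosts.

```php
<?php
// Hypothetical sample markup: the ids come from the question, the rest is invented.
$data = '<html><body>'
      . '<div id="myHome"><p>group content</p></div>'
      . '<span id="nReaders">42</span>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Locate the container and the reader count by id.
$home    = $xpath->query('//div[@id="myHome"]')->item(0);
$readers = $xpath->query('//span[@id="nReaders"]')->item(0);

// Serialize the group for a second pass, and read the number as an integer.
$groupHtml   = $home !== null ? $doc->saveHTML($home) : null;
$readerCount = $readers !== null ? (int) trim($readers->textContent) : null;
```

In practice the second pass would probably run further XPath queries relative to `$home` rather than re-parsing `$groupHtml` as a string.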

John Parker
  • Using the code I made there are no problems with accuracy. Anyway, I'll take a look at that PHP extension. As this is for a program, it should work with as many servers as possible – brunoais Dec 22 '10 at 21:21
  • @user551625 - Fine. Feel free to update this question whenever you eventually find some files it doesn't work on. – John Parker Dec 22 '10 at 21:24
1

As noted above, regex is not a good fit for this. You'll be better off using something like this:

Robust and Mature HTML Parser for PHP
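For the "all tags with a given class, but only from a list of classes" case raised in the question, one possible approach with PHP's bundled DOM classes is sketched below. The markup and the `$wanted` list are made up for illustration, and other parsers from the linked question will have their own selector syntax.

```php
<?php
// Hypothetical markup: several elements share classes, only some are wanted.
$html = '<div class="myplacement first">A</div>'
      . '<div class="other">B</div>'
      . '<div class="myplacement">C</div>'
      . '<div class="sidebar wide">D</div>';

// Assumed list of accepted class names.
$wanted = array('myplacement', 'sidebar');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// XPath 1.0 has no class selector, so test for a whole-word match inside @class.
$tests = array();
foreach ($wanted as $class) {
    $tests[] = 'contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")';
}
$query = '//div[' . implode(' or ', $tests) . ']';

$xpath = new DOMXPath($doc);
$texts = array();
foreach ($xpath->query($query) as $node) {
    $texts[] = $node->textContent;
}
// $texts now holds the text of every matching div, in document order.
```

The `concat`/`normalize-space` wrapping keeps a class such as `myplacement2` from matching when only `myplacement` is wanted.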

profitphp
  • The html parser that appears here looks useful and seems to work with a part of my regex. How about parsing code written inside tags? – brunoais Dec 22 '10 at 21:41
0

Efficiency doesn't matter if your results are incorrect. Parsing HTML with regexes will lead to incorrect results down the road. Use a parser.
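Using a parser does not rule out regexes entirely: the parser can locate the attribute or script body, and a short regex can then read a value out of that small string. The sketch below assumes invented markup (the `next`/`before` parameter names come from the question's pattern; the `total` variable in the script is hypothetical).

```php
<?php
// Hypothetical markup: parameter names follow the question, everything else is invented.
$html = '<div class="myplacement">'
      . '<a href="list.php?next=120">next</a>'
      . '<a href="list.php?before=80">previous</a>'
      . '</div>'
      . '<script>var total = 7;</script>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// The parser finds each href; a short regex then reads the one value inside it.
$links = array();
foreach ($xpath->query('//div[@class="myplacement"]//a/@href') as $attr) {
    if (preg_match('/\.php\?(next|before)=([0-9]+)/', $attr->value, $m)) {
        $links[$m[1]] = (int) $m[2];
    }
}

// Script bodies are plain text to the parser, so read the text and match on it.
$total = null;
foreach ($xpath->query('//script') as $script) {
    if (preg_match('/var total = ([0-9]+);/', $script->textContent, $m)) {
        $total = (int) $m[1];
    }
}
```

Here each regex only ever sees a single attribute value or a single script body, so changes elsewhere in the page's markup do not affect the match.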

Andy Lester
  • Given how the pages it will parse are structured, it will never fail. – brunoais Dec 22 '10 at 21:27
  • Is there an HTML parser for PHP that I can use to find what I want this way (I know no other algorithm to find the information)? If so, where can I find it? I need it to also work inside a script tag – brunoais Dec 22 '10 at 21:32
0

I found a way to create efficient searches.

If you want to search for "A huge string in a whole text" you can do it this way:

(?:(?:[^A]*A)+? huge string in a whole text)

It always works. It only creates a backtracking point at every 'A' character and not at every single character. Because of that, it is efficient in both memory and processing power. If there are two options, it also works without a problem:

(?:(?:[^AB]*AB)+?(?: huge string in a whole text|e the huge string in a whole text))

Up until now it has never failed.
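A small check of the first pattern in PHP might look like the sketch below. The haystack is invented, a capture group is added around the target so its offset can be read, and the result is compared against `strpos`. It only demonstrates that the pattern finds the right position; it does not measure the backtracking or memory behaviour described above.

```php
<?php
// Hypothetical haystack with several decoy "A" characters before the real match.
$text = 'An Apple. A big string. A huge string in a whole text. The end.';

// Pattern from this answer, with a capture group added around the target.
$pattern = '/(?:[^A]*A)+?( huge string in a whole text)/';

$found = preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE);

// Offset of the leading "A" of the phrase (one character before the capture).
$offset = $found === 1 ? $m[1][1] - 1 : -1;

// Cross-check against a plain substring search.
$expected = strpos($text, 'A huge string in a whole text');
```

For a purely literal needle like this one, `strpos` is the simpler tool; the pattern is more relevant when the tail after the anchor character is itself a regex.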

Perception
brunoais