preg_match , regexp , php , extract text from html

Question

I'm trying to extract "Florida (FL)" from http://www.auctionarms.com/search/displayitem.cfm?itemnum=9736364&oh=216543. My code is

//get location
$pattern = "/(State)<\/i>:<\/td>(.*)<\/td>/";
preg_match_all($pattern, $htmlContent, $matches);
print_r($matches);

Any idea why it is not working?

This seems like the constant mantra of SO: avoid using regexp to parse html if you possibly can. It is *not* the tool for the job. — Jakub Hampl, May 22 '10 at 04:20

score 1 · Answer 1 · answered May 22 '10 at 04:57

When you have (State) in a regex, it matches the term State in the input string as a capture group; it won't match literal parentheses in the input. You'll need to escape them, as you have with the /s: /\(State\)<\/....
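To illustrate the difference, here is a minimal sketch. The `$sample` string is a made-up one-line fragment, not the actual page markup, and the `#` delimiter is just one way to avoid escaping the slashes.

```php
<?php
// Hypothetical one-line sample; the real page markup may differ.
$sample = '<i>(State)</i>:</td><td>Florida (FL)</td>';

// Unescaped: (State) is a capture group for the bare word "State", which
// would have to be followed directly by "</i>". The ")" in the input
// gets in the way, so this returns 0.
$unescaped = preg_match('#(State)</i>#', $sample);

// Escaped: \( and \) match the literal parentheses. Using # as the
// delimiter means the slashes in the closing tags need no escaping.
$escaped = preg_match('#\(State\)</i>:</td><td>(.*?)</td>#', $sample, $m);

var_dump($unescaped);  // int(0)
var_dump($escaped);    // int(1)
echo $m[1], PHP_EOL;   // Florida (FL)
```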

Then there's the problem that there is a lot of whitespace around the value, including newlines. For . to match across newlines you'll need the s modifier (the m modifier only changes how ^ and $ behave). There is also a <b> tag around the header, which you don't seem to have included in the regex. Even if you fix these problems, you're highly reliant on the exact markup used by the website you're scraping. This is a general problem you'll encounter when trying to parse HTML using regular expressions. It would be a better idea to use an HTML parser (e.g. creating a new DOMDocument and calling its loadHTML method).
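A rough sketch of the parser approach follows. The `$html` fragment is an assumed layout (a label cell followed by a value cell), not the real page source, so the XPath expression would likely need adjusting. It looks for the innermost `td` whose text contains "(State)" and takes the next sibling cell, which does not rely on classes or IDs.

```php
<?php
// Assumed markup for illustration only; the real page may be laid out differently.
$html = <<<'HTML'
<table>
  <tr>
    <td><b><i>Location (State)</i>:</b></td>
    <td>
      Florida (FL)
    </td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally; old, sloppy HTML tends to trigger many.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// not(.//td) skips outer cells that merely wrap a nested table.
$cells = $xpath->query(
    "//td[not(.//td) and contains(., '(State)')]/following-sibling::td[1]"
);

// trim() removes the newlines and indentation around the value.
$location = $cells->length > 0 ? trim($cells->item(0)->textContent) : null;
echo $location, PHP_EOL;  // Florida (FL)
```

Because the lookup keys on the label text and the cell order, it tolerates changes in whitespace and inline tags such as `<b>` or `<i>`, though it still breaks if the label wording or the column order changes.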

I was suggesting HTML parser but after looking at the web page I changed my mind... no classes, no IDs, no css; very hard to locate the word State. — Salman A, May 22 '10 at 05:46

score 0 · Answer 2 · edited May 23 '17 at 10:32

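I believe the reason is that the string you're trying to match is on the next line. By default . does not match newlines, so you'll need the s (dot-all) modifier; the m (multi-line) modifier only changes how ^ and $ behave. Making the group lazy also keeps it from running on to the last </td> on the page:

$pattern = "/\(State\)<\/i>\:<\/td>(.*?)<\/td>/s";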
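As a quick check of how the two modifiers behave in PCRE, here is a small sketch. The `$sample` string is a hypothetical two-line fragment with the value on the line after the label; only the modifier differs between the two calls.

```php
<?php
// Hypothetical fragment: the value cell starts on the line after the label.
$sample = "<i>(State)</i>:</td>\n<td>Florida (FL)</td>";
$base   = '\(State\)<\/i>:<\/td>(.*?)<\/td>';

// m only changes ^ and $; the dot still refuses to cross the newline.
$withM = preg_match('/' . $base . '/m', $sample, $mm);

// s lets the dot match newlines, so the group can span both lines.
$withS = preg_match('/' . $base . '/s', $sample, $ms);

var_dump($withM);  // int(0)
var_dump($withS);  // int(1)

// The capture still contains the newline and the opening <td>; strip them.
$value = trim(strip_tags($ms[1]));
echo $value, PHP_EOL;  // Florida (FL)
```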

But remember: attempting to parse HTML with regular expressions makes the unholy child weep the blood of virgins. See:

RegEx match open tags except XHTML self-contained tags

answered May 22 '10 at 04:55

awgy
