3 Regexp Patterns into one

Question

I have 3 regexp patterns that parse a website (bit.ly/1cjZR29) into a nicer form:

$line[$item] = preg_replace("/\<(td|th|table|tr|div)(.*?)\>/", "|", $line[$item]);
$line[$item] = preg_replace("/\<\/(td|th|table|tr|div)\>/", "|", $line[$item]);
$line[$item] = preg_replace("/(.)\\1{3,}/sS", '$1', $line[$item]);

I want to join them together into 1 line.

When I tried

$line[$item] = preg_replace("/\<(td|th|table|tr|div)(.*?)\>(.*)\<\/(td|th|table|tr|div)\>/", "|", $line[$item]);

It didn't match anything. The 3rd pattern is for deleting whitespace. Can anybody help me? Thanks in advance.

Might be worth reading this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Marty, Aug 02 '13 at 07:39

score 2 · Answer 1 · edited May 23 '17 at 12:28


You shouldn't really be using regular expressions to parse HTML, for the reasons given in the post that @Marty Wallace linked. You could use a parser instead, such as the PHP Simple DOM Parser.
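As a rough illustration of the parser approach, here is a minimal sketch that uses PHP's built-in DOMDocument instead of the Simple DOM Parser named above (a substitution made only to keep the example free of dependencies). The sample markup is made up and merely stands in for the rankings table.

```php
<?php
// Hypothetical sample markup; the real page would be fetched first.
$html = '<table><tr><th>Rank</th><th>Name</th></tr>'
      . '<tr><td>1</td><td>n0ise</td></tr></table>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->childNodes as $cell) {
        // Keep only <td>/<th> elements; skip text nodes between cells.
        if ($cell instanceof DOMElement
            && in_array(strtolower($cell->tagName), ['td', 'th'], true)) {
            $cells[] = trim($cell->textContent);
        }
    }
    $rows[] = '| ' . implode(' | ', $cells) . ' |';
}

$result = implode("\n", $rows);
echo $result, "\n";
```

Each table row becomes one pipe-delimited line, and the tags never have to be matched by hand.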

Also, your regular expressions are already relatively complex as they are; trying to merge them will only turn them into a maintenance nightmare.
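If the goal is a single statement rather than a single pattern, one option is to pass arrays to preg_replace, which applies the patterns in order and keeps each one readable. The sketch below is an assumption about what would suit the question, not the asker's code: it folds the opening-tag and closing-tag patterns into one by making the slash optional, and the sample line is made up.

```php
<?php
// Hypothetical sample line; the real input comes from the scraped page.
$line = '<tr><td class="a">1</td>     <td>n0ise</td></tr>';

$line = preg_replace(
    [
        '/<\/?(td|th|table|tr|div)\b[^>]*>/', // opening or closing tag
        '/(.)\1{3,}/s',                        // 4+ repeats of one character
    ],
    [
        '|',  // every matched tag becomes a pipe
        '$1', // a run of repeats collapses to a single copy
    ],
    $line
);

echo $line, "\n"; // ||1| |n0ise||
```

The five spaces between the cells collapse to one, while the pairs of pipes stay as they are because the third pattern only fires on four or more repeats.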

answered Aug 02 '13 at 07:44 · npinti

WebChemist · Accepted Answer · 2013-08-02T11:31:52.363

You can combine the first 2 rules to get the content between tags; the whitespace is where it gets tricky. A pattern can have conditional (alternative) matches, but a single preg_replace rule has only one replacement. So you can say "match an HTML tag or excess whitespace and replace it with this one thing", but you can't say "when it's an HTML tag replace it with this, when it's whitespace replace it with that". The best you can do is check for whitespace directly before or after tags.
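If a callback is acceptable, preg_replace_callback is one way to get a different replacement per alternative from a single pattern. This is a sketch under that assumption, not part of the answer's approach; the merged pattern and the sample line are made up for illustration.

```php
<?php
// Hypothetical sample line; the real input comes from the scraped page.
$line = '<tr><td class="a">1</td>     <td>n0ise</td></tr>';

$line = preg_replace_callback(
    // Branch 1: an opening or closing tag. Branch 2: 4+ repeats of one character.
    '/<\/?(?:td|th|table|tr|div)\b[^>]*>|(.)\1{3,}/s',
    function (array $m): string {
        // Group 1 is only filled when the repeated-character branch matched.
        if (($m[1] ?? '') !== '') {
            return $m[1]; // collapse the run to a single copy
        }
        return '|';       // otherwise a tag matched
    },
    $line
);

echo $line, "\n"; // ||1| |n0ise||
```

Note that this runs in one pass, so runs of pipes produced by adjacent tags are not collapsed afterwards, unlike when the third pattern is applied as a separate step.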

$test = preg_replace("/\s*(<(table|tbody|tr|td|th|div)(.*?)>)*\s*([^<\s]+)\s*(<\/(table|tbody|tr|td|th|div)>)*\s*/m", "| $4 |", $test);

Using the link you provided, I took the HTML of the rankings table and was able to obtain what I think you're looking for:

| Rank || Level || Name || RemainExp || Race || 1 || 302 || n0ise || 220.301.329 || Aidian || 2 || 302 || ....

However, this will not handle excess whitespace inside values, for example if there were 3 spaces between "Remain" and "Exp". I found that whitespace between opening tags was fine, but whitespace in the last </td>, </tr> or </table> tags was not properly handled. It also mishandles unmatched tags, like <a>. This is why people are telling you to use a parser: unless you can strictly control the HTML source, it's probably going to throw you a curveball down the road. But don't let that stop you from practicing your regex if it's a quick one-off HTML scrape or some (non-production) situation where adding a full framework would be overkill.
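For the whitespace-inside-values case, a separate clean-up pass over each extracted value may be simpler than extending the main pattern. A minimal sketch, where normalize_cell is a hypothetical helper name and the sample value is made up:

```php
<?php
// Hypothetical helper: collapse any run of whitespace (spaces, tabs,
// newlines) to one space and strip it from both ends of the value.
function normalize_cell(string $value): string
{
    return trim(preg_replace('/\s+/', ' ', $value));
}

$clean = normalize_cell("  Remain   Exp \n");
echo $clean, "\n"; // Remain Exp
```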

Another tip: an easy way to remove HTML tags is to use jQuery to access an element's inner HTML and call the .text() function to strip out the tags. You might consider that if you don't need to process the text server-side.

Example: JsFiddle
