You can combine the first two rules to get the content between tags; the whitespace is where it gets tricky. A pattern can contain alternatives, but a plain preg_replace replacement string can't vary depending on which alternative matched. So you can say "match an HTML tag or excess whitespace and replace it with this one thing", but you can't say "when it's an HTML tag replace it with this, when it's whitespace replace it with that". The best you can do in a single rule is check for whitespace directly before or after tags.
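As an aside, if you are willing to swap the replacement string for a callback, PHP's preg_replace_callback can pick a different replacement per alternative. The snippet below is only a minimal sketch of that idea: the sample input and the choice of "|" and a single space as replacements are assumptions made for illustration, not the rankings table from the question.

```php
<?php
// Hypothetical sample input, not the real rankings table.
$html = "<tr><td>Rank</td>   <td>Level</td></tr>";

// One pattern, two alternatives: group 1 captures a tag,
// otherwise the match is a run of whitespace.
$result = preg_replace_callback(
    '/(<[^>]+>)|\s+/',
    function (array $m): string {
        // $m[1] is set and non-empty only when the tag alternative matched.
        return !empty($m[1]) ? '|' : ' ';
    },
    $html
);

echo $result, PHP_EOL; // prints ||Rank| |Level||
```

Sticking to a single replacement string instead, the rule looks like this: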
$test = preg_replace("/\s*(<(table|tbody|tr|td|th|div)(.*?)>)*\s*([^<\s]+)\s*(<\/(table|tbody|tr|td|th|div)>)*\s*/m", "| $4 |", $test);
Using the link you provided, I took the HTML of the rankings table and was able to obtain what I think you're looking for:
| Rank || Level || Name || RemainExp || Race || 1 || 302 || n0ise ||
220.301.329 || Aidian || 2 || 302 || ....
However, this will not handle excess whitespace inside values, for example if there were three spaces between "Remain" and "Exp". I also found that whitespace between opening tags was fine, but whitespace around the last </td>, </tr>, or </table> tags was not properly handled. It also mishandles unmatched tags, like <a>. This is why people tell you to use a parser: unless you can strictly control the HTML source, it's probably going to throw you a curveball down the road. But don't let that stop you from practicing your regex if it's a quick one-off HTML scrape or some (non-production) situation where adding a full framework would be overkill.
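For comparison, here is roughly what the parser route could look like with PHP's bundled DOMDocument class (it needs the dom extension, which is enabled in most builds). Treat it as a sketch under assumptions: the markup is a made-up stand-in for the rankings table, and the "| a || b |" row format is just my imitation of the output above.

```php
<?php
// Hypothetical sample markup standing in for the rankings table.
$html = '<table><tr><th>Rank</th><th>Level</th><th>Name</th></tr>'
      . '<tr><td>1</td><td>302</td><td> n0ise </td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$lines = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = [];
    foreach ($row->childNodes as $cell) {
        // Keep only <td>/<th> elements; skip whitespace text nodes between cells.
        if ($cell instanceof DOMElement && in_array($cell->nodeName, ['td', 'th'], true)) {
            // Collapse whitespace runs inside the value, then trim the ends.
            $cells[] = trim(preg_replace('/\s+/', ' ', $cell->textContent));
        }
    }
    $lines[] = '| ' . implode(' || ', $cells) . ' |';
}

echo implode("\n", $lines), PHP_EOL;
// prints:
// | Rank || Level || Name |
// | 1 || 302 || n0ise |
```

Because the parser deals with tag matching and stray whitespace between tags, the only regex left is the small one that collapses whitespace inside a value.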
Also, another tip I've found for easily removing HTML tags is to use jQuery to access an element's inner HTML and use the .text() function to strip out the tags. You might consider that if you don't need to process the text server-side.
Example: JsFiddle
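If you do need the text server-side, PHP's built-in strip_tags is probably the closest counterpart. A minimal sketch, assuming a made-up fragment; note that strip_tags leaves HTML entities such as &amp;amp; untouched, so you may still want html_entity_decode afterwards.

```php
<?php
// Hypothetical sample fragment.
$html = '<div class="row"><a href="#">n0ise</a>   <b>302</b></div>';

// strip_tags() removes the markup and leaves the text content behind.
$text = strip_tags($html);

// Collapse leftover whitespace runs and trim the ends.
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text, PHP_EOL; // prints n0ise 302
```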