RegEx in HTML split with preg-match

Question

I have a corrupt HTML page which I unfortunately can't parse with xml/xcode, so I came up with regex. I'm a regex beginner and I can't get the right result.

Source

<td>FIELD:</td> <td>VALUE<td>

I want to get the value, and this is where I'm stuck:

$regex = '{<td[^>]*<td>(.*?)</td>}';

Edit: as a result I want an array from which I can read the value; the value is all I'm interested in.

I'm thankful for every hint.

cheers endo

[The pony he comes...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) ... But anyway, what exactly do you expect that regex to do? — Niet the Dark Absol, Apr 24 '12 at 20:30
Just a thought, but why not correct the 'corrupt' page? Then JavaScript, and user-agents, will work better, and more consistently. Treat the disease, not the symptoms. — David Thomas, Apr 24 '12 at 20:31
Just a note, you need some delimiters on that regex `'/{.....}/'` — the_red_baron, Apr 24 '12 at 20:39
@the_red_baron - `{...}` works as delimiters in PHP. You can also use `~...~`, `/.../`, and I think there are some other styles. — Justin Morgan - On strike, Apr 24 '12 at 20:40

score 1 · Answer 1 · answered Apr 24 '12 at 20:42

Try this:

'{<td>.*?</td>\s+<td>(.*?)</td>}'

But note that your HTML sample is missing a / in the last closing tag. If, by corrupted, you mean missing slashes at closing tags, you can use this:

'{<td>.*?</?td>\s+<td>(.*?)</?td>}'

Here the slashes in the closing tags are optional.
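Since the question asks for an array holding the value, here is a minimal sketch of how the second pattern could be used with `preg_match_all`. The variable names are only illustrative, and the input is the single sample row from the question, not the real page.

```php
<?php
// Sample input copied from the question; note the unclosed final <td>.
$html = '<td>FIELD:</td> <td>VALUE<td>';

// Pattern from this answer: the slash in each closing tag is optional.
$regex = '{<td>.*?</?td>\s+<td>(.*?)</?td>}';

// preg_match_all stores the text of capture group 1 for every match
// in $matches[1], which is the array of values the question asks for.
$values = [];
if (preg_match_all($regex, $html, $matches)) {
    $values = $matches[1];
}

print_r($values);
```

This only holds for rows shaped exactly like the sample. Cells with attributes, line breaks inside a cell, or nested tags would need a different pattern.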

answered Apr 24 '12 at 20:42 by Israel Unterman

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

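There are some immediately visible problems with your regex; for example, `<td[^>]*<td>` doesn't do what you think it does (`[^>]*` cannot cross the `>` that closes the first tag, so the pattern never reaches the second `<td>`). But rather than suggest a different regex, let me urge you to do the sanest thing: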

Don't use regex for this!

Trust me. Don't do it. Others will come in here and suggest new regex patterns, and their patterns will all be wrong. Regex isn't even up to the task of parsing clean HTML/XML, so trying to use it on arbitrarily corrupted code is a recipe for madness. Try HTML Tidy, which is made for this sort of thing. Depending on what's wrong with the HTML, a parser like HtmlPurifier or Beautiful Soup might also be able to work with it.
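As a rough illustration of the parser route in PHP, the sketch below uses the bundled `DOMDocument` class. This is not one of the tools named above; it is swapped in here only because it ships with most PHP installations and tolerates a fair amount of broken markup. The `<table><tr>` wrapper and the rule "a cell ending in a colon is a label" are assumptions based on the one sample row in the question.

```php
<?php
// Sample row from the question, wrapped in a table so the parser has context.
// The wrapper markup is an assumption; the real page is not shown.
$html = '<table><tr><td>FIELD:</td> <td>VALUE<td></tr></table>';

// Collect libxml warnings about the malformed markup instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the trimmed text of every cell, in document order.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}

// Pair each label cell (text ending in ":") with the cell that follows it.
$values = [];
foreach ($cells as $i => $text) {
    if (substr($text, -1) === ':' && isset($cells[$i + 1])) {
        $values[rtrim($text, ':')] = $cells[$i + 1];
    }
}

print_r($values);
```

For the sample row, `$values['FIELD']` should hold `VALUE`. How well this copes with the real page depends on what exactly is broken in it, so running the markup through a repair tool first may still be necessary.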

It may seem like a little more effort, but you'll save yourself time in the long run.

edited Jun 20 '20 at 09:12 by Community
answered Apr 24 '12 at 20:38 by Justin Morgan - On strike

convinced to not use regex ^^ – endo.anaconda Apr 24 '12 at 20:59
Don't be convinced to not-use regex; but be prepared for it to fail, and go spectacularly wrong, if you're using it in the wrong place, or for the wrong reasons. For this use-case, seriously: *don't*, for others, matching and/or replacing pieces of a string with a different piece of a string, or whatever, regex is *awesomely* powerful. – David Thomas Apr 24 '12 at 21:43
@DavidThomas - I agree with you, actually. I'm a big fan of regex and consider it extremely useful, even for certain tasks involving HTML. However, it's *definitely* not suited for what the asker is trying to do. My answer isn't "never use regex and HTML together," it's "avoid using regex to parse HTML, let alone HTML that isn't even well-formed." – Justin Morgan - On strike Apr 25 '12 at 14:04
