
I have a corrupt HTML page which I unfortunately can't parse with xml/xcode, so I turned to regex. I'm a regex beginner and I can't get the right result.

Source

<td>FIELD:</td> <td>VALUE<td>

I want to get the value, and this is where I'm stuck:

$regex = '{<td[^>]*<td>(.*?)</td>}';

Edit: as a result I want an array from which I can read the value; I'm only interested in the value itself.

I'm thankful for every hint.

Cheers, endo

endo.anaconda

2 Answers

Try this:

'{<td>.*?</td>\s+<td>(.*?)</td>}'

But you missed a / in the HTML text. If, by corrupted, you mean missing slashes at closing tags, you can use this:

'{<td>.*?</?td>\s+<td>(.*?)</?td>}'

Here the slashes in the closing tags are optional.
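To get the array asked for in the question, the tolerant pattern can be passed to preg_match_all. This is only a minimal sketch: the second row (OTHER: / SECOND) is a made-up sample added to show more than one match, and the pattern will still break on nested tags, attributes, or cells that span several lines.

```php
<?php
// First row is copied from the question (its last cell ends with <td>
// instead of </td>); the second row is a hypothetical, well-formed sample.
$html = "<td>FIELD:</td> <td>VALUE<td>\n<td>OTHER:</td> <td>SECOND</td>";

// Pattern from this answer: the slash in each closing tag is optional.
$regex = '{<td>.*?</?td>\s+<td>(.*?)</?td>}';

// $matches[0] holds the full matches, $matches[1] the captured values.
preg_match_all($regex, $html, $matches);
$values = $matches[1];

print_r($values);
```

With this input, $values contains 'VALUE' and 'SECOND'.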

Israel Unterman

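There are some immediately visible problems with your regex; for example, <td[^>]*<td> doesn't do what you think it does: [^>]* cannot match the > that closes the first tag, so this part never gets as far as the second <td>. But rather than suggest a different regex, let me urge you to do the sanest thing: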

Don't use regex for this!

Trust me. Don't do it. Others will come in here and suggest new regex patterns, and their patterns will all be wrong. Regex isn't even up to the task of parsing clean HTML/XML, so trying to use it on arbitrarily corrupted code is a recipe for madness. Try HTML Tidy, which is made for this sort of thing. Depending on what's wrong with the HTML, a parser like HtmlPurifier or Beautiful Soup might also be able to work with it.

It may seem like a little more effort, but you'll save yourself time in the long run.
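As a rough illustration of the parser route, here is a sketch using PHP's built-in DOMDocument and DOMXPath rather than the tools named above (that substitution is mine, as is the assumption that the cells sit inside a table row on the real page). How well the parser recovers depends on what exactly is broken in the markup, so treat this as a starting point, not a guaranteed fix.

```php
<?php
// Assumption: on the real page the cells are wrapped in a table row.
// The unclosed last cell is kept exactly as in the question.
$html = '<table><tr><td>FIELD:</td> <td>VALUE<td></tr></table>';

// Collect parse warnings internally instead of printing them
// for every piece of broken markup.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// For each cell whose text is the label, take the next cell in the same row.
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//td[normalize-space(.)="FIELD:"]/following-sibling::td[1]');

$values = [];
foreach ($cells as $cell) {
    $values[] = trim($cell->textContent);
}

print_r($values);
```

Selecting by label text instead of by position keeps the lookup readable, and the parser takes care of the missing closing slash.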

Justin Morgan - On strike
  • convinced to not use regex ^^ – endo.anaconda Apr 24 '12 at 20:59
  • Don't be convinced to not use regex, but be prepared for it to fail, and go spectacularly wrong, if you're using it in the wrong place or for the wrong reasons. For this use-case, seriously: *don't*. For others, such as matching and/or replacing pieces of a string with a different piece of a string, regex is *awesomely* powerful. – David Thomas Apr 24 '12 at 21:43
  • @DavidThomas - I agree with you, actually. I'm a big fan of regex and consider it extremely useful, even for certain tasks involving HTML. However, it's *definitely* not suited for what the asker is trying to do. My answer isn't "never use regex and HTML together," it's "avoid using regex to parse HTML, let alone HTML that isn't even well-formed." – Justin Morgan - On strike Apr 25 '12 at 14:04