<tr bgcolor='#C0C0C0'>
   <td>ID</td><td>personName</td>
   <td>homePhone<br>officePhone</td>
   <td>city</td>
   <td>street</td>
</tr>

OK, so this is a piece of HTML I get as a web response, and I need to parse it with groups to extract the ID, personName, city, homePhone, officePhone and street.

Can anyone give me a regex pattern for this? I've been trying for hours and I can't see where I'm going wrong. Also, is there a good tool for building regular expressions? Running the application over and over again is a pain. Thanks.

Andrew Barber
Ryan
  • 5
    First off, RegEx is a poor choice as an HTML parser. You should use an HTML parser for your platform and language. Secondly, what language/platform _are_ you using? RegEx dialects can be quite different. – Oded Dec 19 '10 at 20:03
  • 4
    Read here for detailed explanation regarding your problem: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Gabi Purcaru Dec 19 '10 at 20:04
  • http://www.regexbuddy.com/ is my tool of choice. It still won't save you from going insane if you parse HTML with regular expressions. – TrueWill Dec 19 '10 at 21:02

1 Answer

(Assuming .NET ...)

This should do it:

(?s:<tr.*?>(?:.*?<td.*?>(?<content>.*?)</td>)*)

That extracts these values:

  • ID
  • personName
  • homePhone<br>officePhone
  • city
  • street

It will return one match, with one group, with multiple captures.

For example, this code will write each value to the console.

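using System;
using System.Text.RegularExpressions;

var input = "<tr bgcolor='#C0C0C0'><td>ID</td>\n<td>personName</td>\n<td>homePhone\n<br>officePhone</td>\n<td>city</td>\n<td>street</td></tr>";
var pattern = "(?s:<tr.*?>(?:.*?<td.*?>(?<content>.*?)</td>)*)";

// Regex.Match returns the first match; each <td> body is one capture of the "content" group.
var match = Regex.Match(input, pattern);

// The loop variable is typed as Capture because CaptureCollection is not generic
// on older .NET Framework versions, where "var" would infer object.
foreach (Capture capture in match.Groups["content"].Captures)
    Console.WriteLine(capture.Value);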

It will work with any number of cells. It ignores text, new lines and whitespace between cells. It ignores any attributes on the row or cell.
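
The question also asks for homePhone and officePhone as separate values, and a real response probably contains more than one row. The following is a minimal sketch of one way to handle both, not the pattern from this answer: it matches each row first, then the cells inside that row, then splits a cell on `<br>`. The two-step approach is used because the repeated group in the pattern above can skip over `</tr>`, so a single match would also collect cells from the rows that follow. `RowParser`, `ParseRows` and the sample data row are hypothetical names and values made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowParser
{
    // Hypothetical helper: returns one list of cell values per <tr>...</tr>.
    public static List<List<string>> ParseRows(string html)
    {
        // Singleline lets '.' cross line breaks; IgnoreCase accepts <TR>, <BR> etc.
        var options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        var rows = new List<List<string>>();

        // Step 1: isolate each row so cells from different rows stay apart.
        foreach (Match row in Regex.Matches(html, @"<tr\b.*?>(?<row>.*?)</tr>", options))
        {
            var cells = new List<string>();

            // Step 2: pull the cells out of this row only.
            foreach (Match cell in Regex.Matches(row.Groups["row"].Value,
                                                 @"<td\b.*?>(?<content>.*?)</td>", options))
            {
                // Step 3: a cell like "homePhone<br>officePhone" becomes two values.
                foreach (string part in Regex.Split(cell.Groups["content"].Value,
                                                    @"<br\s*/?>", options))
                {
                    cells.Add(part.Trim());
                }
            }

            rows.Add(cells);
        }

        return rows;
    }

    public static void Main()
    {
        // Header row from the question plus one made-up data row.
        string html =
            "<tr bgcolor='#C0C0C0'><td>ID</td><td>personName</td>\n" +
            "<td>homePhone<br>officePhone</td><td>city</td><td>street</td></tr>\n" +
            "<tr><td>1</td><td>Jane</td><td>555-0100<br>555-0101</td>" +
            "<td>Springfield</td><td>Main St</td></tr>";

        foreach (List<string> row in ParseRows(html))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

This still inherits the usual weaknesses of regex-based HTML handling (nested tables, `>` inside attribute values), which is why the comments on the question recommend an HTML parser.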

I use this tool for working with regular expressions: http://www.radsoftware.com.au/regexdesigner/

Tatham Oddie
  • Thank you. This works great! But could you please explain what .*? does, and the meaning of s: and why doesn't .*? match the > immediately after? Can't find this anywhere. – Ryan Dec 20 '10 at 09:12
  • (?s: expression ) is an options modifier that puts the enclosed expression in 'single line mode'. In this mode, the "." character matches all characters *including* new lines. The "*?" means a lazy match. That is, it captures as few characters as it needs to. So, something like ".*?>" will match as few characters as possible, up to and including the first ">". – Tatham Oddie Jan 29 '11 at 12:50
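
To make the explanation in the comment above concrete, here is a small sketch comparing lazy and greedy matching, with and without single line mode. A lazy quantifier first tries to match zero characters and only takes one more character each time the rest of the pattern fails, which is why `.*?>` stops at the first `>` even though `.` is able to match `>`. `LazyVsGreedy` and `FirstMatch` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedy
{
    // Hypothetical helper: returns the text of the first match, or null if none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string input = "<td class='x'>ID</td>\n<td>personName</td>";

        // Lazy: stops at the first '>' it can reach.
        Console.WriteLine(FirstMatch(input, "<td.*?>"));      // <td class='x'>

        // Greedy, default mode: '.' stops at the newline, so the match
        // ends at the last '>' on the first line.
        Console.WriteLine(FirstMatch(input, "<td.*>"));       // <td class='x'>ID</td>

        // Greedy, single line mode: '.' also matches the newline, so the
        // match runs to the last '>' in the whole input.
        Console.WriteLine(FirstMatch(input, "(?s:<td.*>)"));
    }
}
```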