1

I am trying to determine in which column the name "Phone" appears by checking the HTML of a web page. The string I am searching looks like this:

<tr class="C1">
<td>Name</td>
<td>Address</td>
...
...   < some more columns, but their number is not fixed >
...
<td>Phone</td>
...
...    <more columns>
...
</tr>

Is it possible to determine this using regular expressions?

Wartin

2 Answers

1

From the viewpoint of theoretical computer science, it is not possible: tables can be nested, and regular expressions generally cannot cope with nested structures. Analysing the structure of an HTML text requires a Type-2 (context-free) grammar in the Chomsky hierarchy, i.e. a parser; HTML is not Type-3, i.e. not regular.

From a practical viewpoint, however, if you assume that the tables are not nested, you could use a regex to extract the table rows (something like <tr[^>]*>((?!</tr>).)*</tr>, with the dot-matches-newline option enabled), then match the cells within each row (something like <td[^>]*>((?!</td>).)*</td>) to produce a list of columns, and search that list for an entry containing the string "Phone".
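The steps above could be sketched in Python roughly as follows. This is only a minimal illustration under the same assumption (no nested tables); the function name find_column and the sample row are made up for the example, and it compares the cell text for equality rather than doing a substring search.

```python
import re

# Assumes non-nested tables: lazily match from <tr ...> to the nearest </tr>.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
# Cells are <td> or <th>, again matched lazily up to the nearest closing tag.
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)
# Used to strip any inline markup (e.g. <b>) left inside a cell.
TAG_RE = re.compile(r"<[^>]+>")


def find_column(html, header):
    """Return the 0-based index of the first cell equal to `header`, or -1.

    Hypothetical helper: scans rows in order and stops at the first row
    that contains a matching cell.
    """
    for row in ROW_RE.findall(html):
        cells = [TAG_RE.sub("", cell).strip() for cell in CELL_RE.findall(row)]
        if header in cells:
            return cells.index(header)
    return -1


# Made-up sample row in the shape shown in the question.
sample = """<tr class="C1">
<td>Name</td>
<td>Address</td>
<td>Email</td>
<td>Phone</td>
<td>Fax</td>
</tr>"""

print(find_column(sample, "Phone"))
```

If a cell contains a nested table, the lazy match stops at the inner closing tag and the column list is wrong, which is exactly the limitation described above.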

phynfo
1

Tough task. I'm referring you to various posts that explain why parsing HTML with regex is (virtually) impossible:

  1. RegEx match open tags except XHTML self-contained tags
  2. https://stackoverflow.com/a/590789/290343
  3. https://stackoverflow.com/a/133684/290343
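For comparison, here is a rough sketch of the parser-based route, using Python's standard html.parser module. It is not taken from the linked posts; the class FirstRowCells and the helper column_of are hypothetical names, and the sketch assumes reasonably well-formed markup in which every cell has a closing tag.

```python
from html.parser import HTMLParser


class FirstRowCells(HTMLParser):
    """Collects the text of each top-level cell in the first table row."""

    def __init__(self):
        super().__init__()
        self.cells = []       # text of each top-level cell, in order
        self._cell_depth = 0  # > 0 while inside a <td>/<th>; nested cells raise it
        self._buf = []        # text fragments of the cell being read
        self._done = False    # set once the first row has been closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag in ("td", "th"):
            self._cell_depth += 1
            if self._cell_depth == 1:
                self._buf = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ("td", "th") and self._cell_depth > 0:
            self._cell_depth -= 1
            if self._cell_depth == 0:
                self.cells.append("".join(self._buf).strip())
        elif tag == "tr" and self._cell_depth == 0 and self.cells:
            # A </tr> seen outside any cell closes the outer row.
            self._done = True

    def handle_data(self, data):
        if not self._done and self._cell_depth > 0:
            self._buf.append(data)


def column_of(html, header):
    """Return the 0-based index of `header` in the first row, or -1."""
    parser = FirstRowCells()
    parser.feed(html)
    parser.close()
    return parser.cells.index(header) if header in parser.cells else -1


# Made-up row with a nested table in the second cell.
nested = ("<tr><td>Name</td>"
          "<td><table><tr><td>inner</td></tr></table></td>"
          "<td>Phone</td></tr>")
print(column_of(nested, "Phone"))
```

Because the parser tracks how deep it is inside cells, a table nested in one cell counts as part of that cell instead of shifting the column numbering.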
Ofer Zelig