Help with a regular expression pattern to extract some text from HTML in C#

Question

I have this HTML block:

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">nanana<span>bababa</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>


<tr>
<th colspan="2" valign="middle">Some other text</th>
</tr>
<tr>
<td class="row1">(this text needs to be extracted)</td>
<td class="row2"><input name="myUniqueInput"></td>
</tr>

<tr>
<th colspan="2" valign="middle">some text</th>
</tr>
<tr>
<td class="row1">lalala<span>dadada</span></td>
<td class="row2"><input name="unwantedinput"></td>
</tr>

What I need is to extract only the text shown as "(this text needs to be extracted)". Here is what I've done so far:

<th[^>]*>(.*?)<input[^>]*name="myUniqueInput"[^>]*>

The problem with this pattern is that it matches everything from the first <th> up to "myUniqueInput". Any idea how to fix this? Thanks in advance.

Duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), [Regular expression to find a value in a webpage](http://stackoverflow.com/questions/2393176/regular-expression-to-find-a-value-in-a-webpage) and too many others to count. — outis, Apr 30 '11 at 09:05

score 1 · Accepted Answer · answered Apr 30 '11 at 09:06

/<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"/

You can match more or less depending on how well you know what the HTML will look like. The idea is to anchor on the <td> that holds the input with the unique name, then capture everything between the preceding <td> and </td>.
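As a rough illustration of how the pattern above could be used from C#, here is a minimal sketch. The `/…/` delimiters are dropped because .NET's `Regex` takes the bare pattern. The class name and the trimmed-down HTML string are assumptions made for the example, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

class ExtractCell
{
    static void Main()
    {
        // Trimmed-down copy of the HTML from the question (assumed input).
        string html = @"<tr>
<td class=""row1"">lalala<span>dadada</span></td>
<td class=""row2""><input name=""unwantedinput""></td>
</tr>
<tr>
<td class=""row1"">(this text needs to be extracted)</td>
<td class=""row2""><input name=""myUniqueInput""></td>
</tr>";

        // The answer's pattern, written as a verbatim string ("" is an escaped quote).
        string pattern =
            @"<td[^>]*>([^<]*)<[^>]*>\s*<td[^>]*>\s*<input[^>]*name=""myUniqueInput""";

        Match m = Regex.Match(html, pattern);

        // Group 1 holds the text of the cell that precedes the matching input.
        Console.WriteLine(m.Success ? m.Groups[1].Value : "no match");
    }
}
```

Because the capture group is `[^<]*`, the first row is skipped: its cell contains a `<span>`, so the pattern cannot reach the next `<td>` from there. The same property means the pattern stops working as soon as the target cell itself contains nested tags.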

Johan Soderberg

I need it to be more dynamic. For example, if I want it to match "(this text needs to be extracted) ", it will fail. – Desolator Apr 30 '11 at 09:38
/<td[^>]*>(.*?)<\/td>\s*<td[^>]*>\s*<input[^>]*name="myUniqueInput"/ This will get it with any nested tags included, so the result needs to be filtered afterwards. How dynamic does it have to be? :) – Johan Soderberg Apr 30 '11 at 09:44
Well, it partially works with the HTML I have. Anyway, thanks for the idea. I will try to make the pattern more dynamic so it matches everything I need. Thanks for your help :) – Desolator Apr 30 '11 at 10:05
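The "capture the whole cell, then filter" idea from the comments could look roughly like the sketch below. Two things here are assumptions rather than anything stated in the thread: the sample cell containing a `<b>` tag, and the use of a tempered dot `(?:(?!</td>).)*` instead of `(.*?)`. The tempered dot is swapped in so that the capture cannot run past a closing `</td>` even with `RegexOptions.Singleline` enabled.

```csharp
using System;
using System.Text.RegularExpressions;

class ExtractAndStrip
{
    static void Main()
    {
        // Hypothetical input: the target cell now contains nested markup.
        string html = @"<tr>
<td class=""row1"">lalala<span>dadada</span></td>
<td class=""row2""><input name=""unwantedinput""></td>
</tr>
<tr>
<td class=""row1"">(this text <b>needs</b> to be extracted)</td>
<td class=""row2""><input name=""myUniqueInput""></td>
</tr>";

        // (?:(?!</td>).)* consumes one character at a time and refuses to
        // step over a closing </td>, so group 1 stays inside a single cell.
        string pattern =
            @"<td[^>]*>((?:(?!</td>).)*)</td>\s*<td[^>]*>\s*<input[^>]*name=""myUniqueInput""";

        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);
        if (m.Success)
        {
            // Second pass: drop anything that looks like a tag from the captured cell.
            string text = Regex.Replace(m.Groups[1].Value, "<[^>]+>", "");
            Console.WriteLine(text);
        }
        else
        {
            Console.WriteLine("no match");
        }
    }
}
```

Stripping tags with `<[^>]+>` is itself a shortcut: it assumes no `>` appears inside attribute values, which holds for the sample but not for arbitrary HTML.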

score 0 · Answer 2 · answered Apr 30 '11 at 08:57

It's generally accepted that regular expressions aren't expressive enough to parse HTML correctly. Have you considered using a library to parse the HTML for you, and then extracting the data from there?
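For comparison, here is a sketch of the parser-based route, assuming the third-party HTML Agility Pack NuGet package is installed. The answer itself does not name a library, so that choice, the XPath expression, and the wrapping `<table>` element are all assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party package, assumed to be referenced by the project

class ExtractWithParser
{
    static void Main()
    {
        // Trimmed-down copy of the question's HTML, wrapped in <table> for the example.
        string html = @"<table>
<tr>
<td class=""row1"">lalala<span>dadada</span></td>
<td class=""row2""><input name=""unwantedinput""></td>
</tr>
<tr>
<td class=""row1"">(this text needs to be extracted)</td>
<td class=""row2""><input name=""myUniqueInput""></td>
</tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Find the input by name, walk up to its nearest row,
        // then select that row's cell with class "row1".
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//input[@name='myUniqueInput']/ancestor::tr[1]/td[@class='row1']");

        // InnerText drops nested tags, so no second filtering pass is needed.
        Console.WriteLine(cell != null ? cell.InnerText.Trim() : "not found");
    }
}
```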

Brian Willis

As for a library to parse HTML, you can use the HTML Agility Pack (http://htmlagilitypack.codeplex.com/). It is .NET specific :) – Ankur Apr 30 '11 at 09:00
Thanks for the answer. I know that there are HTML parsers, but I'm not interested in that right now. I just need a quick and dirty solution. – Desolator Apr 30 '11 at 09:00
I don't agree that "It's generally accepted that regular expressions aren't expressive enough to parse HTML correctly". You are probably conflating the bracket-matching problem, which can't be solved with "standard" regular expressions, with the more general case of web scraping. Regular expressions are quite sufficient for web scraping in many cases. And even bracket matching can be achieved with certain regular expression implementations, for example the one in the .NET Framework. – Andrew Savinykh Apr 30 '11 at 09:14
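To make the last comment's point about .NET concrete, here is a small sketch that uses balancing groups to check whether parentheses are balanced. The pattern follows the commonly documented balancing-group idiom. The group name `open` and the test strings are chosen for this example and do not come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

class BalancedBrackets
{
    static void Main()
    {
        // The group "open" acts as a stack: (?<open>\() pushes a capture for
        // each '(', and (?<-open>\)) pops one for each ')'.
        // (?(open)(?!)) forces a failure if anything is still on the stack.
        string pattern =
            @"^[^()]*(?:(?:(?<open>\()[^()]*)+(?:(?<-open>\))[^()]*)+)*(?(open)(?!))$";

        foreach (string input in new[] { "a(b(c)d)e", "a(b(c)d", "a)b(" })
        {
            // Only the first string has balanced parentheses.
            Console.WriteLine("{0} -> {1}", input, Regex.IsMatch(input, pattern));
        }
    }
}
```

This only shows that nesting can be tracked. It is not a claim that the same approach gives a complete HTML parser.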
