Extract text from multiline HTML using Regex

Question

I'm trying to extract some text from an HTML file.

This is a sample of the part that is giving me a headache:

<TD>
      Adresa instalacije:
  </TD>
  <TD COLSPAN=2>

    <TABLE border=0 cellpadding=3 cellspacing="1" bgcolor="#AAAA77" width="100%">
      <TR bgcolor="#FFFFCC">
        <TD COLSPAN=2><B>SOME TEXT</B></TD>
      </TR>
      <TR bgcolor="#FFFFCC">
        <TD>ADM &#353;ifra: </TD>
        <TD><B>914122</B></TD>
      </TR>
    </TABLE>
  </TD>

The part I want to extract is between

 <TD COLSPAN=2><B> </B></TD>

And this is my regex:

var regexAdresa = @"<TD>Adresa korisnika:</TD><TD COLSPAN=2>";
regexAdresa += @"<TABLE border=0 cellpadding=3 cellspacing=""1"" bgcolor=""#AAAA77"" width=""100%"">";
regexAdresa += @"<TR bgcolor=""#FFFFCC"">";
regexAdresa += @"<TD><B>(.*?)</B></TD>";
regexAdresa += @"</TR></TABLE></TD>";

var r0 = new Regex(regexAdresa);
var rr0 = r0.Match(text);
var res0 = rr0.Groups[1].ToString();

The match always returns 0 results. Am I doing something wrong?

You can't just pretend the whitespace doesn't exist. Regexes match the characters you tell them to match; they don't say "Oh, this looks like HTML, let's see, what are the parsing rules for HTML..." A proper HTML parser would be happy to ignore whitespace for you, though. — 15ee8f99-57ff-4f92-890c-b56153, May 30 '17 at 14:20
Surprised [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) hasn't been linked yet. — Saravana, May 30 '17 at 14:23
Any recommendations for an HTML parser? This is a small HTML file and I just need to extract a few pieces of information. I can get most of them, but this one is pretty hard. — Josef, May 30 '17 at 14:35
@Josef No, this one is very easy. I told you the answer fifteen minutes ago. The regex answer, I mean -- in addition to the "proper" answer, which is to use an HTML parser. — 15ee8f99-57ff-4f92-890c-b56153, May 30 '17 at 14:36
As others have mentioned, you can use some HTML parser like [HTMLAgilityPack](http://html-agility-pack.net/) or something. For this specific example, you might be able to use a regex like `[^<>]+<\/B><\/TD>`, check [here](https://regex101.com/r/E29P7X/2) — Arghya C, May 30 '17 at 14:53
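
The whitespace point raised in the comments can be sketched roughly as follows. This is only an illustration, assuming the input looks like the sample in the question and that the label is "Adresa instalacije:" as in that sample; the variable names (`html`, `pattern`, `address`) are made up for the example, and it uses top-level statements, so it needs C# 9 or later. The idea is to put `\s*` wherever the file may contain spaces or line breaks, and `[^>]*` where tag attributes don't matter.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input mirroring the question's HTML, line breaks and indentation included.
string html = @"<TD>
      Adresa instalacije:
  </TD>
  <TD COLSPAN=2>

    <TABLE border=0 cellpadding=3 cellspacing=""1"" bgcolor=""#AAAA77"" width=""100%"">
      <TR bgcolor=""#FFFFCC"">
        <TD COLSPAN=2><B>SOME TEXT</B></TD>
      </TR>
    </TABLE>
  </TD>";

// \s* skips the spaces and newlines between tags;
// [^>]* skips whatever attributes a tag happens to carry.
string pattern =
    @"<TD>\s*Adresa instalacije:\s*</TD>\s*" +
    @"<TD[^>]*>\s*<TABLE[^>]*>\s*<TR[^>]*>\s*" +
    @"<TD[^>]*>\s*<B>(.*?)</B>";

Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);

// Group 1 holds the text between <B> and </B> of the first cell.
string address = match.Success ? match.Groups[1].Value : string.Empty;
Console.WriteLine(address); // prints "SOME TEXT"
```

A pattern like this is still brittle: any change in the table layout breaks it, which is why the comments recommend an HTML parser for anything beyond a one-off extraction.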

score 2 · Answer 1 · answered May 30 '17 at 15:02

I'd use PhantomJS: it's invisible to the user and parses the entire DOM, giving you access via Selenium. To access <TD COLSPAN=2><B> </B></TD>:

var text = driver.FindElement(By.CssSelector("td[colspan='2'] b")).Text;

Warning: code not tested, given as an example only.

For further information on using the By locator, see the Selenium documentation.

score 0 · Answer 2 · answered May 30 '17 at 15:15

Thanks to all, especially to @Arghya C.

I've tried something, and for now this satisfies my needs. Maybe it's not the best solution, but it works:

var regexAdresa = @"<TD (COLSPAN=[1-9]+)?><B>[^<>]+<\/B><\/TD>";
Regex g = new Regex(regexAdresa);
Match m = g.Match(text);
if (m.Success)
{
    MessageBox.Show(m.ToString());
    MessageBox.Show(Regex.Replace(m.ToString(), "<.*?>", String.Empty));
}

The first match gives me the line containing the text I want, and in a second step another regex removes the HTML tags.
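
As a possible variation (a sketch, not the code from this answer), a capture group can return the inner text directly, so the second tag-stripping step is not needed. Note that the pattern below is deliberately changed from the one above: the COLSPAN attribute is made fully optional, so plain `<TD>` cells match too. The names `cellPattern`, `firstText` and `allTexts` are invented for the example, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Two rows shaped like the ones in the question.
string html =
    "<TR><TD COLSPAN=2><B>SOME TEXT</B></TD></TR>\n" +
    "<TR><TD>ADM &#353;ifra: </TD><TD><B>914122</B></TD></TR>";

// ([^<>]+) is capture group 1: the text between <B> and </B>.
// (?:\s+COLSPAN=\d+)? makes the attribute optional without capturing it.
var cellPattern = new Regex(
    @"<TD(?:\s+COLSPAN=\d+)?>\s*<B>([^<>]+)</B>\s*</TD>",
    RegexOptions.IgnoreCase);

// First bold cell only, as in the answer above.
Match first = cellPattern.Match(html);
string firstText = first.Success ? first.Groups[1].Value : string.Empty;
Console.WriteLine(firstText); // prints "SOME TEXT"

// Every bold cell, in document order.
string[] allTexts = cellPattern.Matches(html)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();
Console.WriteLine(string.Join(", ", allTexts)); // prints "SOME TEXT, 914122"
```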
