I'm new to regular expressions (C#). I need to get the brand names out of an HTML document. I'm using

 MatchCollection m1 = Regex.Matches(html,"<td>.+?</td>",RegexOptions.Singleline);

and the result is 108 matches similar to the following, each containing a different brand name (Acer in this case).

<td><a href=acer-phones-59.php>
<img src="http://cdn2.gsmarena.com/vv/logos/lg_acer.gif" 
width=92 height=22 border=0 alt="Acer"></a></td>
<td><a href=acer-phones-59.php>Acer phones (89)</a></td>

I need the word "Acer" only once, and "acer-phones-59.php" only once. How can I adjust my expression to get the brand name and the link target (the href value) from each match? Any help would be greatly appreciated, thank you.

Smiel
  • While you are waiting for somebody to write your regex, you should read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – dognose Sep 15 '15 at 12:54
  • Use HtmlAgilityPack. Although it has some peculiar bugs if you want to manipulate HTML code, it is quite reliable for just Web scraping. – Wiktor Stribiżew Sep 15 '15 at 12:54
  • Just FYI: no one will be able to answer your question. Rephrase it, specify how one can detect the elements containing your required texts, and then perhaps, there will come an answer. – Wiktor Stribiżew Sep 15 '15 at 14:13

1 Answer

MatchCollection matches = Regex.Matches(inputString, @"<td>(.|\n)+?href=(.+?)>(.|\n)+?alt=""(.+?)""", RegexOptions.None);

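For each match, the href value (e.g. acer-phones-59.php) is in group 2 and the brand name from the alt attribute (e.g. Acer) is in group 4.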
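
One caveat with a loose pattern like the one above: a match can start at a text-only cell and run on into the next brand's logo cell, pairing the wrong href with the wrong alt text. A tighter variant is sketched below. It only matches cells whose link directly wraps an <img>, so each brand should come out once. The class name BrandExtractor, the method Extract and the group names href and brand are made up for this illustration, and the pattern assumes unquoted href values exactly as in the sample from the question; it is not a general HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BrandExtractor
{
    // Matches only <td> cells whose link wraps an <img>.
    // [^>]* cannot run past the end of the <img> tag, and [^\s>]+ assumes
    // an unquoted href value, as in the sample HTML from the question.
    private static readonly Regex LogoCell = new Regex(
        @"<td>\s*<a\s+href=(?<href>[^\s>]+)>\s*<img\b[^>]*\balt=""(?<brand>[^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns one (brand, href) pair per logo cell found in the input.
    public static List<(string Brand, string Href)> Extract(string html)
    {
        var result = new List<(string Brand, string Href)>();
        foreach (Match m in LogoCell.Matches(html))
        {
            result.Add((m.Groups["brand"].Value, m.Groups["href"].Value));
        }
        return result;
    }

    public static void Main()
    {
        // Sample copied from the question: one logo cell, one text cell.
        string html =
            "<td><a href=acer-phones-59.php>\n" +
            "<img src=\"http://cdn2.gsmarena.com/vv/logos/lg_acer.gif\" \n" +
            "width=92 height=22 border=0 alt=\"Acer\"></a></td>\n" +
            "<td><a href=acer-phones-59.php>Acer phones (89)</a></td>";

        foreach (var item in Extract(html))
        {
            Console.WriteLine(item.Brand + " -> " + item.Href);
        }
    }
}
```

For the sample above this prints a single line, Acer -> acer-phones-59.php, because the second cell contains no <img> and is skipped. If the real page quotes its href values or nests other tags between <a> and <img>, the pattern would need adjusting, which is one reason the comments recommend an HTML parser instead.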

Derek