
I have a problem extracting a string from HTML code (it's basically a problem with a regular expression). Here's the code:

string wheretosearch = @"
<td class=""name"">
<div>
<a href=""/addr1.html"" class=""link "">
<span>Title1</span>
</a></td>

[some code]

<td class=""name"">
<div>
<a href=""/addr2.html"" class=""link "">
<span>Title2</span>
</a></td>";

I want to extract the titles between the tags. My problem is that I cannot express an unknown number of characters in the regex (the .* section after td class=""name""):

<td class=""name"">.*<span>(?<title>.*)</span>

To put it simply: I want the regex to find <td class=""name"">, then, after an unknown number of characters, find the first occurrence of <span> and take the value between that first <span> and </span>.

What it actually does is take the last occurrence of <span>, so it gives only the last title.

EDIT:

Okay, besides the HTML issue, the problem is like this: I've got a string:

"This is a text: NICE. This is a great text: NICE TOO."

I would like to match "This", then an unknown number of characters, then the string between ": " and ".". How could this be done?

Of course I'm interested in each occurrence of that pattern, so the output would be a collection containing "NICE" and "NICE TOO".

For an expression like "This.*(?<title>.*)." I get only the "NICE TOO" string; as @urlreader mentioned, it finds the longest possible match.

Jarzyn
  • 7
    Ahem ... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – driis Nov 06 '12 at 21:31
  • 3
    Not a good idea to use regex for html parsing. Use [Html Agility Pack](http://htmlagilitypack.codeplex.com/) – Steve Nov 06 '12 at 21:32
  • +1 for the Agility pack, works pretty great, gobbles almost any crap you throw at it. – flq Nov 06 '12 at 21:38
  • Thanks for the tips of the HTML issue, but like in EDIT, it's more regex issue I think than HTML one, though I would not process HTML with regex any more ;) – Jarzyn Nov 06 '12 at 21:53

2 Answers

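<td class=""name"">.*?<span>(?<title>.*?)</span>

This is because regex quantifiers are greedy by default: .* matches the longest possible string. Adding ? after the quantifier (.*?) makes it lazy, so it stops at the first possible match. The capture group needs to be lazy as well, and because the HTML spans several lines, the pattern should be used with RegexOptions.Singleline so that . also matches newlines.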
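
As a rough illustration, here is a minimal C# sketch of the lazy-quantifier approach. It assumes the input looks like the wheretosearch string from the question; the class and method names (TitleExtractor, ExtractTitles) are made up for this example. Both quantifiers are lazy (.*?), and RegexOptions.Singleline is passed so that . can match the line breaks in the HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class TitleExtractor
{
    // .*? is lazy: it expands only as far as needed to reach the next <span>.
    // Singleline makes '.' match '\n', which the multi-line HTML requires.
    private static readonly Regex TitleRegex = new Regex(
        @"<td class=""name"">.*?<span>(?<title>.*?)</span>",
        RegexOptions.Singleline);

    // Returns the text of the first <span> after each <td class="name">.
    public static List<string> ExtractTitles(string html)
    {
        var titles = new List<string>();
        foreach (Match match in TitleRegex.Matches(html))
        {
            titles.Add(match.Groups["title"].Value);
        }
        return titles;
    }
}
```

Calling TitleExtractor.ExtractTitles(wheretosearch) on the string from the question should return the two titles, Title1 and Title2, in that order. For real-world HTML an actual parser is still the more robust choice, as the comments on the question point out.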

urlreader
  • Ok, thank you, besides the HTML issue: "This is a text: NICE. This is a great text: NICE TOO." I would like to take "This" then unknown number of characters, then string between ": " and "." How this could be done? – Jarzyn Nov 06 '12 at 21:39

For the question in the edit, I would try something like:

This[\w\s]*: (?<title>[\w\s]+)\.

Remember you have to escape the dot at the end.
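
A small C# sketch of how this could be used to collect every occurrence, not just one. The class and method names (PhraseExtractor, Extract) are made up for this example. The character class is written as [\w\s], since a | inside square brackets is a literal pipe rather than alternation. A lazy variant, This.*?: (?<title>.*?)\., is included for comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class PhraseExtractor
{
    // [\w\s] cannot match ':' or '.', so each match stays inside one sentence.
    public static readonly Regex CharClassRegex =
        new Regex(@"This[\w\s]*: (?<title>[\w\s]+)\.");

    // Alternative: lazy quantifiers stop at the first ": " and the first ".".
    public static readonly Regex LazyRegex =
        new Regex(@"This.*?: (?<title>.*?)\.");

    // Collects the "title" group of every match found in the text.
    public static List<string> Extract(Regex regex, string text)
    {
        var results = new List<string>();
        foreach (Match match in regex.Matches(text))
        {
            results.Add(match.Groups["title"].Value);
        }
        return results;
    }
}
```

For the sample sentence from the question, both patterns should yield "NICE" and "NICE TOO". The character-class version is stricter about what may appear in the title, while the lazy version accepts any characters up to the next period.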

Everything you will ever need for regex in C# is here

A handy tool: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

slawekwin