Extracting value from string

Question

I have a problem extracting a string from HTML code (it's basically a problem with a regular expression). Here's the code:

string wheretosearch = @"
<td class=""name"">
<div>
<a href=""/addr1.html"" class=""link "">
<span>Title1</span>
</a></td>

[some code]

<td class=""name"">
<div>
<a href=""/addr2.html"" class=""link "">
<span>Title2</span>
</a></td>";

I want to extract the titles between the <span> tags. My problem is that I cannot express "an unknown number of characters" in the regex (the .* section after td class=""name""):

<td class=""name"">.*<span>(?<title>.*)</span>

To put it simply: I want the regex to find <td class=""name"">, then, after an unknown number of characters, find the first occurrence of <span> and take the value between that first <span> and </span>.

What it actually does is take the last occurrence of <span>, so it returns only the last title.

EDIT:

Okay, the HTML issue aside, the problem is this. I have the string:

"This is a text: NICE. This is a great text: NICE TOO."

I would like to match "This", then an unknown number of characters, then capture the string between ": " and ".". How could this be done?

Of course I'm interested in every occurrence of that expression, so the output would be a collection containing "NICE" and "NICE TOO".

With an expression like "This.*(?<title>.*)." I get only the "NICE TOO" string; as @urlreader mentioned, it finds the longest possible match.

Ahem ... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — driis, Nov 06 '12 at 21:31
Not a good idea to use regex for html parsing. Use [Html Agility Pack](http://htmlagilitypack.codeplex.com/) — Steve, Nov 06 '12 at 21:32
+1 for the Agility pack, works pretty great, gobbles almost any crap you throw at it. — flq, Nov 06 '12 at 21:38
Thanks for the tips of the HTML issue, but like in EDIT, it's more regex issue I think than HTML one, though I would not process HTML with regex any more ;) — Jarzyn, Nov 06 '12 at 21:53

score 1 · Answer 1 · answered Nov 06 '12 at 21:36

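<td class=""name"">.*?<span>(?<title>.*?)</span>

This happens because .* is greedy by default: the regex engine matches as many characters as it can and backtracks only as far as it must, so it runs up to the last <span>. Adding ? after the quantifier (.*?) makes it lazy, so it stops at the first place where the rest of the pattern can match. Both quantifiers need it; otherwise the title group still runs to the last </span>. Since the HTML spans several lines, the pattern also needs RegexOptions.Singleline so that . matches newlines.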
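As a rough illustration of lazy matching on the question's input, here is a minimal C# sketch. The class and method names (TitleExtractor, ExtractTitles) are made up for this example, and it assumes RegexOptions.Singleline is acceptable so that . can match across line breaks. It uses .*? both before <span> and inside the title group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the original post.
public static class TitleExtractor
{
    // Lazy quantifiers (.*?) stop at the first <span> and the first </span>.
    // Singleline lets '.' match newlines, since the HTML spans several lines.
    private static readonly Regex TitlePattern = new Regex(
        @"<td class=""name"">.*?<span>(?<title>.*?)</span>",
        RegexOptions.Singleline);

    // Returns the text of the first <span> after each <td class="name">.
    public static List<string> ExtractTitles(string html)
    {
        var titles = new List<string>();
        foreach (Match m in TitlePattern.Matches(html))
        {
            titles.Add(m.Groups["title"].Value);
        }
        return titles;
    }
}
```

Calling TitleExtractor.ExtractTitles(wheretosearch) on the string from the question should give a list with Title1 and Title2. For real-world HTML a parser is still the safer choice, as the comments point out.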

urlreader

Ok, thank you. Besides the HTML issue: given "This is a text: NICE. This is a great text: NICE TOO." I would like to match "This", then an unknown number of characters, then the string between ": " and ".". How could this be done? – Jarzyn Nov 06 '12 at 21:39

slawekwin · Accepted Answer · 2013-04-10T07:36:41.577

1

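For the question in the edit, I would try something like:

This[\w\s]*: (?<title>[\w\s]+)\.

Remember that you have to escape the dot at the end; an unescaped . matches any character. Inside a character class, | is a literal character rather than alternation, so [\w\s] is enough. Because neither class can match ':' or '.', each match stays within a single sentence.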
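To collect every occurrence rather than just one, the pattern can be run through Regex.Matches. The following is a minimal sketch; the names SentenceExtractor and ExtractValues are invented for illustration, and the character class is written as [\w\s], on the assumption that the | was not meant to be matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the original post.
public static class SentenceExtractor
{
    // [\w\s]* cannot cross ':' or '.', so each match stays inside one sentence.
    // The trailing \. is an escaped, literal dot.
    private static readonly Regex ValuePattern = new Regex(
        @"This[\w\s]*: (?<title>[\w\s]+)\.");

    // Returns the captured "title" group of every match, in order.
    public static List<string> ExtractValues(string text)
    {
        var values = new List<string>();
        foreach (Match m in ValuePattern.Matches(text))
        {
            values.Add(m.Groups["title"].Value);
        }
        return values;
    }
}
```

For the sample string "This is a text: NICE. This is a great text: NICE TOO." this should return NICE and NICE TOO. The approach only works while the captured values contain nothing but word characters and whitespace.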

Everything you will ever need for regex in C# is here

A handy tool: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

edited Apr 10 '13 at 07:36

answered Apr 10 '13 at 07:03

slawekwin
