0

I'm trying to pull out a string that sits between two other strings. To make it more complicated, the preceding content often differs.

The string I'm trying to retrieve is Christchurch.

The regex I have so far is (?<=300px">).*(?=</td). It pulls out the string I'm looking for, but it also returns dozens of other strings throughout the LARGE text file I'm searching.

What I'd like to do is limit the prefix so that it starts searching from Office: and runs all the way to 300px">, but the contents between those two strings will sometimes differ depending on user preferences.

To put it in crude non-regex terms, I want to do the following: starting at Office:, go all the way to 300px">, then find the string that starts there and ends with </td. This should result in Christchurch.

chouaib

4 Answers

3

Have you considered using the HtmlAgilityPack instead? It's a NuGet package for handling HTML, and it copes with malformed HTML pretty well. Most people on Stack Overflow would recommend against using regex to parse HTML; see: RegEx match open tags except XHTML self-contained tags

Here's how you'd do it for your example:

using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

var html = @"<tr >
               <td align=""right"" valign=""top""><strong>Office:</strong>&nbsp; </td>
               <td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
             </tr>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

// SelectSingleNode is defined on HtmlNode, so start from DocumentNode.
var node = htmlDoc.DocumentNode.SelectSingleNode("//td[@class='stippel']");
// InnerText drops any markup; Trim removes the trailing space in the cell.
Console.WriteLine(node.InnerText.Trim());

I haven't tested this code but it should do what you need.

Greg the Incredulous
  • 1
    The advantage to this is that you can probably look for the tag with your class in it and just pull out its value. – adamdc78 Mar 03 '15 at 00:36
0

The issue you're encountering is that * is greedy. Use the lazy/reluctant version *?.

Office:[\s\S]*?300px">(.*?)</td

This solution uses a capturing group rather than look-arounds, so the value you want is in group 1 of the match rather than in the match itself.
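Here is a rough sketch of how that pattern might be used from C#. The language is an assumption on my part (the HtmlAgilityPack answer is also C#), the sample HTML is copied from that answer, and the names OfficeCapture and TryExtract are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OfficeCapture
{
    // Lazy quantifiers: [\s\S]*? stops at the first 300px"> after Office:,
    // and (.*?) stops at the first </td after that.
    private static readonly Regex OfficeRegex =
        new Regex(@"Office:[\s\S]*?300px"">(.*?)</td");

    // Hypothetical helper: returns true and the trimmed contents of group 1 on a match.
    public static bool TryExtract(string html, out string office)
    {
        Match m = OfficeRegex.Match(html);
        office = m.Success ? m.Groups[1].Value.Trim() : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string html = @"<tr >
  <td align=""right"" valign=""top""><strong>Office:</strong>&nbsp; </td>
  <td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
</tr>";

        if (TryExtract(html, out string office))
        {
            Console.WriteLine(office); // prints: Christchurch
        }
    }
}
```

Because [\s\S] also matches newlines, it should not matter how many lines sit between Office: and 300px">. Cells that contain 300px"> but appear before Office: are ignored, since the match has to begin at Office:.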

adamdc78
0

I guess you need something like this:

office.*\n.*|(?<=300px">).*(?=<\/td)
chouaib
0

Thanks to the posts from adamdc78 and Greg, I have been able to come up with the regex below. This is exactly what I needed.

Thanks for your help.

(?<=office.*\n.*300px">).*(?=<\/td)
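For anyone reusing this, here is a minimal sketch of how the pattern could be applied in C#. It assumes the .NET regex engine, which accepts a variable-length look-behind like this one (many other engines do not). RegexOptions.IgnoreCase is my own addition here, because the pattern says office while the sample HTML says Office:. The names OfficeLookbehind and TryExtract are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OfficeLookbehind
{
    // The look-behind requires "office", the rest of that line, a newline,
    // and then 300px"> on the following line, immediately before the match.
    private static readonly Regex OfficeRegex = new Regex(
        @"(?<=office.*\n.*300px"">).*(?=<\/td)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns true and the trimmed match value on success.
    public static bool TryExtract(string html, out string office)
    {
        Match m = OfficeRegex.Match(html);
        office = m.Success ? m.Value.Trim() : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string html = @"<tr >
  <td align=""right"" valign=""top""><strong>Office:</strong>&nbsp; </td>
  <td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
</tr>";

        if (TryExtract(html, out string office))
        {
            Console.WriteLine(office); // prints: Christchurch
        }
    }
}
```

Note that this version only matches when the 300px"> cell is on the line directly after the one containing Office, because the look-behind allows exactly one \n between them.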
  • 1
    welcome to StackOverflow: you should accept their answers (since they helped) and not add a *thank you answer* – chouaib Mar 03 '15 at 01:45