C# Regex to get string between two strings with a wildcard string in between?

Question

I know this has been asked in some capacity, but I have not been able to find a working example of the solution yet. I know that the Html Agility Pack can parse HTML strings, but I do not wish to download or install it. I get the contents of a webpage using

string html = client.DownloadString("http://yoursite.com/page.html");

I have <td> tags that have a class, but some of those tags also have their own id, style, etc. For example:

<td>I Dont want this</td>
<td class="myClass">I want this</td>
<td class="myClass" id="myID">I want this</td>
<td style="border-top-width: 0px; class="myClass">I want this</td>

I tried

<td>(.*?)</td>

But it matches only the <td> tags that have no class, id, etc.

I tried

<td class="myClass"[^>]*>(.*?)</td>

But it returns only the second and third <td> values, not the fourth. How can I add a wildcard so that it returns any <td> with myClass and ignores anything that comes before or after it, such as id or style?

I'm compelled to point you to [this rather famous question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Jonesopolis, Aug 09 '16 at 17:42
That part of HTML will ALWAYS have the same format as specified by the question. There will not be any errors or overloads that browsers autocomplete and autocorrect for. — KingsInnerSoul, Aug 09 '16 at 17:50
Just add another `[^>]*` before the `class` attribute. If your HTML is as consistent as you say, that should be sufficient. — Alan Moore, Aug 09 '16 at 18:57
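
The suggestion in the last comment can be sketched in C# roughly as follows. This is a minimal sketch, assuming the markup is as regular as the question states; the class name TdClassFilter and the method ExtractMyClassCells are hypothetical names introduced for illustration, and RegexOptions.Singleline is an added assumption so that a cell spanning several lines is still captured.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TdClassFilter
{
    // Hypothetical helper: returns the inner text of every <td> whose
    // opening tag contains class="myClass", wherever the attribute sits.
    public static List<string> ExtractMyClassCells(string html)
    {
        // [^>]* before and after the class attribute skips any other
        // attributes (id, style, ...) without leaving the opening tag.
        var pattern = @"<td[^>]*class=""myClass""[^>]*>(.*?)</td>";
        var cells = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            cells.Add(m.Groups[1].Value);
        }
        return cells;
    }
}
```

Calling TdClassFilter.ExtractMyClassCells(html) on the string returned by DownloadString would give one entry per matching cell. Because [^>]* cannot cross a '>', a plain <td> without attributes is skipped.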

score 0 · Answer 1 · 2016-08-09T18:25:11.533

This matches a <td> tag only if it has a class or an id attribute.
It passes if the tag has either one.

If you require a class value and the id is optional, change the conditional
to (?(class)|(?!))

After it finds the opening tag, this method just finds the very next closing tag.
(Note that it doesn't check whether the opening <td is a self-closing tag.
If that's possible, add (?<!/>) right after the atomic group, i.e. (?>..)(?<!/>))

The class and id values are in their named capture groups.

Verbatim

@"(?is)<td(?=\s)(?>(?:(?<=\s)class\s*=\s*""(?<class>[^""]*)""|(?<=\s)id\s*=\s*""(?<id>[^""]*)""|"".*?""|'.*?'|[^>]*?)+>)(?(class)|(?(id)|(?!))).*?</td\s*>"
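
As a rough illustration of how the named groups could be read back, here is a minimal sketch that uses the verbatim pattern unchanged. The class TdAttributeScanner and the method Describe are hypothetical names, and joining the two values with '|' is just a choice made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TdAttributeScanner
{
    // The verbatim pattern from this answer, unchanged.
    private static readonly Regex TdWithClassOrId = new Regex(
        @"(?is)<td(?=\s)(?>(?:(?<=\s)class\s*=\s*""(?<class>[^""]*)""|(?<=\s)id\s*=\s*""(?<id>[^""]*)""|"".*?""|'.*?'|[^>]*?)+>)(?(class)|(?(id)|(?!))).*?</td\s*>");

    // Hypothetical helper: one "class|id" string per matched <td>.
    public static List<string> Describe(string html)
    {
        var found = new List<string>();
        foreach (Match m in TdWithClassOrId.Matches(html))
        {
            // A group that did not take part in the match has an empty Value.
            found.Add(m.Groups["class"].Value + "|" + m.Groups["id"].Value);
        }
        return found;
    }
}
```

A <td> with no attributes at all is rejected early by the (?=\s) lookahead, so it never reaches the conditional.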

Expanded

 (?is)
 < td                   # 'td' tag, or any tag for that matter
 (?= \s )
 (?>                    # Atomic grouping
      (?:
           (?<= \s )
           class  \s* = \s*       # 'class' attribute
           "
           (?<class>              # 'class' value                                                      
                [^"]*                  
           )
           "
        |  (?<= \s )
           id  \s* = \s*          # 'id' attribute
           "
           (?<id>                 # 'id' value                                                      
                [^"]*
           )
           "
        |  " .*? "
        |  ' .*? '
        |  [^>]*? 
      )+
      >
 )
 (?(class)              # Conditional - Only tags with our 'class' or 'id' attr/value
   |  
      (?(id)
        |  (?!)
      )
 )
 .*? 
 </td \s* >
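
The class-only variant mentioned earlier, with the conditional changed to (?(class)|(?!)), might look like this in C#. It is a sketch under the assumption that only the conditional changes and the rest of the pattern stays as posted; TdClassOnly and ClassValues are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TdClassOnly
{
    // Same pattern as the answer, except the conditional now fails
    // via (?!) unless the 'class' group captured something.
    private static readonly Regex TdWithClass = new Regex(
        @"(?is)<td(?=\s)(?>(?:(?<=\s)class\s*=\s*""(?<class>[^""]*)""|(?<=\s)id\s*=\s*""(?<id>[^""]*)""|"".*?""|'.*?'|[^>]*?)+>)(?(class)|(?!)).*?</td\s*>");

    // Hypothetical helper: the class value of every <td> that has one.
    public static List<string> ClassValues(string html)
    {
        var values = new List<string>();
        foreach (Match m in TdWithClass.Matches(html))
        {
            values.Add(m.Groups["class"].Value);
        }
        return values;
    }
}
```

Because the opening tag is consumed inside an atomic group, a failed conditional cannot make the engine retry the tag with a different split; it simply moves on to the next <td.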

Btw, on your last line, <td style="border-top-width: 0px; class="myClass">I want this</td>,
the style attribute is missing its closing quote, so the style value encloses the class= part:
style="border-top-width: 0px; class="

score 0 · Answer 2 · NotGI · 2016-08-09T18:10:48.180

That should do it: <td(.+|)(class="myClass")(.+|)>(.+)<\/td>

Live example: https://regex101.com/r/gG6gH0/2

But if the list is in a different format than you described, you must exclude the '<' and '>' characters from the capturing groups.
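
One way to read that advice is to replace each (.+) with a character class that cannot contain angle brackets. The following is only a sketch of that idea, not the pattern from the live example; TdNoAngleBrackets and ExtractCells are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TdNoAngleBrackets
{
    // Hypothetical helper: inner text of each <td> carrying class="myClass".
    public static List<string> ExtractCells(string html)
    {
        // [^<>] keeps every part of the match inside a single tag or a
        // single cell, so several cells on one line stay separate.
        var pattern = @"<td[^<>]*class=""myClass""[^<>]*>([^<>]+)</td>";
        var cells = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            cells.Add(m.Groups[1].Value);
        }
        return cells;
    }
}
```

The trade-off is that a cell containing nested tags (for example a <b> inside the <td>) no longer matches, since its content includes '<'.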

edited Aug 09 '16 at 18:10

answered Aug 09 '16 at 18:04
