
I have run into a problem trying to use Regex to copy a string out of an HTML string, and it's not giving me what I need.

I am trying to get the table out of a Div, and I have tried this with Regex:

string strLook = respData2;
string s = Regex.Match(strLook, "<div id='results'(.+)</div>", RegexOptions.Singleline).Groups[1].Value;

This just gives me the whole HTML string.

I have also tried using Substring:

int starPos = strLook.LastIndexOf("<div id='results'>") + "<div id='results'>".Length + 1;
int length = strLook.IndexOf("</div>") - starPos;
string sub = strLook.Substring(starPos, length);

When I step through the Substring code, the debugger shows starPos as 18. I know that the Div is not 18 characters from the beginning of the string (unless I misunderstand what the 18 refers to), and this approach isn't getting the Div either.

So how do I get the contents of the Div, which is an HTML table, so I can write it to a file?

Thanks

Chris
  • Don't parse an HTML string with regex. – Avinash Raj Sep 03 '14 at 12:51
  • I see two problems at a glance: 1. Using regex to parse HTML 2. Using the greedy quantifier `.+` – Jerry Sep 03 '14 at 12:52
  • This is the age-old 'parsing HTML with regex' subject. Google it, it'll become quickly apparent that you shouldn't. You may even find yourself at one of SO's infamous answers along the way. –  Sep 03 '14 at 12:52
  • You should take a look at HTML Agility Pack http://htmlagilitypack.codeplex.com which is a powerful and light tool for working with HTML – Aymeric Sep 03 '14 at 12:52
  • @Aymeric, I just started using HAP the other day. I can use HAP to extract the table, but since I will have multiple tables, I need to grab the results Div from each page, place all the ones I need into a single file, and then use HAP to extract the tables. Or maybe I am going about it the wrong way. I can't find any good tutorials on it, and the documentation on the site is nil – Chris Sep 03 '14 at 12:57
  • @Jerry, I would rather use SubString than RegEx because from what I hear RegEx can take a long time – Chris Sep 03 '14 at 12:58
  • http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack Did you go through this SO post? – Aymeric Sep 03 '14 at 12:59
  • @Aymeric, I used google. I will check this link now – Chris Sep 03 '14 at 13:00
  • There are plenty of links on this topic which are pretty explicit, and honestly I'm almost 100% sure you won't find a better tool than HAP for working with HTML – Aymeric Sep 03 '14 at 13:02
  • I like what I am seeing with HAP so far, but I'm going to have about 50 HTML strings that I need to extract the tables from and then put into a DataTable. – Chris Sep 03 '14 at 13:04
  • This is the comp sci 101 solution, not the best, but if this is a one-time thing and you don't want to use a tool, you need to account for div tags inside the div you want. So make an open-div-count variable and set it to 1. Then compare the indexes of the next "<div" and "</div>": if the open div tag comes first (its index is less), add 1 to your open-div-count, else subtract 1. Either way, add the shorter substring to your result, rinse and repeat. You're done when your open-div-count variable is 0. – Dudeman3000 Sep 03 '14 at 15:12
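
The counting approach from the last comment can be sketched roughly as follows. This is only an illustration under assumptions, not a robust HTML parser: the class name `DivExtractor` and method `TryExtractInnerHtml` are made up for this example, it uses only `String.IndexOf` and `Substring`, and it would miscount on `<div` text inside comments, scripts, or attribute values. Incidentally, `"<div id='results'>".Length` is 18, so a `starPos` of 18 is what `-1 + 18 + 1` gives when `LastIndexOf` finds nothing, which suggests the page's markup may not match that literal exactly (for instance double quotes or extra attributes).

```csharp
using System;

public static class DivExtractor
{
    // Hypothetical helper: finds the first tag starting with openTagPrefix
    // (e.g. "<div id='results'") and returns everything up to its matching
    // </div>, counting nested divs along the way.
    public static bool TryExtractInnerHtml(string html, string openTagPrefix, out string inner)
    {
        const StringComparison Cmp = StringComparison.OrdinalIgnoreCase;
        inner = string.Empty;

        int start = html.IndexOf(openTagPrefix, Cmp);
        if (start < 0) return false;              // IndexOf returns -1 when not found

        int contentStart = html.IndexOf('>', start);
        if (contentStart < 0) return false;       // opening tag is never closed
        contentStart++;                           // first character after '>'

        int depth = 1;                            // the "open-div-count" from the comment
        int pos = contentStart;
        while (depth > 0)
        {
            int nextOpen = html.IndexOf("<div", pos, Cmp);
            int nextClose = html.IndexOf("</div", pos, Cmp);
            if (nextClose < 0) return false;      // unbalanced markup

            if (nextOpen >= 0 && nextOpen < nextClose)
            {
                depth++;                          // a nested div opens first
                pos = nextOpen + 4;               // move past "<div"
            }
            else
            {
                depth--;                          // a div closes first
                if (depth == 0)
                {
                    inner = html.Substring(contentStart, nextClose - contentStart);
                    return true;
                }
                pos = nextClose + 5;              // move past "</div"
            }
        }
        return false;
    }
}
```

A call such as `DivExtractor.TryExtractInnerHtml(respData2, "<div id='results'", out string table)` would then leave the table markup in `table` when it returns true. Passing the opening tag without the trailing `>` lets the div carry extra attributes.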

1 Answer


Don't parse HTML with Regular Expressions. Use HTML Agility Pack instead.

Old project page: http://htmlagilitypack.codeplex.com/
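
For the case in the question, a minimal sketch with HTML Agility Pack might look like the following. It assumes the `HtmlAgilityPack` NuGet package is referenced and that the div really carries `id="results"`. The class name `ResultsDivReader`, its method names, and the output path are placeholders invented for this example.

```csharp
using System.IO;
using HtmlAgilityPack;   // from the HtmlAgilityPack NuGet package

public static class ResultsDivReader
{
    // Returns the markup inside <div id="results">, or an empty string
    // when the document has no such div.
    public static string GetResultsInnerHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='results']");
        return div == null ? string.Empty : div.InnerHtml;
    }

    // Appends the extracted markup to a file, so that several response
    // strings can be collected into one file (path is a placeholder).
    public static void AppendResultsTo(string html, string path)
    {
        File.AppendAllText(path, GetResultsInnerHtml(html));
    }
}
```

Calling `ResultsDivReader.AppendResultsTo(respData2, "results.html")` once per response would gather the tables from all of the pages in a single file. Because the parser tracks nesting itself, inner divs and quote style do not need special handling here.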

wp78de
Samuel Neff