I'm having a problem using Html Agility Pack to extract data from websites. The page source loaded by Html Agility Pack is not the same as the source shown by the browser's View Page Source menu. Here is the difference:

View Page Source:

<td>
    <span style="color:#158EF7; font-weight:bold">
        ABCDEF
    </span>
</td>

Source returned by Html Agility Pack:

<td>
    <font face="Arial" color="#404040" size="2">
        <span style="color:#158EF7; font-weight:bold">
            ABCDEF
        </span>
    </font>
</td>

I don't know why there is such a big difference. Maybe it is caused by JavaScript or something else. I don't need to know the cause, though; I just want to get exactly the same source that the View Page Source menu shows. How can I achieve that? Thanks for your help.

Triet Doan

1 Answer

I had this problem too when parsing Google results to find web pages and their positions. Exactly as you say, the string that I fetched differed from the page source.

If I remember correctly, every request you send includes an HTTP header called User-Agent (read more in the Wikipedia article about user agents). It tells the site you are parsing what kind of browser you are, or whether you are a web-crawling robot.

Problem
After many hours I saw that I was effectively sending an empty User-Agent string. It was actually set to a default value, but I didn't know that at the time. This in turn made Google believe the HTTP request came not from a browser but from a mere mechanical spider, a.k.a. a web crawler.

Solution
Try setting your User-Agent to that of the browser you are using. That should give you a string much closer to the page source.
But if the site runs scripts that change the content according to whatever they have scripted, that's a whole other story.

For the different User-Agent strings, check the User-agent string list.
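
As a rough illustration of the suggestion above, here is a minimal sketch of setting the User-Agent through Html Agility Pack's `HtmlWeb` class before loading a page. The URL and the User-Agent value are placeholders I made up for the example (the string follows the Firefox format; copy the exact one your own browser sends), and the XPath simply targets the `<td><span>` structure from the question.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical target URL; replace it with the page you are scraping.
        const string url = "http://example.com/";

        var web = new HtmlWeb
        {
            // Example value only (an assumption): use the string your own browser sends.
            UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0"
        };

        // The request now identifies itself as a browser instead of using the default agent.
        HtmlDocument doc = web.Load(url);

        // Look for the <td><span> structure shown in the question.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//td/span");
        Console.WriteLine(span != null ? span.InnerText.Trim() : "(no <td><span> found)");
    }
}
```

This only changes what the server is told about the client. Content that the site builds or rewrites with scripts after the page loads will still be missing, since Html Agility Pack does not execute JavaScript.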

8bitcat
  • It works. Thanks a lot. May I ask one more question? I'm using Firefox. When I click `Inspect Element`, it automatically changes something, like inserting `<tbody>` in the `<table>`. Can I make `Html Agility` do that too?
    – Triet Doan Jan 20 '14 at 05:24
  • Check here, it might help you: http://stackoverflow.com/questions/938083/why-do-browsers-insert-tbody-element-into-table-elements – 8bitcat Jan 20 '14 at 07:16