I am trying to scrape text from an .htm file exactly as it appears in the source, including the HTML entities (hex codes) used for French characters. After scraping, however, the entities are converted into the actual characters, which I don't want. Any ideas? I have searched everywhere for this without finding an answer, so I registered here to ask.

src = UserForm1.WebBrowser1.document.body.innerHTML

The line above is the code I use to get the source.

The specific source text I want is below; the parts in bold are converted to characters in the extracted source.

"Intel Inside<sup>**&reg;**</sup><br>pour une<br>productivit**&#xE9;**<br>exceptionnelle."

but instead I am getting:

"Intel Inside<sup>®</sup><br>pour une<br>productivité<br>exceptionnelle."

How do I get the text I mentioned first? I have just started with VBA, so I might sound naive; please be a little gentle with me.

Thanks:)

Suman Kumar
  • Can you share the URL? –  Nov 21 '16 at 14:36
  • It's in file://****/Test/sachin_HTML/Test%20File/204217_ca_cs_sb_fy17q4wk6_oa_sb-performance-high_fr_160x600_vr_index.html – Suman Kumar Nov 21 '16 at 15:01
  • OK I have got another link as an example: http://www.dell.com/fr/p/laptops?dgc=IR&cid=Q3_New_LT_Portfolio&lid=469x208_P_homepage:_r_3_c_2_t_0....In view source look in line number:559 "title="Système d'exploitation"" output looks like: "title="Système d'exploitation"" – Suman Kumar Nov 21 '16 at 16:08

1 Answer

You can get the raw HTML by requesting the file with MSXML2.XMLHTTP60 instead of reading it from the WebBrowser control. In the code below the request object is named IE, and IE.responseText holds the unprocessed HTML. As soon as you load that raw HTML into HTMLBody, the entities are converted into characters.

Because IE.responseText is plain text, you will have to parse it manually. I would recommend using RegEx to do so.

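Public Sub ParseMaterial()
    Const FILE_URL = "D:\test.html"

    ' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library"
    Dim IE As MSXML2.XMLHTTP60
    Set IE = New MSXML2.XMLHTTP60

    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim HTMLBody As MSHTML.HTMLBody

    Set HTMLDoc = New MSHTML.HTMLDocument
    Set HTMLBody = HTMLDoc.body

    ' Synchronous request (third argument False): send returns once the response is complete
    IE.Open "GET", FILE_URL, False
    IE.send

    ' Defensive wait; normally already 4 (done) after a synchronous send
    While IE.ReadyState <> 4
        DoEvents
    Wend

    ' Loading the text into the DOM decodes the entities
    HTMLBody.innerHTML = IE.responseText

    Debug.Print "HTMLBody.innerHTML"
    Debug.Print HTMLBody.innerHTML
    Debug.Print
    Debug.Print "Raw HTML:  IE.responseText"
    Debug.Print IE.responseText
End Sub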
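
The RegEx suggestion above could look roughly like the following. This is only a minimal sketch, not a general HTML parser: the helper name ExtractAttributeValues, the DemoExtract routine, and the sample string with its &#232; entity are all made up for illustration, and the pattern assumes a simple attribute name and double-quoted attribute values.

```vba
' Hypothetical helper: returns the raw value of every attrName="..." found in rawHtml.
' Entities such as &#xE9; are returned exactly as they appear in the source text.
Public Function ExtractAttributeValues(ByVal rawHtml As String, _
                                       ByVal attrName As String) As Collection
    Dim re As Object
    Dim m As Object
    Dim results As New Collection

    ' Late binding, so no reference to the VBScript regular expressions library is needed
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Matches attrName = "value" and captures value (double-quoted values only)
    re.Pattern = "\b" & attrName & "\s*=\s*""([^""]*)"""

    For Each m In re.Execute(rawHtml)
        results.Add m.SubMatches(0)
    Next m

    Set ExtractAttributeValues = results
End Function

' Illustrative usage with a made-up HTML fragment
Public Sub DemoExtract()
    Dim v As Variant
    For Each v In ExtractAttributeValues("<a title=""Syst&#232;me d'exploitation"">x</a>", "title")
        Debug.Print v
    Next v
End Sub
```

In real use you would pass IE.responseText as the first argument. Because the string never goes through the DOM, the entity in each captured value stays undecoded. Note that the pattern would also match attributes whose names end in the given name after a hyphen (for example data-title), so tighten it if that matters.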
  • OK, one follow-up question: can we call .getElementsByClassName on .responseText? – Suman Kumar Nov 21 '16 at 20:19
  • No, responseText is just an ordinary string of characters. There are no elements. The answer to [Extract values from HTML TD and Tr](http://stackoverflow.com/questions/8776484/extract-values-from-html-td-and-tr) should give you an idea of what you need to do. –  Nov 21 '16 at 22:24