2

I want to retrieve a table from the URL of https://s.cafef.vn/screener.aspx#data using VBA. This task is difficult because the table contains JSON data embedded in an html file.

Taking @Tomalak ‘s advice, I am trying to split up my task; solving four following individual problems one after another:

  1. Send an HTTP request to have the HTML
  2. Locate the JSON string
  3. Parse JSON with VBA and then
  4. Loop over the raw data from the JSON and write into an Excel table.

Extract a JSON DATA table in html using VBA; converting Apps Script into VBA

However, I get stuck at Step 2, the response text that I get is stored in htmlTEXT. Its print-out looks like below attached, but the problem is as a string variable, htmlTEXT can hold up only a small part of the html page content. The JSON paragraph does not lie on the top part of the html page and is therefore not returned into htmlTEXT.

My questions are:

  1. How can we get the whole content of the html page (with the JSON paragraph included)?

  2. Once the JSON paragraph is captured, what Regular Expression can be used to extract the JSON paragraph ?

Noticing that the JSON paragraph starts with [{ and ends with }], I therefore use the pattern [{*}] but it does not work at all, (though it works with pattern like (D.C); resulting in DOC for my testing purpose)

What is wrong with my code?


Sub ExtractJSON_in_html()
    ' =====send an HTTP request with VBA ====
    Dim JSONtext As String
    Dim htmlTEXT As String
    Dim SDI As Object

    Set objHTTP = CreateObject("MSXML2.XMLHTTP")
    Url = "https://s.cafef.vn/screener.aspx#data"
  
    objHTTP.Open "GET", Url, False
    objHTTP.send
    htmlTEXT = objHTTP.responsetext
   
    MsgBox htmlTEXT

    ' ===== Locate the JSON string  =======
    Set SDI = CreateObject("VBScript.RegExp")
    SDI.IgnoreCase = True
    SDI.Pattern = "[{*}]"
    SDI.Global = True

    Set theMatches = SDI.Execute(htmlTEXT)

    For Each Match In theMatches     
        'MsgBox Match.Value
        JSONtext = Match.Value
    Next
End Sub

htmlTEXT:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
-- JASON Paragraph var jsonData = [{"Url":"http://s.cafef.vn/upcom/A32-cong-ty-co-phan-32.chn","CenterName":"UpCom","Symbol":"A32","TradeCenterID":9,"ChangePrice":0,"VonHoa":212.84,"ChangeVolume":400,"EPS":6.19220987764706,"PE":5.0547382305287,"Beta":0,"Price":0,"UpdatedDate":"\/Date(1625562652463)\/","FullName":"Công ty cổ phần 32","ParentCategoryId":0
{"Url":"http://s.cafef.vn/upcom/YTC-cong-ty-co-phan-xuat-nhap-khau-y-te-thanh-pho-ho-chi-minh.chn","CenterName":"UpCom","Symbol":"YTC","TradeCenterID":9,"ChangePrice":0,"VonHoa":170.8,"ChangeVolume":200,"EPS":-4.29038514857143,"PE":-14.217837766922,"Beta":0,"Price":0,"UpdatedDate":"\/Date(1625562969277)\/","FullName":"Công ty Cổ phần Xuất nhập khẩu Y tế Thành phố Hồ Chí Minh","ParentCategoryId":0}];
Pᴇʜ
  • 56,719
  • 10
  • 49
  • 73
Cao Doremi
  • 91
  • 6
  • 3
    The problem I encountered is that `objHTTP.Status` is `404` **Not Found**. So Probably some cookies or whatever is needed (you have to figuer this out). You need to get `200` as status which means **OK**, otherwise your htmltext is not what you expect. • Second problem is your regex pattern is wrong. To get the json string you need `\[\{.*\}\]` as a pattern. – Pᴇʜ Jul 08 '21 at 07:19
  • 1
    `"\(([^)]+)\)"` is what I use to get string between and including parenthesis. May be `"\{([^}]+)\}"` this will give you string between and including curly braces. ... \[\{.*\}\] will give you all the text between and including first and last brackets – Naresh Jul 08 '21 at 07:44
  • 1
    Tried `"\{([^}]+)\}"` with `"[{'Url':'http://s.cafef.vn/upcom/A32-cong-ty-co-phan-32.chn'},{'Url':'http://s.cafef.vn/upcom/YTC-cong-ty-co-phan-xuat-nhap-khau-y-te-thanh-pho-ho-chi-minh.chn'}]"` .. it works.. gives collection of strings between and including curly braces. – Naresh Jul 08 '21 at 08:07
  • 1
    I think it doesn't work with xhr, because if I load the main page `https://cafef.vn/` I get the responseText `` That's only an empty body with an `onload` event which gets the real content over JS and JS is not working with xhr. I think the better approach here is SeleniumBasic to use the Chrome browser to make the job with JS. You can look here at the answer of YasserKhalil how to install SeleniumBasic and the latest WebDriver: https://stackoverflow.com/questions/57216623/using-google-chrome-in-selenium-vba-installation-steps – Zwenn Jul 08 '21 at 08:59

3 Answers3

3

This will return the JSON string as a Dictionary object for you to work through:

You will need JsonConverter (and reference to Microsoft Scripting Runtime for Dictionary object)

Private Sub Test()
    Dim xmlhttp As Object
    Set xmlhttp = CreateObject("MSXML2.XMLHTTP")
    
    xmlhttp.Open "GET", "https://s.cafef.vn/screener.aspx"
    xmlhttp.send
    
    Dim jsonStr As String
    jsonStr = Mid$(xmlhttp.responseText, InStr(xmlhttp.responseText, "[{"))
    jsonStr = Left$(jsonStr, InStr(jsonStr, "}]") + 1)
    
    Dim jsDict As Scripting.Dictionary
    Set jsDict = JsonConverter.ParseJson("{""results"":" & jsonStr & "}")
    
    Debug.Print jsDict("results").Count '1874
End Sub

Note: The original URL in your question returns 404 error, you just need to remove #data from the URL.

Raymond Wu
  • 3,357
  • 2
  • 7
  • 20
3

I would want more certainty over matching the correct JavaScript object than given by the current Instr methods (which could be extended to include the var jsonData pattern as well.) In case of using regex then the following pattern can be used, which will allow for line break matching. Note, you only need one entire match then parse the JavaScript array returned with a json parser.


Public Sub ExtractJSON_in_html()
    ' =====send an HTTP request with VBA ====
    Dim JSONtext As String
    Dim htmlTEXT As String
    Dim SDI As Object

    Set OBJHTTP = CreateObject("MSXML2.XMLHTTP")
    URL = "https://s.cafef.vn/screener.aspx"
  
    OBJHTTP.Open "GET", URL, False
    OBJHTTP.setRequestHeader "User-Agent", "Mozilla/5.0"
    OBJHTTP.send
    htmlTEXT = OBJHTTP.responseText
   
    MsgBox htmlTEXT

    ' ===== Locate the JSON string  =======
    Set SDI = CreateObject("VBScript.RegExp")
    SDI.IgnoreCase = True
    SDI.Pattern = "var\sjsonData\s=\s([\s\S].*)?;"
    
    WriteTxtFile SDI.Execute(htmlTEXT)(0).SubMatches(0)
End Sub

 Public Sub WriteTxtFile(ByVal aString As String, Optional ByVal filePath As String = "C:\Users\<user>\Desktop\Test.txt")
    Dim fso As Object, Fileout As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set Fileout = fso.CreateTextFile(filePath, True, True)
    Fileout.Write aString
    Fileout.Close
End Sub

Regex:

enter image description here


Sample of treeview of result:

Array with 1874 elements; 1 expanded.

QHarr
  • 83,427
  • 12
  • 54
  • 101
1

Edited your macro.. This will add a worksheet and parse JSON text from Range A1

Option Explicit

Sub ExtractJSON_in_html()
Dim JSONtext As String, JSONtextarr() As String, Url As String
Dim htmlTEXT As String, colHead As String
Dim SDI As Object, objHTTP As Object, theMatches As Object, Match As Variant
Dim StartPos As Long, endPos As Long, i As Long

Set objHTTP = CreateObject("MSXML2.XMLHTTP")
Url = "https://s.cafef.vn/screener.aspx"

' =====send an HTTP request with VBA ====
objHTTP.Open "GET", Url, False
objHTTP.send
htmlTEXT = objHTTP.responseText
StartPos = InStr(1, htmlTEXT, "var jsonData = [", vbTextCompare)
endPos = InStr(StartPos, htmlTEXT, "]", vbTextCompare)
htmlTEXT = Replace(Mid(htmlTEXT, StartPos, endPos - StartPos + 1), ",""", ";")

' ===== Make the JSON strings collecton  =======
Set SDI = CreateObject("VBScript.RegExp")
SDI.IgnoreCase = True
SDI.Global = True


SDI.Pattern = "[^a-zA-Z0-9&{}./:;,-]"
htmlTEXT = SDI.Replace(htmlTEXT, "")

SDI.Pattern = "\{([^}]+)\}"
Set theMatches = SDI.Execute(htmlTEXT)
JSONtext = ""
Debug.Print theMatches.Count
For Each Match In theMatches
    JSONtext = JSONtext & Match.Value & ","
Next

' ===== Populate new worksheet with parsed JSON =======
JSONtext = Replace(Mid(JSONtext, 2, Len(JSONtext) - 3), ",ParentCategoryId", ",,ParentCategoryId", , , vbTextCompare)
JSONtextarr = Split(JSONtext, "},{", , vbTextCompare)
Worksheets.Add
Range("A2").Resize(UBound(JSONtextarr) + 1, 1).Value = Application.Transpose(JSONtextarr)

Range("A2").CurrentRegion.TextToColumns Destination:=Range("A2"), DataType:=xlDelimited, _
    TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=False, Tab:=False, _
    Semicolon:=True, Comma:=False, Space:=False, Other:=False, TrailingMinusNumbers:=True

Debug.Print Range("A2").CurrentRegion.Columns.Count
For i = 1 To Range("A2").CurrentRegion.Columns.Count
colHead = Split(Cells(2, i), ":")(0)
Cells(1, i) = colHead
Range("A2").CurrentRegion.Columns(i).Replace What:=colHead & ":", Replacement:="", LookAt:=xlPart, _
    SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
    ReplaceFormat:=False
Next i


End Sub
Naresh
  • 2,984
  • 2
  • 9
  • 15