29

Convert HTML-table to Excel

The code below fetches the HTML-table at https://rasmusrhl.github.io/stuff, and converts it to Excel-format.

The problem is that:

  • Numbers in parentheses are converted to negative numbers
  • Numbers are rounded or truncated

Solution

Thank you all for your great contributions. The varied anwers helped me understand, that for my purposes a workaround was the best solution: Because I generate the HTML-tables myself, I can control the CSS of each cell. CSS codes exists that instruct Excel how to interpret cell contents: http://cosicimiento.blogspot.dk/2008/11/styling-excel-cells-with-mso-number.html, also explained in this question: Format HTML table cell so that Excel formats as text?

In my case the CSS should be text, which is mso-number-format:\"\\@\". It is integrated in R code below:

library(htmlTable)
library(nycflights13)
library(dplyr)

nycflights13::planes %>% 
    slice(1:10) %>% mutate( seats = seats*1.0001,
                            s1    = c("1-5", "5-10", "1/2", "1/10", "2-3", "1", "1.0", "01", "01.00", "asfdkjlæ" ),
                            s2    = c("(10)", "(12)", "(234)", "(00)", "(01)", "(098)", "(01)", "(01.)", "(001.0)", "()" )) -> df 


rle_man <- rle(df$manufacturer)

css_matrix <- matrix( data = "mso-number-format:\"\\@\"", nrow = nrow(df), ncol = ncol(df))
css_matrix[,1] <- "padding-left: 0.4cm;mso-number-format:\"\\@\""
css_matrix[,2:10] <- "padding-left: 1cm;mso-number-format:\"\\@\""
css_matrix[,5] <- "padding-left: 2cm;mso-number-format:\"\\@\""


htmlTable( x = df,  
           rgroup   = rle_man$values, n.rgroup = rle_man$lengths, 
           rnames   = FALSE, align = c("l", "r" ), 
           cgroup   =  rbind(  c("", "Some text goes here. It is long and does not break", "Other text goes here", NA),
                               c( "", "Machine type<br>(make)", "Specification of machine", "Other variables")),
           n.cgroup = rbind(   c(1,8,2, NA),
                               c(1, 3, 5, 2)), 
           css.cell = css_matrix )            -> html_out

temp_file <- tempfile( pattern = "table", fileext = ".html" )
readr::write_file( x = html_out, path = temp_file)
utils::browseURL( temp_file)

That HTML-file can be dragged and dropped into Excel with all cells interpreted as text. Note, only dragging-and-dropping the html-file into excel works, it does not work to open the table in a browser and copy-pasting it into excel.

The only thing missing from this method is the horizontal lines, but I can live with that.

Below is VBA with the same effect as dragging and dropping:

Sub importhtml()
'
' importhtml Macro
'

'
With ActiveSheet.QueryTables.Add(Connection:= _
                                 "URL;file:///C:/Users/INSERTUSERNAME/Desktop/table18b85c0a20f3html.HTML", Destination:=Range("$a$1"))

.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = False
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlEntirePage
.WebFormatting = xlWebFormattingAll
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = True
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With

End Sub
Rasmus Larsen
  • 5,721
  • 8
  • 47
  • 79
  • 1
    Excel number formatting - is it set to Accounting format or something like that? If so, then a number in brackets/parenthesis is negative... see [MSDNL: WebFormatting property](https://msdn.microsoft.com/en-us/library/office/aa215962(v=office.11).aspx) - have you tried *.WebFormatting = xlWebFormattingNone* – Our Man in Bananas Jan 08 '18 at 15:33
  • 1
    Isn't it simpler to just add the `mso-number-format` style in the source html (where it's needed) ? – CommonSense Jan 12 '18 at 10:19
  • @CommonSense maybe it is. What should I append to each cell of the HTML-table, to make excel treat it like text? – Rasmus Larsen Jan 12 '18 at 12:37
  • Are you sure that numbers in () do not represent negative values? – MarcinSzaleniec Jan 12 '18 at 12:52
  • @MarcinSzaleniec yes I am sure, because I created the table myself :) (in R). The numbers in parentheses actually represent percentages, because some scientific journals requires that format for percentages. – Rasmus Larsen Jan 12 '18 at 12:56
  • The implementation depends on the desired output. As you can see `Excel` sets the `General` number format to each cell. If you want represent information "as is", it's good option to set number format to `text`. For example, replace line `(10)` with `(10)`. More advanced example with classes can be found [here](https://stackoverflow.com/a/4620023/6634373)! – CommonSense Jan 12 '18 at 13:08
  • Can you post the R code - does it use `htmlTable` function? – Robin Mackenzie Jan 13 '18 at 10:35
  • @RobinMackenzie Indeed, I used htmlTable::htmlTable(). The code is here: https://github.com/rasmusrhl/stuff/blob/master/generatehtml . It also contains a function to append something to each cell (commented out). It would be nice if I could append something which is invisible when viewed in HTML, but makes Excel interpret all cell contents as text. – Rasmus Larsen Jan 13 '18 at 13:43
  • 1
    @Rasmus Larsen : I have updated my answer with server side R solution. Enjoy. – S Meaden Jan 14 '18 at 10:57

9 Answers9

7

For a client side solution

So run this code after the first block of code, it rewrites the final two columns.

Sub Test2()
    '* tools references ->
    '*   Microsoft HTML Object Library


    Dim oHtml4 As MSHTML.IHTMLDocument4
    Set oHtml4 = New MSHTML.HTMLDocument

    Dim oHtml As MSHTML.HTMLDocument
    Set oHtml = Nothing

    '* IHTMLDocument4.createDocumentFromUrl
    '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
    Set oHtml = oHtml4.createDocumentFromUrl("https://rasmusrhl.github.io/stuff/", "")
    While oHtml.readyState <> "complete"
        DoEvents  '* do not comment this out it is required to break into the code if in infinite loop
    Wend
    Debug.Assert oHtml.readyState = "complete"


    Dim oTRs As MSHTML.IHTMLDOMChildrenCollection
    Set oTRs = oHtml.querySelectorAll("TR")
    Debug.Assert oTRs.Length = 17

    Dim lRowNum As Long
    For lRowNum = 3 To oTRs.Length - 1

        Dim oTRLoop As MSHTML.HTMLTableRow
        Set oTRLoop = oTRs.Item(lRowNum)
        If oTRLoop.ChildNodes.Length > 1 Then

            Debug.Assert oTRLoop.ChildNodes.Length = 14

            Dim oSecondToLastColumn As MSHTML.HTMLTableCell
            Set oSecondToLastColumn = oTRLoop.ChildNodes.Item(12)

            ActiveSheet.Cells(lRowNum + 2, 13).Value2 = "'" & oSecondToLastColumn.innerText


            Dim oLastColumn As MSHTML.HTMLTableCell
            Set oLastColumn = oTRLoop.ChildNodes.Item(13)

            ActiveSheet.Cells(lRowNum + 2, 14).Value2 = "'" & oLastColumn.innerText

        End If
        'Stop

    Next lRowNum

    ActiveSheet.Columns("M:M").EntireColumn.AutoFit
    ActiveSheet.Columns("N:N").EntireColumn.AutoFit


End Sub

For a Server Side Solution

Now that we know you control the source script and that it is in R then one can change the R script to style the final columns with mso-number-format:'\@' . Here is a sample R script that achieves this, one builds a CSS matrix of the same dimensions as the data and passes the CSS matrix as a parameter into htmlTable. I have not tampered with your R source instead I give here a simple illustration for you to interpret.

A=matrix(c("(2)","(4)","(3)","(1)","(5)","(7)"),nrow=2,ncol=3,byrow=TRUE)
css_matrix <- matrix(data="",nrow=2,ncol=3)
css_matrix[,3] <- "mso-number-format:\"\\@\""
htmlTable(x=A,css.cell=css_matrix)

Opening in Excel I get this enter image description here

Robin Mackenzie adds

you might mention in your server-side solution that OP just needs to add css_matrix[,10:11] <- "mso-number-format:\"\@\"" to their existing R code (after the last css_matrix... line) and it will implement your solution for their specific problem

Thanks Robin

S Meaden
  • 8,050
  • 3
  • 34
  • 65
  • Thanks. I am hoping for a general approach, which would make Excel not change the content of the table in the first place. Like the option `.WebDisableDateRecognition = True`, but for all the other things that excel changes, so that I do not have to check the individual columns to see if Excel changed anything, and then change the macro, but maybe this is impossible. Also I get the error `User type not defined`. – Rasmus Larsen Jan 11 '18 at 06:39
  • 1
    @Rasmus: I really do not think it is possible to do as you ask. I am happy to debug the above solution if you want. You have to go to Tools->References and check the library `Microsoft HTML Object Library` as mentioned in the comment on 2nd and 3rd line. – S Meaden Jan 11 '18 at 11:30
  • 2
    @Rasmus: anyway, a large bounty of 500 should attract attention. Good luck. – S Meaden Jan 11 '18 at 11:30
  • 2
    If you control the source document of course you can use `mso-number-format:'\@'` to force text in the style tag. In such a case, that is obviously the easiest and simplest solution. http://cosicimiento.blogspot.co.uk/2008/11/styling-excel-cells-with-mso-number.html – S Meaden Jan 13 '18 at 15:28
  • 1
    @SMeaden - you might mention in your server-side solution that OP just needs to add `css_matrix[,10:11] <- "mso-number-format:\"\\@\""` to their existing R code (after the last `css_matrix...` line) and it will implement your solution for their specific problem (agree that your generic solution works). This deserves the bounty. The OP problem column is 11 btw. – Robin Mackenzie Jan 14 '18 at 11:34
  • 1
    @RobinMackenzie : cheers, added and credited to you ;) I didn't dare tread in his R script. I only downloaded it yesterday. http://exceldevelopmentplatform.blogspot.co.uk/2018/01/r-installation-and-first-steps.html – S Meaden Jan 14 '18 at 11:43
  • 1
    @SMeaden - thanks, but you got right to the heart of it so I hope you get the bounty. You might take a look at RStudio which has a free edition. – Robin Mackenzie Jan 14 '18 at 11:51
4

This works with a temp file.

What it does: Downloads Data Locally. Then, replaces the "(" with a "\". Then, imports the data. Formats the data as text (to ensure we can change it back without error). Then, changes the text. This cannot be done with Range.Replace because that will reformat the cell contents.

' Local Variables
Public FileName As String ' Temp File Path
Public FileUrl As String ' Url Formatted Temp File Path
Public DownloadUrl As String ' Where We're Going to Download From

' Declares Have to Be At Top
Private Declare Function GetTempPath Lib "kernel32" _
  Alias "GetTempPathA" _
  (ByVal nBufferLength As Long, _
  ByVal lpBuffer As String) As Long
Private Declare Function GetTempFileName Lib "kernel32" _
  Alias "GetTempFileNameA" _
  (ByVal lpszPath As String, _
  ByVal lpPrefixString As String, _
  ByVal wUnique As Long, _
  ByVal lpTempFileName As String) As Long

' Loads the HTML Content Without Bug
Sub ImportHtml()

    ' Set Our Download URL
    DownloadUrl = "https://rasmusrhl.github.io/stuff"

    ' Sets the Temporary File Path
    SetFilePath

    ' Downloads the File
    DownloadFile

    ' Replaces the "(" in the File With "\(", We Will Later Put it Back
    ' This Ensures Formatting of Content Isn't Modified!!!
    ReplaceStringInFile


    ' Our Query Table is Now Coming From the Local File, Instead
    Dim s As QueryTable
    Set s = ActiveSheet.QueryTables.Add(Connection:=("FINDER;file://" + FileUrl), Destination:=Range("$A$1"))

    With s

        .Name = "stuff"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = False
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlEntirePage
        .WebFormatting = xlWebFormattingAll
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = True
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False

        ' Sets Formatting So When We Change Text the Data Doesn't Change
        .ResultRange.NumberFormat = "@"

        ' Loop Through Cells in Range
        ' If You Do Excel Replace, Instead It Will Change Cell Format
        Const myStr As String = "\(", myReplace As String = "("
        For Each c In .ResultRange.Cells
            Do While c.Value Like "*" & myStr & "*"
                c.Characters(InStr(1, c.Value, myStr), Len(myStr)).Text = myReplace
            Loop
        Next


    End With
End Sub

' This function replaces the "(" in the file with "\("
Sub ReplaceStringInFile()

    Dim sBuf As String
    Dim sTemp As String
    Dim iFileNum As Integer
    Dim sFileName As String

    ' Edit as needed
    sFileName = FileName

    iFileNum = FreeFile
    Open sFileName For Input As iFileNum

    Do Until EOF(iFileNum)
        Line Input #iFileNum, sBuf
        sTemp = sTemp & sBuf & vbCrLf
    Loop
    Close iFileNum

    sTemp = Replace(sTemp, "(", "\(")

    iFileNum = FreeFile
    Open sFileName For Output As iFileNum
    Print #iFileNum, sTemp
    Close iFileNum

End Sub

' This function sets file paths because we need a temp file
Function SetFilePath()

    If FileName = "" Then
        FileName = GetTempHtmlName
        FileUrl = Replace(FileName, "\", "/")
    End If

End Function

' This subroutine downloads the file from the specified URL
' The download is necessary because we will be editing the file
Sub DownloadFile()

    Dim myURL As String
    myURL = "https://rasmusrhl.github.io/stuff"

    Dim WinHttpReq As Object
    Set WinHttpReq = CreateObject("Microsoft.XMLHTTP")
    WinHttpReq.Open "GET", DownloadUrl, False, "username", "password"
    WinHttpReq.send

    myURL = WinHttpReq.responseBody
    If WinHttpReq.Status = 200 Then
        Set oStream = CreateObject("ADODB.Stream")
        oStream.Open
        oStream.Type = 1
        oStream.Write WinHttpReq.responseBody
        oStream.SaveToFile FileName, 2 ' 1 = no overwrite, 2 = overwrite
        oStream.Close
    End If

End Sub

'''''''''''''''''''''''''''''
' THIS BLOCK OF CODE GETS A TEMPORARY FILE PATH USING THE GetTempHtmlName Function
'''''''''''''''''''''''''''''


Public Function GetTempHtmlName( _
  Optional sPrefix As String = "VBA", _
  Optional sExtensao As String = "") As String
  Dim sTmpPath As String * 512
  Dim sTmpName As String * 576
  Dim nRet As Long
  Dim F As String
  nRet = GetTempPath(512, sTmpPath)
  If (nRet > 0 And nRet < 512) Then
    nRet = GetTempFileName(sTmpPath, sPrefix, 0, sTmpName)
    If nRet <> 0 Then F = Left$(sTmpName, InStr(sTmpName, vbNullChar) - 1)
    If sExtensao > "" Then
      Kill F
      If Right(F, 4) = ".tmp" Then F = Left(F, Len(F) - 4)
      F = F & sExtensao
    End If
    F = Replace(F, ".tmp", ".html")
    GetTempHtmlName = F
  End If
End Function

'''''''''''''''''''''''''''''
' End - GetTempHtmlName
'''''''''''''''''''''''''''''
Ctznkane525
  • 7,297
  • 3
  • 16
  • 40
  • i just had to change the order. I had reformatted, but VBA requires variables, then pinvokes then function/subs, so its changed the way I have it in Excel. – Ctznkane525 Jan 12 '18 at 01:29
  • Thanks, wow, I did not imagine it would be this complicated :) Anyways, i get `Invalid procedure call or argument`, when i click `debug` it points to the line: `WinHttpReq.Open "GET", DownloadUrl, False, "username", "password"`. I don't know anything about VBA, what am I doing wrong? – Rasmus Larsen Jan 12 '18 at 06:37
  • it works....you just have to make sure you start the importhtml macro...it'll call the rest...if you do one of the other macros independently...youll get that error – Ctznkane525 Jan 12 '18 at 09:18
  • also...you cannot go into the VBA and just hit play since there's more than one subroutine...you have to explicitly call the importhtml one – Ctznkane525 Jan 12 '18 at 09:28
  • @rasmuslarsen the code works...any further questions beyond my comments let me know – Ctznkane525 Jan 13 '18 at 01:57
4

You may give this a try to see if you get the desired output...

Sub GetWebData()
Dim IE As Object
Dim doc As Object
Dim TRs As Object
Dim TR As Object
Dim Cell As Object
Dim r As Long, c As Long

Application.ScreenUpdating = False

Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = False
IE.navigate "https://rasmusrhl.github.io/stuff/"
Do While IE.Busy Or IE.readyState <> 4
    DoEvents
Loop
Set doc = IE.document

Set TRs = doc.getElementsByTagName("tr")
Cells.Clear

For Each TR In TRs
    r = r + 1
    For Each Cell In TR.Children
        c = c + 1
        Cells(r, c).NumberFormat = "@"
        Cells(r, c) = Cell.innerText
    Next Cell
    c = 0
Next TR
IE.Quit
Columns.AutoFit
Application.ScreenUpdating = True
End Sub

Solution 2:

To make it work, you need to add the following two references by going to Tools (on VBA Editor) --> References and then find the below mentioned two references and check the checkboxes for them and click OK.

1) Microsoft XML, v6.0 (find the max version available)

2) Microsoft HTML Object Library

Sub GetWebData2()
Dim XMLpage As New MSXML2.XMLHTTP60
Dim doc As New MSHTML.HTMLDocument
Dim TRs As IHTMLElementCollection
Dim TR As IHTMLElement
Dim Cell As IHTMLElement
Dim r As Long, c As Long

Application.ScreenUpdating = False

Set XMLpage = CreateObject("MSXML2.XMLHTTP")

XMLpage.Open "GET", "https://rasmusrhl.github.io/stuff/", False
XMLpage.send
doc.body.innerhtml = XMLpage.responsetext
Set TRs = doc.getElementsByTagName("tr")
Set TRs = doc.getElementsByTagName("tr")
Cells.Clear

For Each TR In TRs
    r = r + 1
    For Each Cell In TR.Children
        c = c + 1
        Cells(r, c).NumberFormat = "@"
        Cells(r, c) = Cell.innerText
    Next Cell
    c = 0
Next TR
Columns.AutoFit
Application.ScreenUpdating = True
End Sub
Subodh Tiwari sktneer
  • 9,906
  • 2
  • 18
  • 22
4

To get the tabular data (keeping the format as it is) from that page, you can try like below:

 Sub Fetch_Data()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim posts As Object, post As Object, elem As Object
    Dim row As Long, col As Long

    With http
        .Open "GET", "https://rasmusrhl.github.io/stuff/", False
        .send
        html.body.innerHTML = .responseText
    End With

    Set posts = html.getElementsByClassName("gmisc_table")(0)

    For Each post In posts.Rows
        For Each elem In post.Cells
            col = col + 1: Cells(row + 1, col).NumberFormat = "@": Cells(row + 1, col) = elem.innerText
        Next elem
        col = 0
        row = row + 1
    Next post
End Sub

Reference to add to the library:

1. Microsoft HTML Object Library
2. Microsoft XML, v6.0  'or whatever version you have

This is how that portion looks like when get parsed. enter image description here

SIM
  • 21,997
  • 5
  • 37
  • 109
  • Thanks. I am looking for a solution which looks as much as the HTML as possible. Including nested headers, italics and indentations etc. – Rasmus Larsen Jan 13 '18 at 14:00
4
<style type=text/css>
    td {mso-number-format: '\@';}
</style>
<table ...

Putting the above global style definition for the cells (<td>s) on the output you generate using R or rewriting the document on the client side like below just works.

Sub importhtml()
    '*********** HTML document rewrite process ***************
    Const TableUrl = "https://rasmusrhl.github.io/stuff"

    Const adTypeBinary = 1, adSaveCreateOverWrite = 2, TemporaryFolder = 2
    Dim tempFilePath, binData() As Byte

    With CreateObject("Scripting.FileSystemObject")
        tempFilePath = .BuildPath(.GetSpecialFolder(TemporaryFolder), .GetTempName() & ".html")
    End With

    'download HTML document
    With CreateObject("MSXML2.ServerXMLHTTP")
        .Open "GET", TableUrl, False
        .Send
        If .Status <> 200 Then Err.Raise 3, "importhtml", "200 expected"
        binData = .ResponseBody
    End With

    With CreateObject("Adodb.Stream")
        .Charset = "x-ansi"
        .Open
        .WriteText "<style type=text/css>td {mso-number-format:'\@';}</style>"
        .Position = 0 'move to start
        .Type = adTypeBinary 'change stream type
        .Position = .Size 'move to end
        .Write binData 'append binary data end of stream
        .SaveToFile tempFilePath, adSaveCreateOverWrite 'save temporary file
        .Close
    End With
    '*********** HTML document rewrite process ***************

    With ActiveSheet.QueryTables.Add(Connection:= _
        "URL;" & tempFilePath, Destination:=Range("$A$1"))
        'load HTML document from rewritten local copy

        .Name = "stuff"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = False
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlEntirePage
        .WebFormatting = xlWebFormattingAll
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = True
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False

    End With

    Kill tempFilePath
End Sub
Kul-Tigin
  • 16,728
  • 1
  • 35
  • 64
2

Try this, to import the data as a table:

Sub ImportDataAsTable()
    ActiveWorkbook.Queries.Add Name:="Table 0", Formula:= _
        "let" & Chr(13) & "" & Chr(10) & "    Source = Web.Page(Web.Contents(""https://rasmusrhl.github.io/stuff/""))," & Chr(13) & "" & Chr(10) & "    Data0 = Source{0}[Data]," & Chr(13) & "" & Chr(10) & "    #""Changed Type"" = Table.TransformColumnTypes(Data0,{{""tailnum"", type text}, {"""", type text}, {""Some text goes here. It is long and does not break Machine type (make) year"", type text}, {""Some text goes here. It is long and does not break Mach" & _
        "ine type (make) type"", type text}, {""Some text goes here. It is long and does not break Machine type (make) manufacturer"", type text}, {""Some text goes here. It is long and does not break"", type text}, {""Some text goes here. It is long and does not break Specification of machine model"", type text}, {""Some text goes here. It is long and does not break Specifi" & _
        "cation of machine engines"", type text}, {""Some text goes here. It is long and does not break Specification of machine seats"", type text}, {""Some text goes here. It is long and does not break Specification of machine speed"", type text}, {""Some text goes here. It is long and does not break Specification of machine engine"", type text}, {""2"", type text}, {""Oth" & _
        "er text goes here Other variables s1"", type text}, {""Other text goes here Other variables s2"", type text}})" & Chr(13) & "" & Chr(10) & "in" & Chr(13) & "" & Chr(10) & "    #""Changed Type"""
    ActiveWorkbook.Worksheets.Add
    With ActiveSheet.ListObjects.Add(SourceType:=0, Source:= _
        "OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=""Table 0"";Extended Properties=""""" _
        , Destination:=Range("$A$1")).QueryTable
        .CommandType = xlCmdSql
        .CommandText = Array("SELECT * FROM [Table 0]")
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .PreserveColumnInfo = True
        .ListObject.DisplayName = "Table_0"
        .Refresh BackgroundQuery:=False
    End With
End Sub
ashleedawg
  • 20,365
  • 9
  • 72
  • 105
  • 1
    Wow, `Microsoft.Mashup.OleDb.1` , I've never seen that before. Some links or explanation would be good. https://www.mrexcel.com/forum/excel-questions/988066-change-data-source-oledbconnection.html It's nice to see a large bounty flush out great new ideas. – S Meaden Jan 12 '18 at 09:22
  • 1
    Thanks. Running this gets the format of the individual cells right. But the headers, subheaders, row-groups, and indentations are missing :(. My goal is to make table look as much as the HTML-table as possible. My original code comes close to this, but some of the cell content is formatted wrong :( – Rasmus Larsen Jan 12 '18 at 09:51
  • 1
    @ashleedawg: so this is Power Query right? https://support.office.com/en-us/article/Introduction-to-Microsoft-Power-Query-for-Excel-6e92e2f4-2079-4e1f-bad5-89f6269cd605 – S Meaden Jan 12 '18 at 09:53
  • 1
    This needs Office 2016 to compile I think. https://stackoverflow.com/questions/48224836/vba-activeworkbook-queries-does-not-compile/48225249#48225249 – S Meaden Jan 12 '18 at 11:31
2

Processing the HTML and then Copying and Pasting it into Excel will

Here are the steps I used:

  • CreateObject("MSXML2.XMLHTTP"): get the URL's responseText
  • CreateObject("HTMLFile"): create a HTML Document from responseText
  • Replace grey with black to darken the borders
  • Prefix columns s1 and s2 with @ to preserve formatting
  • Copy the HTML to the Windows Clipboard
    • Note: The HTML need to enclosed in HTML and Body tags to paste properly
  • Setup the destination Worksheet
  • Paste the HTML into the Worksheet
  • Replace the @ sign with '
    • Note: This preserves the formatting by storing the data as text
  • Finish formatting the Worksheet

enter image description here


Sub LoadTable()
    Const URL = "https://rasmusrhl.github.io/stuff/"
    Dim x As Long
    Dim doc As Object, tbl As Object, rw As Object
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .send
        If .readyState = 4 And .Status = 200 Then
            Set doc = CreateObject("HTMLFile")
            doc.body.innerHTML = .responseText
            doc.body.innerHTML = Replace(doc.body.innerHTML, "grey", "black")
            Set tbl = doc.getElementsByTagName("TABLE")(0)

            For x = 0 To tbl.Rows.Length - 1
                Set rw = tbl.Rows(x)

                If rw.Cells.Length = 14 Then
                    'If InStr(rw.Cells(12).innerText, "-") Or InStr(rw.Cells(12).innerText, "/") Then
                    rw.Cells(12).innerText = "@" & rw.Cells(12).innerText
                    rw.Cells(13).innerText = "@" & rw.Cells(13).innerText
                End If
            Next

            With CreateObject("new:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
                .SetText "<html><body>" & doc.body.innerHTML & "</body></html>"
                .PutInClipboard
            End With

            With Worksheets("Sheet1")
                .Cells.Clear
                .Range("A1").PasteSpecial
                .Cells.Interior.Color = vbWhite
                .Cells.WrapText = False
                .Columns.AutoFit
                .Columns("M:N").Replace What:="@", Replacement:="'"
            End With

        Else
            MsgBox "URL:  " & vbCrLf & "Ready state: " & .readyState & vbCrLf & "HTTP request status: " & .Status, vbInformation, "URL Not Responding"
        End If
    End With
End Sub
  • Hi, `CreateObject("new:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")` looks to be the same as `New MSForms.DataObject` can I ask why you do this? Is it to save a Tools->Reference ? I saw it the other day for a web browser and wondered at it. – S Meaden Jan 16 '18 at 17:18
  • 1
    That is a code snippet that I have in my personal macro workbook. I use it whenever I need to use the clipboard without bothering referencing the MSForms library. I wrote the original version using early-binding snippets for working with the HTML but converted it to late binding so the OP would only have to fix the Worksheet reference before running the code. –  Jan 16 '18 at 18:22
  • ok, ta. I really like all manner of syntaxes that get squeezed through `GetObject` and `CreateObject`. Thanks. – S Meaden Jan 16 '18 at 18:41
1

Based on the documentation from Microsoft MSDN Library: WebFormatting Property you could try the below change to your code:

 .WebFormatting = xlWebFormattingNone

This may allow the data to be copied without any number formatting - then you can set your own number format for those cells (using MSDN: Excel VBA NumberFormat property )

A similar solution should solve the issue with numbers being truncated or rounding - set the decimal points for the affected cells in your target range...

Our Man in Bananas
  • 5,809
  • 21
  • 91
  • 148
  • Thank you! Unfortunately it does not work. Setting .WebFormatting = xlWebFormattingNone still changes the numbers in parentheses to negative. – Rasmus Larsen Jan 08 '18 at 17:30
  • @RasmusLarsen: check the number format of the cells which show as negative... tr changing to text – Our Man in Bananas Jan 08 '18 at 21:26
  • Thanks, but does not seem to work :( After running the macro, the number format of all the cells are "General". If I change to "Text", the displayed values and actual values still are for example: "-98". – Rasmus Larsen Jan 09 '18 at 06:34
  • @RasmusLarsen: you may find more help at [get data from websites using Excel](https://www.wiseowl.co.uk/blog/s393/query-table.htm) – Our Man in Bananas Jan 09 '18 at 14:35
-1

With the url https://rasmusrhl.github.io/stuff, it's by luck that Excel can simply just open it directly and save as .xlsx (how come no one try this before the tedious process). If direct open fails, all other methods here are great option!

Option Explicit

Sub OpenWebFile()
    Const URL As String = "https://rasmusrhl.github.io/stuff"
    Dim oWB As Workbook
    On Error Resume Next
    Set oWB = Workbooks.Open(Filename:=URL, ReadOnly:=True)
    If oWB Is Nothing Then
        MsgBox "Cannot open the url " & URL, vbExclamation + vbOKOnly, "ERR " & Err.Number & ":" & Err.Description
        Err.Clear
    Else
        ' Change to your desired path and filename
        oWB.SaveAs Filename:="C:\Test\stuff.xlsx", FileFormat:=xlOpenXMLWorkbook
        Set oWB = Nothing
    End If
End Sub
PatricK
  • 6,375
  • 1
  • 21
  • 25
  • Thanks. The problem with this solution is that: Numbers in parentheses are converted to negative numbers, and numbers are rounded or truncated. – Rasmus Larsen Jan 16 '18 at 06:30