
What am I trying to do?

I have a list of URLs that I want to scrape using Excel's Web Query functionality. I'm trying to completely automate the process, so I'm developing an SSIS package that calls a Script Task for each URL. The Script Task creates a new Excel workbook with a single worksheet, activates the worksheet, adds a QueryTable connection, and refreshes the QueryTable (using XlWebSelectionType.xlAllTables) to get the data. It then saves the workbook and closes both the workbook and the Excel application.

What technologies am I utilizing?

  • VS 2015 (Enterprise)
  • SQL Server 2016
  • Microsoft Excel 16.0 Object library
  • Excel local install from Office 365 ProPlus

What's the problem?

While the script task does save all the data from the tables on the web page, it puts them all into a single worksheet and does not save the table names. So while my data is correctly grouped in the worksheet, I have no way of knowing which "group" of data corresponds to which table.

What do I want to do about it?

Ideally, I would want each QueryTable to be saved into its own worksheet, with the table name set as the worksheet name. Barring that, I need a way to save the table name with the corresponding data; adding it as a new column in the QueryTable would be best in this scenario.

What do I have so far?

Here's the main part of the script:

Public Sub Main()
    ' Read the target URL and output file name from the SSIS package variables
    Dim URL As String = Dts.Variables("User::URL").Value.ToString()
    Dim FileName As String = Dts.Variables("User::FileName").Value.ToString()
    Dim xlNone As XlWebFormatting = XlWebFormatting.xlWebFormattingNone
    Dim Format As XlFileFormat = XlFileFormat.xlCSVWindows
    Dim ScrapeStatus As Integer = 1

    ' Start a new Excel instance (alerts suppressed, one sheet per new workbook)
    Dim excel As New Microsoft.Office.Interop.Excel.ApplicationClass

    With excel
        .SheetsInNewWorkbook = 1
        .DisplayAlerts = False
    End With

    Dim wb As Microsoft.Office.Interop.Excel.Workbook = excel.Workbooks.Add()

    With wb
        .Activate()
        .Worksheets.Select(1)
    End With

    Try

        ' Add a web QueryTable to the active sheet and pull every table on the page
        Dim rnStart As Range = wb.ActiveSheet.Range("A1:Z100")
        Dim qtQtrResults As QueryTable = wb.ActiveSheet.QueryTables.Add(Connection:="URL;" + URL, Destination:=rnStart)

        With qtQtrResults
            .BackgroundQuery = False
            .WebFormatting = xlNone
            .WebSelectionType = XlWebSelectionType.xlAllTables
            .Refresh()
        End With

        ' Wait for any asynchronous queries to finish, then save the workbook
        excel.CalculateUntilAsyncQueriesDone()
        wb.SaveAs(FileName)

        ' Close Excel and release the COM objects so the process doesn't linger
        wb.Close()
        excel.Quit()
        System.Runtime.InteropServices.Marshal.ReleaseComObject(excel)
        GC.Collect()
        GC.WaitForPendingFinalizers()
        Dts.TaskResult = ScriptResults.Success

    Catch ex As Exception

        ' Report the error back to the package, then close without prompting to save
        Dts.Variables("User::Error").Value = ex.Message.ToString()
        wb.Saved = True
        wb.Close()
        excel.Quit()
        System.Runtime.InteropServices.Marshal.ReleaseComObject(excel)
        GC.Collect()
        GC.WaitForPendingFinalizers()
        Dts.TaskResult = ScriptResults.Failure

    End Try

End Sub

What results am I getting?

For the URL http://athletics.chabotcollege.edu/information/directory/home#directory, if I use the Web Query functionality from inside Excel, I get the following to select from:

[screenshot: the Web Query selection dialog, with all the table names displayed]

However, when I pull all tables via the Script Task, I end up with a worksheet that looks similar to this:

[screenshot: a single worksheet containing all of the tables' data, with no table names]

Other info

I should also note that while most of the web pages have a similar structure, not all are the same. So I can't assume every page will have the same table names, or structure the tables in the same way. My solution needs to be dynamic and flexible.

digital.aaron

2 Answers


By changing .WebSelectionType = XlWebSelectionType.xlAllTables to .WebSelectionType = XlWebSelectionType.xlEntirePage I'm able to capture the "names" of the tables. They are actually aria-title values inside the parent <section> tag of each table. It's ugly, but it does return the strings I'm looking for.
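
For reference, here is a minimal sketch of the changed block; everything except the selection type is taken straight from the question's code:

    With qtQtrResults
        .BackgroundQuery = False
        .WebFormatting = xlNone
        ' xlEntirePage pulls the section titles that precede each table, not just the table data
        .WebSelectionType = XlWebSelectionType.xlEntirePage
        .Refresh()
    End With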

I ended up saving both the xlAllTables and xlEntirePage QueryTables as text files. Then I split the xlAllTables file into separate chunks, one per table, searched the xlEntirePage text file for the string that represents each table, and copied the preceding line, which holds the title. I then saved each table's text as a new file with the copied title as the filename. It's very hacky, but it did what I needed it to do.
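
The post-processing step looks roughly like the sketch below (not my exact code). It assumes the xlAllTables export separates tables with blank lines and that each chunk's first line also appears verbatim in the xlEntirePage export; the paths and method name are hypothetical.

    ' Rough sketch only. Assumes blank-line-separated tables in the xlAllTables file and
    ' that each table's first line also appears verbatim in the xlEntirePage file.
    Sub SplitAndNameTables(allTablesPath As String, entirePagePath As String, outputFolder As String)
        Dim pageLines As String() = System.IO.File.ReadAllLines(entirePagePath)
        Dim chunks As String() = System.IO.File.ReadAllText(allTablesPath).Split(
            {Environment.NewLine & Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)

        For Each chunk As String In chunks
            ' The line preceding this table's first line in the full-page export holds the section title
            Dim firstLine As String = chunk.Split({Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)(0).Trim()
            Dim idx As Integer = Array.FindIndex(pageLines, Function(l) l.Trim() = firstLine)
            Dim title As String = If(idx > 0, pageLines(idx - 1).Trim(), "UnknownTable")

            ' Make the title safe to use as a filename
            For Each c As Char In System.IO.Path.GetInvalidFileNameChars()
                title = title.Replace(c, "_"c)
            Next

            System.IO.File.WriteAllText(System.IO.Path.Combine(outputFolder, title & ".txt"), chunk)
        Next
    End Sub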

digital.aaron

I don't think you can get the table names via web queries. If you check the web page source, you'll notice that the tables don't have a name attribute. The names that Excel shows in the interface are not taken from the tables themselves; they are the titles of the sections (the parent tags of the tables), so they aren't treated as table names.

Also, after checking the QueryTable documentation, there is no option to retrieve the table names or the title of the table's container, so Excel is not necessarily using web queries to show the tables and their headers in the interface (as shown in the screenshots).

I think one way to split the data over worksheets (without table names) is:

  1. Use a regular expression to count the <table></table> tags in the web page source
  2. Create a worksheet for each table
  3. Create a QueryTable for each table
  4. In each QueryTable, set the destination worksheet range and the following properties:

    .WebSelectionType = XlWebSelectionType.xlSpecifiedTables
    .WebTables = i 'Where i is the index of Table
    

Maybe you should use an HTML parser and regular expressions to collect the table metadata.
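
A minimal sketch of that idea, reusing wb and URL from the question's code and assuming the same Excel interop imports (the WebClient download and the regex are assumptions, just one way to count the tables):

    ' Sketch only: count the <table> tags, then load each one into its own worksheet.
    Dim client As New System.Net.WebClient()
    Dim html As String = client.DownloadString(URL)
    Dim tableCount As Integer = System.Text.RegularExpressions.Regex.Matches(
        html, "<table[\s>]", System.Text.RegularExpressions.RegexOptions.IgnoreCase).Count

    For i As Integer = 1 To tableCount
        Dim ws As Worksheet = CType(wb.Worksheets.Add(After:=wb.Worksheets(wb.Worksheets.Count)), Worksheet)
        ws.Name = "Table" & i.ToString()

        Dim qt As QueryTable = ws.QueryTables.Add(Connection:="URL;" + URL, Destination:=ws.Range("A1"))
        With qt
            .BackgroundQuery = False
            .WebFormatting = XlWebFormatting.xlWebFormattingNone
            ' WebTables takes a comma-delimited string of table names or 1-based indexes
            .WebSelectionType = XlWebSelectionType.xlSpecifiedTables
            .WebTables = i.ToString()
            .Refresh()
        End With
    Next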

Hadi
  • After digging a bit deeper, I can confirm that what I'm seeing in Excel is actually driven by Power Query, and not just a simple web query. I'm now looking at if/how I can harness Power Query in my script task. – digital.aaron Jan 09 '19 at 13:45
  • @digital.aaron Tonight I will also work on it that way. Hope that the issue will be solved. – Hadi Jan 09 '19 at 13:51
  • I found a super-hacky solution that works for now, but I'm not happy with it. I think figuring out how to utilize Power Query in the script is the way to go. – digital.aaron Jan 09 '19 at 13:57
  • @digital.aaron I tried to figure out a solution with no luck. If you need to retrieve them using regular expressions, maybe I can help. And please notify me if you find a feasible solution. Good luck. – Hadi Jan 09 '19 at 22:43
  • I'm even worse at RegEx than I am with VB, so any assistance you could provide there would be very helpful. – digital.aaron Jan 09 '19 at 23:07
  • The class I'm looking for is the `WorkbookQueries` class. From everything I've read, this is a VBA-only class. It supposedly became native to Excel as of 2016, but it's not in the Excel 16 library I'm referencing in my Script Task. I'm now going deeper down the rabbit hole by investigating whether I can create the module in VBA and add it to my Excel object. https://support.microsoft.com/en-us/help/219905/how-to-dynamically-add-and-run-a-vba-macro-from-visual-basic – digital.aaron Jan 10 '19 at 03:02