
I'm trying to read the tables from a [web page][1] into pandas DataFrames. `pandas.read_html` returns a list of empty tables because the tables in the HTML are indeed empty; they're probably populated dynamically.
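To illustrate what happens: a static snapshot of such a page contains the table shell but not its rows, so `read_html` parses an empty frame (a minimal, self-contained sketch; the markup below is invented, not from the page):

```python
from io import StringIO

import pandas as pd

# Static snapshot of a JS-populated page: the <table> shell is present,
# but its rows would be inserted client-side, so no data rows are parsed.
html = """
<table>
  <thead><tr><th>SKU</th><th>Description</th></tr></thead>
  <tbody></tbody>
</table>
"""

dfs = pd.read_html(StringIO(html))
print(dfs[0])   # the column headers parse, but the frame has no rows
```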

Someone suggested that the data source could be an [XML][3] that looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Items xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Item>
        <ID>Content ID
Unique identifier for use of tool</ID>
        <Type>Product Type
1. sample name
2. sample name
3. sample name
4. sample name
5. sample name
6. sample name
7. Accessories</Type>
        <Name>Sub Category
Name of checkbox item and announcement subhead</Name>
        <PubDate>Published Date
Text type - not Date

Spell out month name completely</PubDate>
        <Desc>Description
Enter a full description</Desc>
        <Notes>Special Notes
To appear under recommendation table</Notes>
        <Image>Product Image
Enter entire URL path to image or provide the  image through email.
</Image>
        <LinkA>Announcement URL
Enter file name, no spaces</LinkA>
        <LinkB>Product URL
Enter full URL</LinkB>
        <TableA>Product SKU
Enter product number being discontinued</TableA>
        <TableB>Product Description
Enter product description for original product</TableB>
        <TableC>Replacement Product SKU
Enter product number to replace discontinued product

Only use this column when multiple country skus are not needed</TableC>
        <TableD>Replacement Product Description
Enter product description for replacement product

Only use this column when multiple country skus are not needed</TableD>
        <TableE>Custom Header 1

Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableE>
        <TableF>Custom Header 2

Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableF>
        <TableG>Custom Header 3

Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableG>
        <TableH>Custom Header 4

Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableH>
        <TableI>Custom Header 5

Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableI>
        <TableJ>Custom Header 6

Use these custom headers to build a dynamic table. Used primarily for multiple skus per country</TableJ>
    </Item>
    <Item>
        <ID>1</ID>
        <Type>1</Type>
        <Name>xx sample namexx</Name>
        <PubDate>June 1, 2011</PubDate>
        <Desc>xx Sample Description xx.</Desc>
        <Image>a3100-24.png</Image>
        <LinkA>HP-A3100SI-ES-Announcement.pdf</LinkA>
    </Item>
    <Item>
        <TableA>JD298A</TableA>
        <TableB>xx Sample Table Name xx</TableB>
        <TableC>N/A</TableC>
    </Item>
    <!-- other Item nodes -->
</Items>

How can I convert this XML to DataFrames?

  • `read_html` in a nutshell looks for `<table>` tags in the HTML source. In this case, if you view the page source, the table rows are not contained in the public HTML code.
    – Brad Solomon May 07 '18 at 15:04
  • @Victor, that's because your browser executed some JavaScript code which inserted the `<table>` into the HTML document. – Cristian Ciupitu May 07 '18 at 15:07
  • The real data source seems to be the XML from http://h17007.www1.hpe.com/data/xml/eos/eos.xml – Cristian Ciupitu May 07 '18 at 15:12
  • related to your recent update: now you have the raw data (XML), you just have to parse it into a Python structure (dict) instead of trying to find any kind of `<table>`. – Arount May 07 '18 at 15:19
  • @Victor, you should include a snippet of the XML so that the question keeps its value even if the link to the data becomes broken. – Cristian Ciupitu May 07 '18 at 15:22
  • What is the purpose of this scraping, why only the tables from that page? (I'm trying to avoid [the XY problem](https://meta.stackexchange.com/q/66377/135744)) – Cristian Ciupitu May 07 '18 at 15:31
  • @CristianCiupitu I have been practising importing, and other random pages did import, but this one seems interesting! So I am trying to understand how to import from this page –  May 07 '18 at 15:32
  • I have already told you: parse the XML. Anyway here's another idea: get the page with a headless browser which supports JavaScript, e.g. [Ghost.py](https://github.com/jeanphix/Ghost.py) or [PhantomJS+Python](https://stackoverflow.com/q/13287490/12892), wait a couple of seconds for the scripts to run, get the current contents of the page, then parse it with [`read_html`](https://pandas.pydata.org/pandas-docs/stable/io.html#io-read-html). – Cristian Ciupitu May 07 '18 at 16:28
  • @CristianCiupitu how did you get XML data ? –  May 08 '18 at 07:28
  • @Arount how can I get the xml from the link ? :) –  May 08 '18 at 08:25
  • @Victor: I used the [Firefox Developer Tools Network Monitor](https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor) and examined the URLs that could have data: JS(ON), XML, etc. On an unrelated note, you should edit your question and add more details, like at least a snippet of the XML. – Cristian Ciupitu May 09 '18 at 01:17
  • @CristianCiupitu Please feel free to edit the question as you see appropriate, I will accept your review :) –  May 09 '18 at 07:54
  • @CristianCiupitu Besides, if you can get the XML, why would you need the snippet anyway? If the data changes, then it should still work. It is the URL that is used as an input, not the XML snippet. Or have I misunderstood? –  May 09 '18 at 07:55
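The comments' "parse it into a Python structure (dict)" suggestion can be sketched with the standard library alone (a minimal sketch on a trimmed copy of the question's sample; `xml.etree.ElementTree` is my choice here, not a tool named by the commenters):

```python
import xml.etree.ElementTree as ET

# Trimmed version of the sample XML from the question
sample = """<Items>
    <Item>
        <ID>1</ID>
        <Type>1</Type>
        <Name>xx sample namexx</Name>
        <PubDate>June 1, 2011</PubDate>
    </Item>
    <Item>
        <TableA>JD298A</TableA>
        <TableB>xx Sample Table Name xx</TableB>
        <TableC>N/A</TableC>
    </Item>
</Items>"""

root = ET.fromstring(sample)
# One dict per <Item>, keyed by child tag name; tags missing from an
# Item simply do not appear, mirroring the sparse structure of the feed
records = [{child.tag: (child.text or "").strip() for child in item}
           for item in root.iter("Item")]
print(records)
```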

1 Answer

Anytime one works with complex XML and needs a simpler structure, such as a flattened two-dimensional (row-by-column) dataframe, one should consider XSLT, the special-purpose language designed to transform XML files into other XML, HTML, and, as shown below, even text files. Python's lxml can run XSLT 1.0 scripts.

The XSLT below produces a pipe-delimited text file with named columns, which is then imported into Pandas. The challenge with this XML is that the nodes related to an ID-identified Item are not its children but its siblings. As a result, a special key had to be set up for Muenchian Grouping, a strategy borrowed from @Tomalak's answer on XSLT grouping of siblings.

XSLT (save as an .xsl file, a special kind of .xml file, to be loaded by the script below)

Because empty cells must be mapped, the script enumerates every possible column, hence its length.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <xsl:output method="text" indent="yes" omit-xml-declaration="yes"/>

   <xsl:key name="item_key" match="Item" use="generate-id(preceding-sibling::Item[count(ID) &gt; 0][1])"/>

   <xsl:template match ="/Items">  
        <!-- COLUMN HEADERS -->
        <xsl:text>ID|Type|Name|PubDate|Desc|Notes|Image|LinkA|LinkB|TableA|TableB|TableC|TableD|TableE|TableF|TableG|TableH|TableI|TableJ&#xa;</xsl:text>
        <xsl:apply-templates select="Item[count(ID) > 0 and not(contains(ID, 'Content'))]"/>      
   </xsl:template>

   <xsl:template match ="Item">   
    <!-- INDICATORS TO REPEAT ACROSS RELATED ROWS -->
    <xsl:variable name="ID" select="normalize-space(ID)"/>
    <xsl:variable name="Type" select="normalize-space(Type)"/>
    <xsl:variable name="Name" select="normalize-space(Name)"/>
    <xsl:variable name="PubDate" select="normalize-space(PubDate)"/>
    <xsl:variable name="Desc" select="normalize-space(Desc)"/>
    <xsl:variable name="Notes" select="normalize-space(Notes)"/>
    <xsl:variable name="Image" select="normalize-space(Image)"/>
    <xsl:variable name="LinkA" select="normalize-space(LinkA)"/>
    <xsl:variable name="LinkB" select="normalize-space(LinkB)"/>

        <!-- ITEM ID NODES -->
        <xsl:value-of select="$ID"/><xsl:text>|</xsl:text>
        <xsl:value-of select="$Type"/><xsl:text>|</xsl:text>
        <xsl:value-of select="$Name"/><xsl:text>|</xsl:text>
        <xsl:value-of select="$PubDate"/><xsl:text>|</xsl:text>
        <xsl:value-of select="$Desc"/><xsl:text>|</xsl:text>
        <xsl:value-of select="$Notes"/><xsl:text>|</xsl:text>
        <xsl:value-of select="$Image"/><xsl:text>|</xsl:text>
        <xsl:value-of select="$LinkA"/><xsl:text>|</xsl:text>
        <xsl:value-of select="$LinkB"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableA)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableB)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableC)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableD)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableE)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableF)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableG)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableH)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableI)"/><xsl:text>|</xsl:text>
        <xsl:value-of select="normalize-space(TableJ)"/><xsl:text>|</xsl:text>
        <xsl:text>&#xa;</xsl:text>             <!-- LINE BREAK -->      

        <!-- ALL RELATED NODES TO ITEM ID -->
        <xsl:for-each select="key('item_key', generate-id())[position() != last()]" >
            <xsl:value-of select="$ID"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Type"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Name"/><xsl:text>|</xsl:text>    
            <xsl:value-of select="$PubDate"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Desc"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Notes"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$Image"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$LinkA"/><xsl:text>|</xsl:text>
            <xsl:value-of select="$LinkB"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableA)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableB)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableC)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableD)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableE)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableF)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableG)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableH)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableI)"/><xsl:text>|</xsl:text>
            <xsl:value-of select="normalize-space(TableJ)"/><xsl:text>|</xsl:text>
            <xsl:text>&#xa;</xsl:text>             <!-- LINE BREAK -->      
        </xsl:for-each>

   </xsl:template>

</xsl:stylesheet>

Python (runs XSLT, saves CSV, imports into Pandas)

import pandas as pd
from lxml import etree

url = "http://h17007.www1.hpe.com/data/xml/eos/eos.xml?a=0.9317168944148095.xml"

# LOAD XML AND XSL
xml = etree.parse(url)
xsl = etree.parse("XSLT_Script.xsl")

# TRANSFORM SOURCE
transformer = etree.XSLT(xsl)    
result = transformer(xml)

# SAVE PIPE-DELIMITED FILE (str() serializes the XSLT text output)
with open("Output.txt", "w") as f:
    f.write(str(result))

# IMPORT PIPE-DELIMITED FILE
hp_df = pd.read_csv("Output.txt", sep="|", index_col=False)

# ALTERNATIVE: IMPORT DIRECTLY (BYPASS .TXT SAVE)
from io import StringIO

hp_df = pd.read_csv(StringIO(str(result)), sep="|", index_col=False)
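As a side note, newer pandas (1.3+, which was not available when this answer was written) ships `pandas.read_xml`, which can flatten each `<Item>` into a row directly; tags absent from an Item come through as NaN, so the sibling rows would still need the grouping logic that the XSLT handles (a minimal sketch on a trimmed sample):

```python
from io import StringIO

import pandas as pd

# Trimmed Item rows; tag names follow the question's sample XML
sample = """<Items>
    <Item><ID>1</ID><Name>xx sample namexx</Name></Item>
    <Item><TableA>JD298A</TableA><TableC>N/A</TableC></Item>
</Items>"""

# Each <Item> becomes a row; the column set is the union of child tags,
# and tags missing from a given Item come through as NaN
df = pd.read_xml(StringIO(sample))
print(df)
```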

Output (before Pandas import)

ID|Type|Name|PubDate|Desc|Notes|Image|LinkA|LinkB|TableA|TableB|TableC|TableD|TableE|TableF|TableG|TableH|TableI|TableJ
1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||||||||||||
1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD298A|HP 1 Port Gig-T 3100 SI Module|N/A||||||||
1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD299A|HP 1 Port Gig-LX SC 3100 SI Module|N/A||||||||
1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD300A|HP 1 Port Gig-SX SC 3100 SI Module|N/A||||||||
1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD301A|HP 1-Port 10/100Base-T POE 3100 SI Module|N/A||||||||
1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD308A|HP A3100-16 SI Switch with 2 Module Slots|JD305A|HP A3100-16 SI Switch|||||||
1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JD309A|HP A3100-24 SI Switch with 2 Slots|JD306A|HP A3100-24 SI Switch|||||||
1|1|Select HP A3100 series switches|June 1, 2011|The HP A3100 SI series switches indicated below have served H3C and HP Customers with Fast Ethernet for many years. Due to market factors, we are announcing the end of sale of the devices effective Jul 1, 2011.||a3100-24.png|HP-A3100SI-ES-Announcement.pdf||JF444A|3100 series module|N/A||||||||
2|2|HP V10ag Wireless Access Point (NA only)|July 26, 2010|The HP V10ag Wireless Access Point has provided secure, reliable 802.11a and 802.11b/g wireless connectivity for small business networks since 2007. Due to the availability of the next generation 802.11n technology and the introduction of the HP V-M200 802.11n Access Point, HP networking is announcing the End of Sale of the HP V10ag Wireless Access Point (J9140A). For specific product rollover details see the announcement.||WAP10ag-1.png|10agAnnouncement-AM-only.pdf|http://h10010.www1.hp.com/wwpc/us/en/sm/WF05a/12883-12883-1137927-3836040-4172284-3637595.html?jumpid=reg_R1002_USEN|||||||||||
3|2|HP ProCurve Mobility Access Point Series - M110|September 2, 2009|<b>MAC Address Schema Change:</b> We are finalizing the integration of Colubris (previous acquisition) products by transitioning MAC Address assignments to HP ProCurve MAC address assignments. HP will be doing a Product Roll to support this requirement. <b>HP ProCurve Statement on New DFS EU Standards</b> As of July 1st 2010, all wireless devices sold in the EU countries and any country that participates in the EU free market, must meet stringent Dynamic Frequency Selection (DFS) requirements for radar detection and avoidance. HP will be doing a Product Roll to support this requirement. For specific product roll details see our MAC Address A-to-B Roll and DFS Disablement Announcement.||M110_100x100.png|A-to-BRollforVariousHPProCurveAccessPoints.pdf|http://h20195.www2.hp.com/v2/GetDocument.aspx?docname=4AA0-8273ENW&cc=en&lc=en|||||||||||

To filter this master dataframe, use pandas methods:

# SPECIFIC PRODUCT WITH [] (BOOLEAN INDEXING)
filtered_df = hp_df[hp_df['Name'] == 'HPE 1410 Fast Ethernet Switches']

# SPECIFIC PRODUCT WITH .query()
filtered_df = hp_df.query("Name == 'HPE FlexNetwork 5940 Switch Series'")

# PASS A LIST WITH .isin()
filtered_df = hp_df[hp_df['Name'].isin(['HPE FlexNetwork 5120 SI Switch Series',
                                        'HPE 1410 Fast Ethernet Switches',
                                        'HPE OfficeConnect 1910 Switch Series'])]
Parfait
  • thanks for the detailed answer but the error is __Error reading file 'XSLT_Script.xsl': failed to load external entity "XSLT_Script.xsl"__ –  May 07 '18 at 20:29
  • As advised, you need to save the XSLT as its own .xsl file; here I named it generically *XSLT_Script.xsl*, assumed to be in the same folder as the Python script. – Parfait May 07 '18 at 20:46
  • I wrote both scripts. And yes, the entire XML page except the first *Item* node (which explains the data content) is imported. – Parfait May 07 '18 at 21:13
  • thank you for writing that script! It is appreciated. But it does not import the tables, and only the tables, on that page –  May 07 '18 at 21:19
  • Carefully check the text or pandas data. The HTML content is a subset of the source XML. I just did a search on the bold titles and all show up on text file search! You can always filter out by names in pandas. – Parfait May 07 '18 at 21:25
  • See the extension update at the bottom showing how to run various filters in pandas. A quick Google/Bing search will turn up many more examples/sites. – Parfait May 07 '18 at 21:42