I'm trying to scrape a table from the NYSE website (http://www1.nyse.com/about/listed/IPO_Index.html) into a pandas DataFrame. To do so, I have a setup like this:
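from io import StringIO
import pandas
import requests
from bs4 import BeautifulSoup

def htmltodf(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    tables = soup.find_all('table')  # only sees tables present in the static HTML
    test = pandas.read_html(StringIO(str(tables)))
    return test  # a list of DataFrames, one per table found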
However, when I run this on the page, all of the tables returned in the list are essentially empty. When I investigated further, I found that the table is generated by JavaScript. In my browser's developer tools, the table looks like any other HTML table, with the usual tags. However, viewing the page source reveals something like this instead:
<script language="JavaScript">
.
.
.
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May 2014","/about/listed/icc.html" ],
... more entries here ...
,["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
if(year.length != 0)
{
document.write ("<table width='619' border='0' cellspacing='0' cellpadding='0'><tr><td><span class='fontbold'>");
document.write ('2014' + " IPO Showcase");
document.write ("</span></td></tr></table>");
}
</script>
Therefore, when my HTML parser looks for the table tag, all it can find is what is written inside the if condition, and no proper tags below it that would hold the content. How can I scrape this table? Is there a tag I can search for instead of table that will reveal the content? Since the data is not in a traditional HTML table, how do I read it into pandas? Do I have to parse the data manually?
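To frame the question, here is a minimal sketch of the manual parsing I have in mind. It rests on several assumptions: that the array assigned to `year` is valid JSON (double-quoted strings, no trailing commas), that no field contains `]]`, and that the real page matches the excerpt above. The function name `script_array_to_df`, the `SAMPLE_HTML` stand-in for `page.text`, and the column names are all made up for illustration.

```python
import json
import re

import pandas as pd

# Stand-in for page.text, trimmed to two rows from the excerpt above.
SAMPLE_HTML = """
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May 2014","/about/listed/icc.html" ],
["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
if(year.length != 0)
{
document.write ("<table width='619'><tr><td>2014 IPO Showcase</td></tr></table>");
}
</script>
"""


def script_array_to_df(html, var_name="year"):
    """Pull the JavaScript array assigned to var_name out of the raw HTML."""
    # Capture from the opening "[[" to the first "]]" that is followed by ";".
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[\[.*?\]\])\s*;"
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        raise ValueError("no array named %r found in the page" % var_name)

    # Assumption: the literal is valid JSON, so json.loads can parse it.
    rows = json.loads(match.group(1))

    # Hypothetical column names; the page itself does not label the fields.
    df = pd.DataFrame(rows, columns=["symbol", "company", "ipo_date", "link"])
    df["ipo_date"] = pd.to_datetime(df["ipo_date"], format="%d %b %Y")
    return df


df = script_array_to_df(SAMPLE_HTML)
print(df)
```

If the array were not valid JSON (single-quoted strings, for example), `json.loads` would fail and this would need a different parser, or a headless browser that actually executes the script.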