I use the code below to read tables from websites. With the first example everything works as expected. With the second example (the commented-out variables) I only get the first column, and I can't find the reason. Can somebody help?

A simple way to print the tables more nicely would also be welcome.

import urllib2
import pprint
from bs4 import BeautifulSoup

URL = 'http://www.proplanta.de/Markt-und-Preis/MATIF-Raps/'
TABLENR = 36

#URL = 'http://www1.chineseshipping.com.cn/en/indices/ccfinew.jsp'
#TABLENR = 4

req = urllib2.Request(URL, headers={'User-Agent': 'My Browser'})
con = urllib2.urlopen(req)
html = con.read()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

tables = soup.find_all('table')

data = []

rows = tables[TABLENR].find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

pprint.pprint(data)
jrjc
robvoi
  • In your second example (I didn't check the first), the data in the other columns is generated by JavaScript – jDo Mar 17 '16 at 14:27
  • OK, this explains the issue. Any suggestions on how I can read the table? – robvoi Mar 17 '16 at 14:36
  • I think the standard solution is to use Selenium, PhantomJS, Ghostery or some other JavaScript engine or "robot browser". I don't know much about any of them but keep hearing those three described as straightforward solutions for scraping JS content. Even better, maybe you can access the site's API directly. If you're lucky, it'll return nicely formatted JSON or XML – jDo Mar 17 '16 at 14:39
  • @robvoi Yep, you're lucky. The API returns [jsonp data](http://index.chineseshipping.com.cn/servlet/ccfiGetContrast?SpecifiedDate=&jc=jsonp1458225894956&_=1458225895739) :) – jDo Mar 17 '16 at 14:47

2 Answers

You could use the API instead. Much cleaner (even if my code might not be).

import requests
import json

url = "http://index.chineseshipping.com.cn/servlet/ccfiGetContrast?SpecifiedDate=&jc="
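jsonp = requests.get(url)
# The response is JSONP: JSON wrapped in "callback(...)". Keep the text between
# the first "(" and the last ")" so parentheses inside the data survive.
body = jsonp.text
table_data = json.loads(body[body.index("(") + 1:body.rindex(")")])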

# SCRAPE RESPONSIBLY. WE DON'T WANT TO DDOS SOME POOR WEBSITE
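
If the callback name is not empty, or the response ends with a semicolon, the unwrapping step can be pulled into a small helper. This is only a sketch: `unwrap_jsonp` is a name I made up, and the sample payload is invented for illustration, not the real API response.

```python
import json
import re

# Optional callback name, then everything between the first "(" and the
# last ")", then an optional trailing semicolon.
_JSONP_RE = re.compile(r'^\s*[\w$.]*\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Return the decoded JSON payload of a JSONP response body."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Hypothetical payload; the real response will have different fields.
sample = 'jsonp123({"name": "Index (composite)", "values": [1, 2]});'
result = unwrap_jsonp(sample)
print(result["name"])
```

With the code above it would be called as `table_data = unwrap_jsonp(jsonp.text)`.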
jDo

The page that does not work uses JavaScript. JavaScript creates dynamic content by altering the DOM (Document Object Model): the browser receives the HTML and then runs the scripts, which change it (in your case, they fill in the table data). When you fetch the page with urllib you receive the same HTML, but nothing runs the JavaScript on it, so the cells stay empty. With Selenium the page is loaded in a real browser, the scripts run, and you can read the complete data.
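
To see the effect in isolation, here is a minimal sketch (Python 3). The HTML is a made-up stand-in for the real page, and I use the standard library's `HTMLParser` in place of BeautifulSoup only so that it runs without extra packages. The second column is filled in by a script, so a parser that never executes scripts only finds the first column:

```python
from html.parser import HTMLParser

# Hypothetical page: column 1 is in the HTML, column 2 is written by a script.
STATIC_HTML = """
<table>
  <tr><td>Route A</td><td id="v1"></td></tr>
  <tr><td>Route B</td><td id="v2"></td></tr>
</table>
<script>
  document.getElementById("v1").textContent = "101.5";
  document.getElementById("v2").textContent = "98.2";
</script>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td>, grouped by <tr>. Scripts are never run."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Script text arrives here too, but outside a cell, so it is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


parser = CellCollector()
parser.feed(STATIC_HTML)
# Same "drop empty values" filter as in the question.
data = [[cell.strip() for cell in row if cell.strip()] for row in parser.rows]
print(data)  # [['Route A'], ['Route B']]
```

A browser-driven fetch, as in the Selenium code that follows, hands over the HTML after the scripts have run, so the second column is present.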

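from selenium import webdriver  # a bare "import selenium" does not load webdriver
from bs4 import BeautifulSoup

webpage = webdriver.Firefox()
webpage.get('http://www1.chineseshipping.com.cn/en/indices/ccfinew.jsp')
html = webpage.page_source  # HTML after the browser has run the scripts
webpage.quit()  # close the browser once the source has been read
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')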
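
From here the row loop from the question builds `data` as before. For the "nicer output" part of the question, a minimal sketch (Python 3) that pads each column to its widest cell; `format_table` is a hypothetical helper and the sample rows are invented:

```python
def format_table(rows):
    """Return rows (lists of cell values) as left-aligned, space-padded columns."""
    rows = [[str(cell) for cell in row] for row in rows if row]  # skip empty rows
    if not rows:
        return ""
    n_cols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [row + [""] * (n_cols - len(row)) for row in rows]
    widths = [max(len(row[i]) for row in padded) for i in range(n_cols)]
    lines = []
    for row in padded:
        line = "  ".join(cell.ljust(width) for cell, width in zip(row, widths))
        lines.append(line.rstrip())
    return "\n".join(lines)


# Hypothetical rows standing in for the scraped `data`.
data = [["Date", "Price"], [], ["17.03.2016", "360.25"], ["18.03.2016", "361"]]
print(format_table(data))
```

Header rows built from `th` cells are not collected by the question's loop, which only looks at `td`, so they would have to be added separately.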
Sharad