I have been trying to extract the infobox content using the wikipedia python package.
My code is as follows (for this page):
import wikipedia
Aldi = wikipedia.page('Aldi')
When I enter:
Aldi.content
I get the article text but not the infobox.
I have tried getting the data from DBpedia, but with no luck. I have also tried extracting the page with BeautifulSoup4, but the table is oddly structured (there is an image spanning both columns, followed by unnamed columns).
This is as far as I've gone with BeautifulSoup:
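from bs4 import BeautifulSoup
from urllib.request import Request, urlopen  # Python 3 replacement for urllib2
site = "http://en.wikipedia.org/wiki/Aldi"
# Send a browser-like User-Agent with the request
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(page, "html.parser")
print(soup)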
I also looked at Wikidata, but it does not contain most of the information that I need from the table.
I am not necessarily fixated on the python package as a solution. Anything that can parse the table would be awesome.
Preferably, I would like to have a dictionary with the infobox values:
Type Private
Industry Retail
etc...
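To make the target concrete, here is a minimal sketch of the result I am after, run on a small hand-written HTML fragment rather than the live page. It assumes the infobox is a table with class "infobox" whose data rows pair a th label with a td value, which I have not verified against the real markup. SAMPLE_HTML, InfoboxParser and parse_infobox are names made up for illustration, and the sketch uses only the standard library's html.parser.

```python
from html.parser import HTMLParser

# Hand-written stand-in for an infobox. Assumption: the real page follows
# the same pattern of one <th> label and one <td> value per data row.
SAMPLE_HTML = """
<table class="infobox">
  <tr><td colspan="2"><img src="logo.png"></td></tr>
  <tr><th>Type</th><td>Private</td></tr>
  <tr><th>Industry</th><td><a href="/wiki/Retail">Retail</a></td></tr>
</table>
"""


class InfoboxParser(HTMLParser):
    """Collect {label: value} from rows that have both a <th> and a <td>."""

    def __init__(self):
        super().__init__()
        self.data = {}
        self._in_infobox = False
        self._cell = None  # "th" or "td" while inside a cell, else None
        self._label = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "infobox" in classes:
            self._in_infobox = True
        elif self._in_infobox and tag == "tr":
            # Start of a new row: forget the previous label and value
            self._label, self._value = [], []
        elif self._in_infobox and tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag == "table":
            # Simplification: nested tables inside the infobox are not handled
            self._in_infobox = False
        elif tag in ("th", "td"):
            self._cell = None
        elif tag == "tr" and self._in_infobox:
            label = " ".join("".join(self._label).split())
            value = " ".join("".join(self._value).split())
            # Rows without both parts (e.g. the image row) are skipped
            if label and value:
                self.data[label] = value

    def handle_data(self, data):
        if self._cell == "th":
            self._label.append(data)
        elif self._cell == "td":
            self._value.append(data)


def parse_infobox(html):
    """Return the infobox rows of an HTML string as a dictionary."""
    parser = InfoboxParser()
    parser.feed(html)
    return parser.data


infobox = parse_infobox(SAMPLE_HTML)
print(infobox)
```

The image row is dropped because it has no th label, which is how I would expect the oddly structured first row to be handled. Whether this holds up on the full Aldi page (nested tables, footnote markers, multi-line values) is exactly what I am unsure about.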