4

I have been trying to extract the infobox content using the wikipedia python package.

My code is as follows (for this page):

import wikipedia
Aldi = wikipedia.page('Aldi')

When I enter:

Aldi.content

I get the article text but not the infobox.

I have tried getting the data from DBpedia, but with no luck. I have also tried extracting the page with BeautifulSoup4, but the table is oddly structured (there is an image spanning both columns, followed by unnamed columns).

This is as far as I've gone with BeautifulSoup:

from bs4 import BeautifulSoup
import urllib2
site= "http://en.wikipedia.org/wiki/Aldi"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup

I also looked in Wikidata, but it does not contain most of the information that I need from the table.

I am not necessarily fixated on the python package as a solution. Anything that can parse the table would be awesome.

Preferably, I would like to have a dictionary with the infobox values:

Type     Private
Industry Retail

etc...

Michal
  • Possible duplicate of [Content of infobox of Wikipedia](http://stackoverflow.com/questions/8088226/content-of-infobox-of-wikipedia) – Nemo Nov 14 '15 at 18:38

2 Answers

5

A solution based on BeautifulSoup:

from bs4 import BeautifulSoup
import urllib2
site= "http://en.wikipedia.org/wiki/Aldi"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page.read())
table = soup.find('table', class_='infobox vcard')
result = {}
exceptional_row_count = 0
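for tr in table.find_all('tr'):
    th = tr.find('th')
    td = tr.find('td')
    if th and td:
        # a normal row: header cell on the left, value cell on the right
        result[th.text.strip()] = td.text.strip()
    else:
        # rows without a th/td pair, such as the logo row, fall here
        exceptional_row_count += 1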
if exceptional_row_count > 1:
    print 'WARNING ExceptionalRow>1: ', table
print result

Tested on http://en.wikipedia.org/wiki/Aldi, but not fully tested on other wiki pages.
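
If BeautifulSoup is not available, the same th/td pairing can be sketched in Python 3 with only the standard library's `html.parser`. This is a rough sketch rather than a tested scraper: the names `InfoboxParser`, `parse_infobox`, `fetch_html` and `SAMPLE` are made up for illustration, and it assumes that the infobox is the first table whose class list contains `infobox` and that every row is closed with an explicit `</tr>`.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class InfoboxParser(HTMLParser):
    """Collect header/value pairs from the first table with class 'infobox'."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._depth = 0      # table nesting depth inside the infobox (0 = outside)
        self._cell = None    # 'th' or 'td' while inside a top-level cell
        self._th = []        # text pieces of the current row's header cell
        self._td = []        # text pieces of the current row's value cell
        self._done = False   # set once the infobox table has been closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == 'table':
            classes = (dict(attrs).get('class') or '').split()
            if self._depth or 'infobox' in classes:
                self._depth += 1
        elif self._depth == 1:
            if tag == 'tr':
                self._th, self._td = [], []
            elif tag in ('th', 'td'):
                self._cell = tag

    def handle_endtag(self, tag):
        if self._done or not self._depth:
            return
        if tag == 'table':
            self._depth -= 1
            self._done = self._depth == 0
        elif self._depth == 1:
            if tag in ('th', 'td'):
                self._cell = None
            elif tag == 'tr':
                # normalise whitespace; rows without both cells are skipped
                key = ' '.join(' '.join(self._th).split())
                value = ' '.join(' '.join(self._td).split())
                if key and value:
                    self.result[key] = value

    def handle_data(self, data):
        if self._depth and self._cell == 'th':
            self._th.append(data)
        elif self._depth and self._cell == 'td':
            self._td.append(data)


def parse_infobox(html):
    """Return a dict of infobox rows found in an HTML string."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return parser.result


def fetch_html(url):
    """Download a page, sending the same User-Agent header as the code above."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read().decode('utf-8')


# A tiny hand-written sample that mimics the layout described in the question:
# a logo row spanning both columns, followed by header/value rows.
SAMPLE = """
<table class="infobox vcard">
<tr><td colspan="2"><img src="logo.png"></td></tr>
<tr><th>Type</th><td>Private</td></tr>
<tr><th>Industry</th><td><a href="/wiki/Retail">Retail</a></td></tr>
</table>
"""

print(parse_infobox(SAMPLE))
```

For a live page, `parse_infobox(fetch_html("http://en.wikipedia.org/wiki/Aldi"))` should work the same way, although I have only traced it against the sample above. Text pieces within a cell are joined with spaces, so a footnote marker ends up separated from the value it follows.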

ZZY
-1

My solution

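from bs4 import BeautifulSoup as bs
import urllib

# Python 2, like the rest of this thread
query = 'Albert_Einstein'  # page titles use underscores instead of spaces
url = 'https://en.wikipedia.org/wiki/' + query

def infobox():
    # fetch the page and print the text of every infobox row
    raw = urllib.urlopen(url)
    soup = bs(raw, 'html.parser')
    # match on 'infobox' alone so that extra classes on the table do not matter
    table = soup.find('table', {'class': 'infobox'})
    for tr in table.find_all('tr'):
        print tr.text

infobox()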
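
Building the URL by plain string concatenation breaks as soon as the title contains spaces or non-ASCII characters. A minimal Python 3 sketch of a safer way to build it, where the helper name `wiki_url` is hypothetical and the only assumption is the usual `/wiki/Title_with_underscores` URL layout:

```python
from urllib.parse import quote


def wiki_url(title):
    """Build an English Wikipedia page URL from a human-readable title."""
    # Wikipedia page URLs use underscores in place of spaces;
    # quote() percent-encodes anything else that is not URL-safe.
    return 'https://en.wikipedia.org/wiki/' + quote(title.strip().replace(' ', '_'))


print(wiki_url('Albert Einstein'))
```

This only fixes the formatting of the URL. A misspelled title still has to be corrected by hand or resolved through a search first.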
Polish