
I am trying to scrape a Wikipedia infobox and put the data into a dictionary, where the first column of the infobox is the key and the second column is the value. Rows that do not have exactly 2 columns should be ignored. I am having trouble understanding how to get the value associated with each key. The page I am trying to scrape is https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347, and I want the information from its first infobox.

The results should look like: {"Name": "RMS Titanic", "Owner": "White Star Line", "Operator": "White Star Line", "Port of registry": "Liverpool, UK", "Route": "Southampton to New York City".....}

Here's what I've tried:

    import requests
    from bs4 import BeautifulSoup

    def get_infobox(url):
       response = requests.get(url)
       bs = BeautifulSoup(response.text)

       table = bs.find('table', {'class' :'infobox'})
       result = {}
       row_count = 0
       if table is None:
         pass
       else:
         for tr in table.find_all('tr'):
             if tr.find('th'):
                 pass
             else:
                 row_count += 1
         if row_count > 1:
             if tr is not None:
               result[tr.find('td').text.strip()] = tr.find('td').text
         return result

    print(get_infobox("https://en.wikipedia.org/w/index.php?title=Titanic&oldid=981851347"))

Any help would be greatly appreciated!

baduker
Anna Botts
  • Does this answer your question? [How to extract information from a Wikipedia infobox?](https://stackoverflow.com/questions/33862336/how-to-extract-information-from-a-wikipedia-infobox) – Tgr Oct 10 '20 at 05:05

1 Answer


If you do not need or want to use a scraper, you could use the MediaWiki API instead:

https://www.mediawiki.org/wiki/API:Main_page/de

The English endpoint is https://en.wikipedia.org/w/api.php

E.g.:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Titanic&rvsection=0
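A minimal sketch of calling that URL from Python, using only the standard library. The function names (`build_url`, `extract_wikitext`, `fetch_intro_wikitext`) and the `User-Agent` string are placeholders made up for this example. It also assumes the legacy JSON layout, where the page content sits under the `"*"` key of the first revision. Note that this query returns the current revision of the page, not the `oldid` from the question.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_url(title):
    """Build the same query as the example URL above for a given page title."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        "titles": title,
        "rvsection": 0,  # section 0 is the lead section, where the infobox lives
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(data):
    """Return the wikitext of the first page in the response, or None.

    Assumes the legacy response shape:
    {"query": {"pages": {"<pageid>": {"revisions": [{"*": "..."}]}}}}
    """
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        revisions = page.get("revisions")
        if revisions:
            return revisions[0].get("*")
    return None


def fetch_intro_wikitext(title):
    """Fetch the lead-section wikitext for a page (performs a network request)."""
    request = urllib.request.Request(
        build_url(title),
        # Placeholder User-Agent; replace it with something identifying your script.
        headers={"User-Agent": "infobox-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_wikitext(json.load(response))
```

Calling `fetch_intro_wikitext("Titanic")` should then give you the raw wikitext of the lead section, including the infobox templates. You would still have to parse that wikitext into key/value pairs yourself.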

Nico Bleiler