
My aim is to extract the 'Founded' and 'Products' information from the infobox of the Wikipedia page for Microsoft. I am using Python 3 with the following code, which I found online, but it is not working:

# importing modules 
import requests 
from lxml import etree 
# manually storing desired URL 
url='https://en.wikipedia.org/wiki/Microsoft'

# fetching its url through requests module   
req = requests.get(url)  

store = etree.fromstring(req.text) 

# trying to get the 'Founded' portion of above  
# URL's info box of Wikipedia's page 
output = store.xpath('//table[@class="infoboxvcard"]/tr[th/text()="Founded"]/td/i')  

# printing the text portion 
print output[0].text   

#Expected result:
 Founded:April 4, 1975; 43 years ago in Albuquerque, New Mexico, U.S.
petezurich
dwalker
    You can use the [Wikidata API](https://www.wikidata.org/wiki/Wikidata:Data_access) instead of scraping. – deadvoid Oct 20 '18 at 10:15
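To illustrate the Wikidata route from the comment above, here is a minimal offline sketch. Live data would come from `https://www.wikidata.org/wiki/Special:EntityData/Q2283.json` (assuming Q2283 is Microsoft's item ID and P571 is the "inception" property); a trimmed excerpt of that JSON shape is used here in place of a network call.

```python
import json

# Trimmed stand-in for the JSON returned by Wikidata's
# Special:EntityData endpoint (assumed shape: entities ->
# item ID -> claims -> property ID -> mainsnak.datavalue.value).
sample = json.loads("""
{
  "entities": {
    "Q2283": {
      "claims": {
        "P571": [
          {"mainsnak": {"datavalue": {"value": {"time": "+1975-04-04T00:00:00Z"}}}}
        ]
      }
    }
  }
}
""")

# Walk the claim structure to the inception timestamp.
claims = sample["entities"]["Q2283"]["claims"]
inception = claims["P571"][0]["mainsnak"]["datavalue"]["value"]["time"]
print(inception)  # +1975-04-04T00:00:00Z
```

With the real endpoint, you would replace `sample` with `requests.get(url).json()` and keep the same dictionary walk; structured claims avoid the layout-dependent scraping issues discussed in this question.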

2 Answers


An incorrect XPath was being used. I retrieved the correct XPath to the element from the Wikipedia page provided in the question. I also added parentheses to the print statement for Python 3 compatibility.

Try:

# importing modules
import requests
from lxml import etree
# manually storing desired URL
url='https://en.wikipedia.org/wiki/Microsoft'

# fetching its url through requests module
req = requests.get(url)

store = etree.HTML(req.text)  # etree.HTML tolerates real-world HTML; fromstring expects well-formed XML

# an incorrect xpath was being used
output = store.xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr[7]/td')

# added parenthesis python 3 
print(output[0].text)

I get:

April 4, 1975
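Note that the positional XPath above (`table[2]/tbody/tr[7]`) breaks as soon as Wikipedia reorders the infobox rows. A sturdier variant selects the row by its header text instead of by index. Below is an offline sketch; the HTML string is a trimmed stand-in for Wikipedia's infobox markup (the real page uses `class="infobox vcard"`, with a space, not `infoboxvcard` as in the question).

```python
from lxml import etree

# Trimmed stand-in for the Wikipedia infobox table.
html = """
<table class="infobox vcard">
  <tbody>
    <tr><th>Founded</th><td>April 4, 1975</td></tr>
    <tr><th>Products</th><td>Windows, Office</td></tr>
  </tbody>
</table>
"""

store = etree.HTML(html)  # lenient HTML parser

# Anchor on the row whose <th> says "Founded" rather than on row position.
rows = store.xpath('//table[contains(@class, "infobox")]'
                   '//tr[th="Founded"]/td')
print(rows[0].text)  # April 4, 1975
```

The same `contains(@class, ...)` / header-text pattern applied to `requests.get(url).text` should survive row reordering, though any scraper remains at the mercy of Wikipedia's markup.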
Thaer A
  • Perhaps go into a bit more detail about why you imported the modules you imported and the specifics of the solution you've come up with. It looks like you might have the answer desired, but it's just code -- code answers can be helpful but usually lack any lasting meaning for future viewers of this question. – Brandon Buck Oct 20 '18 at 10:38
  • @BrandonBuck He just modified the code of the OP, so he didn't add any imports himself. The question is very specific, so I don't see a point in making the answer broader than the question is. If anyone would like to start with web scraping, I think there are plenty of better places to start than this question. – Dluzak Oct 20 '18 at 10:59
  • @Dluzak SO is a Q/A site, I'll give you that. There is value in having a specific answer to a question and calling it a day. However, this question (unless deleted) will be here until such a time as it's removed from the SO database (if they shut down, or had some kind of failure, etc...). Given that, many people may come here seeking an answer to their problem -- maybe the exact specific issue, maybe a very related issue. If they don't find an answer here, or learn anything here, they'll ask a new question when we could have helped them now. So code-only answers aren't as useful as they could be. – Brandon Buck Oct 20 '18 at 11:02
  • @BrandonBuck I agree that some comment about what was wrong with the XPath, and what and why he changed, would definitely increase the value of this answer. But I think that comments about imported modules would be off-topic here. There are much better places for web-scraping newbies, like [this question](https://stackoverflow.com/questions/2081586/web-scraping-with-python), [this article](https://docs.python-guide.org/scenarios/scrape/) or [this question](https://stackoverflow.com/questions/2861/options-for-html-scraping). – Dluzak Oct 20 '18 at 11:30
  • 1
    BrandonBuck @Dluzak I edited my answer accordingly. I hope this is enough. This is my first post on the website. I have a lot of learning to do. Dluzak is right in that I merely edited the OPs original code. – Thaer A Oct 21 '18 at 04:12

You should probably use mwparserfromhell; trying to parse MediaWiki markup on your own is... trying. With mwparserfromhell you can filter out the templates and then extract their individual parameters.

import mwparserfromhell

# `text` holds the page's raw wikitext (e.g. fetched via the MediaWiki API)
code = mwparserfromhell.parse(text)
for template in code.filter_templates():
    if template.name.matches("Infobox company"):  # the infobox template used on the Microsoft page
        for param in template.params:
            print(param.name, param.value)

https://github.com/earwig/mwparserfromhell

awiebe