Fetching text from Wikipedia’s Infobox in Python

Question

I want to get the infobox contents of https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie

import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  

store = etree.fromstring(req.text) 

# this will give Motto portion of above  
# URL's info box of Wikipedia's page 
output = store.xpath('//table[@class="infobox vcard"]/tr[th/text()="Destinations"]/td/i')  

# printing the text portion 
print output[0].text

but the result is empty.

Even though req.text has content, the XPath returns nothing. How can I get the infobox contents? In particular:

IATA ICAO
AH DAH

I need the IATA and ICAO codes. Please help.

Also remember that DBPedia is not synchronized in real time with Wikipedia; there can be a delay of a few months between the Wikipedia version and the corresponding DBPedia entry. I don't want DBPedia contents.

First try `//table[@class="infobox vcard"]`, then `'//table[@class="infobox vcard"]/tr'`, etc., and you may find where the problem is — furas, Aug 01 '19 at 03:13
`<tr>` is not directly in `<table>`, so you have to use `//` between `<table>` and `<tr>` - `//table[@class="infobox vcard"]//tr` — furas, Aug 01 '19 at 03:20
to get `AH`, `DAH`, `AIR ALGERIE` you can use `xpath( '//td[@class="nickname"]' )` — furas, Aug 01 '19 at 03:23
`output = store.xpath('//table[@class="infobox vcard"]')` got `[]` @furas — horoyoi o, Aug 01 '19 at 04:06
@furas `output = store.xpath('//table[@class="infobox vcard"]/tr')` got `[]` — horoyoi o, Aug 01 '19 at 04:08
As I said: in this HTML `<tr>` is not directly after `<table>`, so you have to use `//` between `<table>` and `<tr>` - `'//table[@class="infobox vcard"]//tr'` - or you would have to use all tags which are between `<table>` and `<tr>` - `'//table[@class="infobox vcard"]/tbody/tr'` — furas, Aug 01 '19 at 04:12
your comment works! !!!!! thank you! `to get AH, DAH, AIR ALGERIE you can use xpath( '//td[@class="nickname"]' ) ` — horoyoi o, Aug 01 '19 at 04:13
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/197316/discussion-between-horoyoi-o-and-furas). — horoyoi o, Aug 01 '19 at 04:17
Possible duplicate of [How to extract information from a Wikipedia infobox?](https://stackoverflow.com/questions/33862336/how-to-extract-information-from-a-wikipedia-infobox) — Tgr, Aug 01 '19 at 13:58

furas · Answer 1 · 2019-08-01T09:31:30.590

To get AH, DAH, AIR ALGERIE you can use

xpath( '//td[@class="nickname"]' )

As for your XPath: in this HTML there is a <tbody> between <table> and <tr>, so you have to include it in the XPath

'//table[@class="infobox vcard"]/tbody/tr[th/text()="Destinations"]/td'

or use //, which works even if there are more tags between <table> and <tr>

'//table[@class="infobox vcard"]//tr[th/text()="Destinations"]/td'

I also skipped <i> at the end because the "Destinations" row doesn't use <i>

import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  
store = etree.fromstring(req.text) 

output = store.xpath('//td[@class="nickname"]')  
for x in output:
    print(x.text.strip())

#output = store.xpath('//table[@class="infobox vcard"]//tr[th/text()="Destinations"]/td')
output = store.xpath('//table[@class="infobox vcard"]/tbody/tr[th/text()="Destinations"]/td')
print(output[0].text)

Result

AH
DAH
AIR ALGERIE
69
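
The difference between `/tr`, `/tbody/tr` and `//tr` can also be checked offline. The following is only a minimal sketch: the `FRAGMENT` markup is a simplified, made-up stand-in for the infobox (not the real page), and it uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without a network connection. ElementTree supports only a subset of XPath, so paths start with `.//` here.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the infobox markup (not the real page).
FRAGMENT = """
<div>
  <table class="infobox vcard">
    <tbody>
      <tr><th>Destinations</th><td>69</td></tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(FRAGMENT)

# Child step: <tr> is not a direct child of <table>, so nothing matches.
direct = root.findall('.//table[@class="infobox vcard"]/tr')

# Naming the intermediate <tbody> explicitly matches the row.
via_tbody = root.findall('.//table[@class="infobox vcard"]/tbody/tr')

# Descendant step: matches <tr> at any depth below <table>.
descendant = root.findall('.//table[@class="infobox vcard"]//tr')

# Select the cell of the row whose header text is "Destinations".
cells = root.findall('.//table[@class="infobox vcard"]//tr[th="Destinations"]/td')
destinations = cells[0].text if cells else None

print(len(direct), len(via_tbody), len(descendant), destinations)
```

Checking `cells` before indexing avoids an IndexError when the XPath matches nothing, which is what happens with the original `/tr` path.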

EDIT:

I use another XPath to get the names "IATA", "ICAO", "Callsign", and then I use zip() to group them with "AH", "DAH", "AIR ALGERIE"

import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  
store = etree.fromstring(req.text) 

keys = store.xpath('//table[@class="infobox vcard"]//table//tr[1]//a')
#for x in keys:
#    print(x.text.strip())

values = store.xpath('//td[@class="nickname"]')  
#for x in values:
#    print(x.text.strip())

some_dict = dict()

for k, v in zip(keys, values):
    k = k.text.strip()
    v = v.text.strip()
    some_dict[k] = v
    print(k, '=', v)

print(some_dict)

Result:

IATA = AH
ICAO = DAH
Callsign = AIR ALGERIE

{'IATA': 'AH', 'ICAO': 'DAH', 'Callsign': 'AIR ALGERIE'}
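
One caveat with zip(): it stops at the shorter input, so if a page omits one label or one value, the pairs can shift without any error. A possible guard is sketched below. The helper `pair_labels` and the `FRAGMENT` markup are hypothetical (a simplified stand-in for the nested infobox table, not the real page), and the sketch uses the standard library's `xml.etree.ElementTree` rather than `lxml` so it runs offline.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the nested infobox table (not the real page).
FRAGMENT = """
<div>
  <table class="infobox vcard">
    <tbody>
      <tr><td>
        <table>
          <tbody>
            <tr><th><a>IATA</a></th><th><a>ICAO</a></th><th><a>Callsign</a></th></tr>
            <tr>
              <td class="nickname">AH</td>
              <td class="nickname">DAH</td>
              <td class="nickname">AIR ALGERIE</td>
            </tr>
          </tbody>
        </table>
      </td></tr>
    </tbody>
  </table>
</div>
"""


def pair_labels(keys, values):
    """Pair label and value elements; refuse to guess when the counts differ."""
    if len(keys) != len(values):
        raise ValueError(f"{len(keys)} labels but {len(values)} values")
    return {
        (k.text or "").strip(): (v.text or "").strip()
        for k, v in zip(keys, values)
    }


root = ET.fromstring(FRAGMENT)

# Labels: links in the first row of the nested table.
keys = root.findall('.//table[@class="infobox vcard"]//table//tr[1]//a')

# Values: cells marked with the "nickname" class.
values = root.findall('.//td[@class="nickname"]')

codes = pair_labels(keys, values)
print(codes)
```

Raising on a mismatch is one design choice; on pages where some fields are legitimately absent, matching each label to the cell in the same column would be another.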

Can you edit this code to include `IATA ICAO Callsign`? I mean `AH` is the `IATA` code. — horoyoi o, Aug 01 '19 at 05:04
I don't get "IATA" from XPath. I use the string "IATA" directly. Why use XPath for a string which I already know? — furas, Aug 01 '19 at 09:18
But sometimes pages may have different names - some names can be skipped on a page - and then it is good to use XPath to get them. I added an example which gets the strings "IATA", "ICAO", "Callsign" from the page and groups them with "AH", "DAH", "AIR ALGERIE" — furas, Aug 01 '19 at 09:33
