I want to get the infobox contents of https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie

I followed this article.

import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  

store = etree.fromstring(req.text) 

# this should select the "Destinations" row
# of the Wikipedia page's infobox
output = store.xpath('//table[@class="infobox vcard"]/tr[th/text()="Destinations"]/td/i')  

# printing the text portion 
print output[0].text   

But the result is empty.

Even though req.text contains the page, the XPath query returns nothing. How can I get the infobox contents? In particular:

IATA ICAO
AH DAH

I need the IATA and ICAO codes. Please help.

Also, remember that DBpedia is not synchronized in real time with Wikipedia; there can be a delay of a few months between the Wikipedia version and the corresponding DBpedia entry. I don't want DBpedia contents.

horoyoi o
  • first try `//table[@class="infobox vcard"]`, next `'//table[@class="infobox vcard"]/tr'` , etc. and maybe you find where is problem – furas Aug 01 '19 at 03:13
  • `<tr>` is not directly in `<table>` so you have to use `//` between `<table>` and `<tr>` - `//table[@class="infobox vcard"]//tr` – furas Aug 01 '19 at 03:20
  • row with `"Destinations"` doesn't have `<i>` – furas Aug 01 '19 at 03:20
  • to get `AH`, `DAH`, `AIR ALGERIE` you can use `xpath( '//td[@class="nickname"]' )` – furas Aug 01 '19 at 03:23
  • `output = store.xpath('//table[@class="infobox vcard"]')` got `[]` @furas – horoyoi o Aug 01 '19 at 04:06
  • @furas `output = store.xpath('//table[@class="infobox vcard"]/tr')` got `[]` – horoyoi o Aug 01 '19 at 04:08
  • as I said: in this HTML `<tr>` is not directly after `<table>` so you have to use `//` between `<table>` and `<tr>` - `'//table[@class="infobox vcard"]//tr'` - or you would have to use all tags which are between `<table>` and `<tr>` - `'//table[@class="infobox vcard"]/tbody/tr'` – furas Aug 01 '19 at 04:12
  • your comment works! !!!!! thank you! `to get AH, DAH, AIR ALGERIE you can use xpath( '//td[@class="nickname"]' ) ` – horoyoi o Aug 01 '19 at 04:13
  • Possible duplicate of [How to extract information from a Wikipedia infobox?](https://stackoverflow.com/questions/33862336/how-to-extract-information-from-a-wikipedia-infobox) – Tgr Aug 01 '19 at 13:58

1 Answer

To get AH, DAH, AIR ALGERIE you can use

xpath( '//td[@class="nickname"]' ) 

As for your xpath: in this HTML there is a <tbody> between <table> and <tr>, so /tr directly after the table matches nothing. You would have to include <tbody> in the xpath

'//table[@class="infobox vcard"]/tbody/tr[th/text()="Destinations"]/td'

or use // and it will work even if there are more tags between <table> and <tr>

'//table[@class="infobox vcard"]//tr[th/text()="Destinations"]/td'

I also skipped <i> at the end because the "Destinations" row doesn't use <i>
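
The difference between the three forms can be checked offline with a minimal sketch. It assumes a heavily simplified stand-in for the infobox markup (the SAMPLE string is hypothetical, not the real page) and uses the standard library's xml.etree.ElementTree instead of lxml so it runs without network access. ElementTree only supports a subset of XPath, so the row predicate is written as [th='Destinations'] rather than th/text()="Destinations".

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the infobox markup:
# note the <tbody> sitting between <table> and <tr>.
SAMPLE = """
<html><body>
<table class="infobox vcard">
  <tbody>
    <tr><th>Destinations</th><td>69</td></tr>
  </tbody>
</table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# /tr looks only at direct children of <table>, so it finds nothing here.
direct = root.findall(".//table[@class='infobox vcard']/tr")

# Naming the intermediate <tbody> explicitly reaches the row.
via_tbody = root.findall(
    ".//table[@class='infobox vcard']/tbody/tr[th='Destinations']/td"
)

# // searches all descendants, so it does not care what sits in between.
any_depth = root.findall(
    ".//table[@class='infobox vcard']//tr[th='Destinations']/td"
)

print(len(direct))        # 0
print(via_tbody[0].text)  # 69
print(any_depth[0].text)  # 69
```

The // form is the more forgiving choice if the markup between <table> and <tr> might change.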


import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  
store = etree.fromstring(req.text) 

output = store.xpath('//td[@class="nickname"]')  
for x in output:
    print(x.text.strip())

#output = store.xpath('//table[@class="infobox vcard"]//tr[th/text()="Destinations"]/td')
output = store.xpath('//table[@class="infobox vcard"]/tbody/tr[th/text()="Destinations"]/td')
print(output[0].text) 

Result

AH
DAH
AIR ALGERIE
69

EDIT:

I use another xpath to get the names "IATA", "ICAO", "Callsign" and then I use zip() to group them with "AH", "DAH", "AIR ALGERIE"

import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  
store = etree.fromstring(req.text) 

keys = store.xpath('//table[@class="infobox vcard"]//table//tr[1]//a')
#for x in keys:
#    print(x.text.strip())

values = store.xpath('//td[@class="nickname"]')  
#for x in values:
#    print(x.text.strip())

some_dict = dict()

for k, v in zip(keys, values):
    k = k.text.strip()
    v = v.text.strip()
    some_dict[k] = v
    print(k, '=', v)

print(some_dict)

Result:

IATA = AH
ICAO = DAH
Callsign = AIR ALGERIE

{'IATA': 'AH', 'ICAO': 'DAH', 'Callsign': 'AIR ALGERIE'}
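
Because zip() pairs keys and values purely by position and silently stops at the shorter list, a page that omits one of the names would produce a wrong or truncated dictionary. Below is a rough sketch of a more defensive pairing, again on a hypothetical simplified SAMPLE string and with xml.etree.ElementTree standing in for lxml so it runs offline. The cell_text helper is my own name, not part of any library. It joins all text inside an element, since .text alone only returns the text before the first child tag and can be None.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the nested codes table in the infobox.
SAMPLE = """
<html><body>
<table class="infobox vcard"><tbody><tr><td>
  <table><tbody>
    <tr><th><a>IATA</a></th><th><a>ICAO</a></th><th><a>Callsign</a></th></tr>
    <tr>
      <td class="nickname">AH</td>
      <td class="nickname">DAH</td>
      <td class="nickname">AIR ALGERIE</td>
    </tr>
  </tbody></table>
</td></tr></tbody></table>
</body></html>
"""


def cell_text(element):
    """Return all text inside element, including text of nested tags."""
    return "".join(element.itertext()).strip()


root = ET.fromstring(SAMPLE)

# Same two queries as above, in ElementTree's XPath subset.
keys = root.findall(".//table[@class='infobox vcard']//table//tr[1]//a")
values = root.findall(".//td[@class='nickname']")

# zip() would hide a mismatch, so fail loudly instead.
if len(keys) != len(values):
    raise ValueError(f"found {len(keys)} names but {len(values)} values")

codes = {cell_text(k): cell_text(v) for k, v in zip(keys, values)}
print(codes)
```

Whether to raise or just skip the page on a mismatch depends on how many pages you scrape; raising is only the simplest option for a sketch.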
furas
  • can you edit this code with `IATA ICAO Callsign` ?? I mean `AH` is `IATA` code. – horoyoi o Aug 01 '19 at 05:04
  • then do `some_dict["IATA"] = output[0].text` – furas Aug 01 '19 at 08:51
  • no. what i mean, I cannot get "IATA" from store.xpath – horoyoi o Aug 01 '19 at 09:13
  • I don't get "IATA" from xpath. I use directly string "IATA" . Why to use xpath for string which I already know ? – furas Aug 01 '19 at 09:18
  • but sometimes pages may have different names - some names can be skipped on a page - and then it is good to use xpath to get them. I added an example which gets the strings "IATA", "ICAO", "Callsign" from the page and groups them with "AH", "DAH", "AIR ALGERIE" – furas Aug 01 '19 at 09:33