
I am extracting image data from Flickr via their API, and the output is a few thousand XML elements that look like this:

<photo accuracy="15" context="0" dateupload="1398279194" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="320" id="13986079375" isfamily="0" isfriend="0" ispublic="1" latitude="41.828482" license="0" longitude="-87.624506" owner="100231432@N02" pathalias="perspectivesschools" place_id="cF8n.mJTWrhYf0uBEw" secret="f46eef0b1d" server="7308" title="Sean Gallagher, Pulitzer Photojournalist visits MSA" url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" width_n="213" woeid="28297331" />
<photo accuracy="12" context="0" dateupload="1394558054" farm="4" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086071753" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="265103ac38" server="3040" title="" url_n="https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg" width_n="320" woeid="13978" />
<photo accuracy="12" context="0" dateupload="1394558019" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086343854" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="a6858f84d2" server="7451" title="" url_n="https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg" width_n="320" woeid="13978" />

Now I want to extract the values of the attributes 'latitude' and 'longitude' in one run, and the values of the attribute 'url_n' in another. How can I do that in Python? I have no experience with parsing XML data and don't know where to start.

Thanks a lot!

dwitvliet
bcrvc

2 Answers


Parsing XML with regex is not a good idea. Try BeautifulSoup: it not only parses XML, but also makes it easy to read an element's attributes and to navigate from a selected element to its next, parent, and other related elements.

Example use:

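from bs4 import BeautifulSoup
# (...) flickr_xml holds the XML text returned by the API
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = BeautifulSoup(flickr_xml, 'html.parser')
for photo in soup.find_all('photo'):
    # .get() returns None if the attribute is missing
    print(photo.get('url_n'))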
dwitvliet

Use lxml

While there are multiple XML-related packages in Python, including the one in the standard library, I prefer lxml: it offers everything I need (good XPath support, schema validation, etc.), and I like to keep the number of packages I use small.

For the XML documents from Flickr, the solution could look like this:

Script flickr.py

from lxml import etree
xmllines = """
<photo accuracy="15" context="0" dateupload="1398279194" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="320" id="13986079375" isfamily="0" isfriend="0" ispublic="1" latitude="41.828482" license="0" longitude="-87.624506" owner="100231432@N02" pathalias="perspectivesschools" place_id="cF8n.mJTWrhYf0uBEw" secret="f46eef0b1d" server="7308" title="Sean Gallagher, Pulitzer Photojournalist visits MSA" url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" width_n="213" woeid="28297331" />
<photo accuracy="12" context="0" dateupload="1394558054" farm="4" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086071753" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="265103ac38" server="3040" title="" url_n="https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg" width_n="320" woeid="13978" />
<photo accuracy="12" context="0" dateupload="1394558019" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086343854" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="a6858f84d2" server="7451" title="" url_n="https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg" width_n="320" woeid="13978" />
"""

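# Each line is a standalone XML document, so parse the lines one at a time.
for line in xmllines.strip().splitlines():
    doc = etree.fromstring(line)
    # xpath() returns a list of matches; it is empty if the attribute is missing.
    urls = doc.xpath("/photo/@url_n")
    if urls:
        url = urls[0]
        print(url)
    else:
        print("---no attribute url_n was found---")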

which would output:

$ python flickr.py
https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg
https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg
https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg
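
The question also asks for the latitude and longitude attributes. The same per-line approach can cover that; below is a minimal sketch, assuming the standard library's `xml.etree.ElementTree` in place of lxml (a substitution made for illustration, not part of the answer above). The helper name `extract_coords` and the shortened sample lines are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened sample lines for illustration; real Flickr output carries more attributes.
xmllines = """
<photo id="13986079375" latitude="41.828482" longitude="-87.624506" />
<photo id="13086071753" />
"""


def extract_coords(lines):
    """Return one (latitude, longitude) tuple per line, or None when either is missing."""
    coords = []
    for line in lines.strip().splitlines():
        photo = ET.fromstring(line)
        # Element.get() returns None instead of raising when an attribute is absent.
        lat, lon = photo.get("latitude"), photo.get("longitude")
        if lat is None or lon is None:
            coords.append(None)
        else:
            coords.append((float(lat), float(lon)))
    return coords


print(extract_coords(xmllines))
```

Returning `None` for incomplete records keeps the output aligned with the input lines, so a photo without coordinates does not abort the run.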
Jan Vlcinsky
  • It works, thanks! I managed to solve the problem with minidom, but the code breaks for some reason when parsing larger dataset... – bcrvc Jul 21 '14 at 11:46
  • @loop_digga You are welcome. I guess the code that breaks is the `minidom` one, not the `lxml` one. As `minidom` does all its work in memory, it can run into trouble with larger documents (even though your question shows many very small XML documents). With `lxml`, there are a few options for processing large (even endless) documents using limited memory (`iterparse`, or even SAX parsing). – Jan Vlcinsky Jul 21 '14 at 11:51
  • Thanks for the additional explanation! But actually, both versions break with a larger dataset. I tried yours with more data, and after printing 648 URLs this error occurs: 'IndexError: list index out of range'. Not sure what is happening. – bcrvc Jul 21 '14 at 12:03
  • @loop_digga It is likely that XML document number 649 has no "url_n" attribute. I have modified the answer to handle that. – Jan Vlcinsky Jul 21 '14 at 12:06
  • You are right! And I obviously had the same problem in minidom code. Thanks! – bcrvc Jul 21 '14 at 12:13
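
The `iterparse` option mentioned in the comments can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in the thread: it uses the standard library's `xml.etree.ElementTree.iterparse` rather than lxml, it assumes the photos are wrapped in a single `<photos>` root element (the question shows standalone lines instead), and the name `iter_urls` and the shortened sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: assumes all photos sit inside one <photos> root element.
SAMPLE = b"""<photos>
<photo id="13986079375" url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" />
<photo id="13086071753" />
<photo id="13086343854" url_n="https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg" />
</photos>"""


def iter_urls(source):
    """Yield url_n for each <photo>, or None where the attribute is missing."""
    # iterparse reports each element once its end tag has been read,
    # so the whole document never has to be parsed up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "photo":
            yield elem.get("url_n")
            # Release the element's attributes and children; the emptied
            # element itself stays attached to the root.
            elem.clear()


urls = list(iter_urls(io.BytesIO(SAMPLE)))
print(urls)
```

Because `Element.get()` returns `None` for a missing attribute, a record without `url_n` does not raise the `IndexError` discussed above.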