0

I'm pulling data from an API with the following:

url = "http://sitename"
response = requests.get(url)
data = response.text
print (data)

I get the output of raw xml, below is the browser output:

<projects count="8" href="/httpAuth/app/rest/projects/">
<project id="_Root" name="" description="" href="" webUrl=""/>
<project id="_Root1" name="" description="" href="" webUrl=""/>
<project id="_Root2" name="" description="" href="" webUrl=""/>
<project id="_Root3" name="" description="" href="" webUrl=""/>
<project id="_Root4" name="" description="" href="" webUrl=""/>
<project id="_Root5" name="" description="" href="" webUrl=""/>
<project id="_Root6" name="" description="" href="" webUrl=""/>
<project id="_Root7" name="" description="" href="" webUrl=""/>
</projects>

How do I get each rows information into usable form, such as looping through the list for each project id I pull the id/name/desc/href of each and store it?

I tried doing an conversion to json in the accept headers section for requests.get() but it still spit back xml data so I think I'm stuck working with this content structure.

Adam Smith
  • 52,157
  • 12
  • 73
  • 112
SudoGaron
  • 339
  • 6
  • 22

2 Answers2

1

I'd use lxml.

import requests
from lxml import etree

url = "http://sitename"
response = requests.get(url)
data = response.text
tree = etree.fromstring(data)
for leaf in tree:
    print(leaf.tag, leaf.attrib['id'], leaf.attrib['name'],
          leaf.attrib['description'], leaf.attrib['href'],
          leaf.attrib['webUrl'])

Which gives you:

project _Root
project _Root1
project _Root2
project _Root3
project _Root4
project _Root5
project _Root6
project _Root7
Adam Smith
  • 52,157
  • 12
  • 73
  • 112
0

For well-structured xml files, you can use (@Adam Smith) lxml, which is a quite famous library for parsing xml data.

For example, parsing the data you mention will take a snippet like the following:

>> from lxml import etree
>> root = etree.fromstring(s) # your input string in question
>> for element in root.getchildren(): print element.items() # dict-like
[('id', '_Root'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root1'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root2'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root3'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root4'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root5'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root6'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root7'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]

Now, it is a known problem(?) that if your xml file is somehow corrupted, like missing a character in closing tag and whatnot, then lxml doesn't work. Well, it is not supposed to.

In that case you need to relay on regular expression (regex), i.e. Python's re module. Corrupted data will force you to man up and compile your own regular expression. For example, given the data you have, you can use the following regex:

(?:\<project id="(\w*?)" name="(\w*?)" description="(\w*?)" href="(\w*?)" webUrl="(\w*?)"\/\>)

This will extract out five groups per match, each match contains one line while each group corresponds to an attribute, which can be empty strings. For more details, check out the Python Doc. Also this site is a good tool for prototyping/testing regex.

Patrick the Cat
  • 2,138
  • 1
  • 16
  • 33
  • 1
    And god forbid you have to do that, because [as famously mentioned ad nauseum](http://stackoverflow.com/a/1732454/3058609) parsing XML with regular expressions can drive a man to madness. – Adam Smith Dec 08 '15 at 02:59
  • 1
    @AdamSmith If `lxml` works, no one will want to reinvent the wheel. The latter parts describe cases when `lxml` and other libraries give parsing errors. You will be surprise how many archived datasets are corrupted here and there...`re`, necessary evil I'd say (god punching me in the face). – Patrick the Cat Dec 08 '15 at 03:12
  • Oh I'll never debate that, I'm just praying for his mortal soul! :) – Adam Smith Dec 08 '15 at 07:35