Python extract data from xml

Question

I'm trying to get the values from this web page:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<ArrayOfVwHistoryDetail xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://tempuri.org/">
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-01T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28671555</Value>
<ValueDetail>4415</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-02T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28675970</Value>
<ValueDetail>4279</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-03T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28680249</Value>
<ValueDetail>3975</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-04T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28684224</Value>
<ValueDetail>4236</ValueDetail>
</vwHistoryDetail>
</ArrayOfVwHistoryDetail>

I tested with this code:

import xml.etree.ElementTree as ET
from urllib import request


url = "http://SomeSite/WebService.asmx/LoadVariableHistory?username=USERNAME&password=PASSWORD&variableName=CBT2_G_PRM_FB2&startDateTime=2020-12-01&endDateTime=2020-12-02&sampling=3"

print ("Obter: ", url)
html = request.urlopen(url)
data = html.read()
print("Obtido: ",len(data),"caracteres")

tree = ET.fromstring(data)
results = tree.findall('Value')
for i in results:
  print(i)

I hid the full URL for safety reasons. What am I doing wrong that I don't get the values? I need to get through this part so I can build a dictionary of DateTime : Value pairs.

Thank you in advance

I don't think you want the HTML of the page... If you use the `requests` library, you can get your data like so: `requests.get(url).content`. Note that you will have to install requests via pip or similar. The XML probably won't be parsed correctly with the "This XML file does not appear to have..." text at the beginning. — marsnebulasoup, Dec 07 '20 at 16:32
when I print `requests.get(url).content`, I get this: `b'\r\n\r\n \r\n 2561\r\n 2020-12-01T00:00:00\r\n 1\r\n 28671555\r\n 4415\r\n \r\n'` I still get no value. — Nuno Félix, Dec 07 '20 at 16:53
That looks correct. XML does not care about whitespace such as \r\n which are just line breaks — WombatPM, Dec 07 '20 at 16:55

Parfait · Accepted Answer · 2020-12-07T18:37:28.780

Several issues emerge in your current implementation:

Your XML contains a default namespace, xmlns="http://tempuri.org/", so you need to qualify node names, for example by defining a prefix, in order to parse node content; findall accepts a namespaces argument.
Your path expression assumes Value is a child of root. You need to use a double-slash path, .//, since Value is a descendant of root.
You need to extract the text of the iterator variable. Otherwise, you print the <Element ... > object itself, which is rarely useful to the end user.
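
The first point can be checked directly. The following is a minimal sketch, assuming a trimmed one-record copy of the question's XML held in a hypothetical `sample` string. It shows that ElementTree stores each tag together with the namespace URI in braces, which is why a bare `Value` never matches:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML (one record), used only for illustration.
sample = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

root = ET.fromstring(sample)

# Every tag is stored as '{namespace-uri}localname'.
tags = [elem.tag for elem in root.iter()]
print(tags)

# A bare name matches nothing, even with a descendant path.
bare_matches = root.findall('.//Value')
print(len(bare_matches))
```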

Consider this adjustment:

tree = ET.fromstring(data)
nmsp = {'doc': 'http://tempuri.org/'}                         # NAMESPACE PREFIX ASSIGNMENT
results = tree.findall('.//doc:Value', namespaces = nmsp)     # NAMESPACE PREFIX USE WITH './/' PATH 
for i in results:
  print(i.text)                                               # RETRIEVE TEXT VALUE

# 28671555
# 28675970
# 28680249
# 28684224
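
As a side note, a prefix map is not the only way to qualify the name. The sketch below is an alternative, not part of the original answer, and assumes a two-record excerpt of the question's XML in a hypothetical `sample` string. It writes the namespace URI inline in braces, and also uses the `{*}` wildcard, which ElementTree supports from Python 3.8 onward:

```python
import xml.etree.ElementTree as ET

# Two-record excerpt of the question's XML, used only for illustration.
sample = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

root = ET.fromstring(sample)

# Namespace URI written inline in braces; no prefix dictionary needed.
clark_values = [v.text for v in root.findall('.//{http://tempuri.org/}Value')]

# '{*}' matches the tag in any namespace (requires Python 3.8+).
wild_values = [v.text for v in root.findall('.//{*}Value')]

print(clark_values)
print(wild_values)
```

The prefix map tends to read better when many paths share one namespace, while the wildcard is convenient when the namespace URI is not known in advance.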

Even better, build a list of dictionaries holding Value and its siblings with a list/dict comprehension (split removes the default namespace from the dict keys):

data_list_of_dicts = [{i.tag.split('}')[-1]: i.text for i in hd} 
                        for hd in tree.findall('.//doc:vwHistoryDetail', namespaces = nmsp)]

print(data_list_of_dicts)
# [{'idVariable': '2561', 'DateTime': '2020-12-01T00:00:00', 'idPeriodType': '1', 'Value': '28671555', 'ValueDetail': '4415'}, 
#  {'idVariable': '2561', 'DateTime': '2020-12-02T00:00:00', 'idPeriodType': '1', 'Value': '28675970', 'ValueDetail': '4279'}, 
#  {'idVariable': '2561', 'DateTime': '2020-12-03T00:00:00', 'idPeriodType': '1', 'Value': '28680249', 'ValueDetail': '3975'}, 
#  {'idVariable': '2561', 'DateTime': '2020-12-04T00:00:00', 'idPeriodType': '1', 'Value': '28684224', 'ValueDetail': '4236'}]

For a time-keyed value dictionary:

time_value_dict = {hd.find('doc:DateTime', namespaces=nmsp).text: 
                   hd.find('doc:Value', namespaces=nmsp).text 
                      for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)}

print(time_value_dict)
# {'2020-12-01T00:00:00': '28671555', 
#  '2020-12-02T00:00:00': '28675970', 
#  '2020-12-03T00:00:00': '28680249', 
#  '2020-12-04T00:00:00': '28684224'}
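
Both keys and values above are strings. If typed data is wanted, a sketch along the following lines could convert them while building the dictionary. It assumes a two-record excerpt of the question's XML in a hypothetical `sample` string, that every DateTime is ISO 8601 formatted, and that every Value is an integer, as in the sample data:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Two-record excerpt of the question's XML, used only for illustration.
sample = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

nmsp = {'doc': 'http://tempuri.org/'}
root = ET.fromstring(sample)

# findtext returns the matched element's text directly.
typed = {
    datetime.fromisoformat(hd.findtext('doc:DateTime', namespaces=nmsp)):
        int(hd.findtext('doc:Value', namespaces=nmsp))
    for hd in root.findall('.//doc:vwHistoryDetail', namespaces=nmsp)
}

print(typed)
```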

Online Demo

Thank you @Parfait, I tried to change your code with the suggestions from @balderman to just get the DataTime : Value Pair: `tree = ET.fromstring(data) nmsp = {'doc': 'http://tempuri.org/'} # NAMESPACE PREFIX ASSIGNMENT results = tree.findall('.//doc:Value', namespaces = nmsp) # NAMESPACE PREFIX USE WITH './/' PATH DataTimeValue_dict = [{i.find('DateTime').text: i.find('Value').text for i in hd} for hd in tree.findall('.//doc:vwHistoryDetail', namespaces = nmsp)]` — Nuno Félix, Dec 07 '20 at 17:52
Output: File "g:/My Drive/Projectos/Python/teste/get.py", line 19, in DataTimeValue_dict = [{i.find('DateTime').text: i.find('Value').text for i in hd} File "g:/My Drive/Projectos/Python/teste/get.py", line 19, in DataTimeValue_dict = [{i.find('DateTime').text: i.find('Value').text for i in hd} File "g:/My Drive/Projectos/Python/teste/get.py", line 19, in DataTimeValue_dict = [{i.find('DateTime').text: i.find('Value').text for i in hd} AttributeError: 'NoneType' object has no attribute 'text' PS G:\My Drive\Projectos\Python\teste> — Nuno Félix, Dec 07 '20 at 17:54
@NunoFélix, your dictionary comprehension is incorrect. Try `d = {hd.find('./doc:DateTime', namespaces=nmsp).text: hd.find('./doc:Value', namespaces=nmsp).text for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)}` — , Dec 07 '20 at 18:12
You need to pass `namespaces` in `.find` just as you did with `.findall`. See edit and demo update. — Parfait, Dec 07 '20 at 18:39
Thank you @Parfait and everyone who helped me. It finally works the way I was looking for. I have just started learning Python at Udemy, so I'm making a lot of newbie mistakes. — Nuno Félix, Dec 09 '20 at 09:29
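
The AttributeError in the comments above arises because `find` returns None when nothing matches, and None has no `.text`. A minimal sketch of that behavior follows, assuming a one-record excerpt of the question's XML in a hypothetical `sample` string. It also shows `findtext` with a default as one way to avoid the exception:

```python
import xml.etree.ElementTree as ET

# One-record excerpt of the question's XML, used only for illustration.
sample = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

nmsp = {'doc': 'http://tempuri.org/'}
root = ET.fromstring(sample)
record = root.find('doc:vwHistoryDetail', namespaces=nmsp)

# Without the namespace, find matches nothing and returns None.
missing = record.find('DateTime')

# With the prefix and the namespaces mapping, the element is found.
found = record.find('doc:DateTime', namespaces=nmsp)

# findtext returns the default instead of raising when nothing matches.
fallback = record.findtext('DateTime', default='n/a')

print(missing, found.text, fallback)
```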

score 0 · Answer 2 · answered Dec 07 '20 at 16:54

tree = ET.fromstring(data)
for detail in tree.findall('vwHistoryDetail'):
  v = detail.find('Value').text
  print(v)

You are better off looping through each parent element and extracting its child elements, instead of grabbing the children directly, since Value may be a tag reused in other parts of the document.

WombatPM

I tested your suggestions but still, no value is printed. Thank you anyway – Nuno Félix Dec 07 '20 at 17:03
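
The loop in this answer most likely prints nothing because the tag names are not namespace-qualified. The following is a sketch of the same per-record loop with the namespace added, assuming a two-record excerpt of the question's XML in a hypothetical `sample` string:

```python
import xml.etree.ElementTree as ET

# Two-record excerpt of the question's XML, used only for illustration.
sample = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

nmsp = {'doc': 'http://tempuri.org/'}
root = ET.fromstring(sample)

values = []
# vwHistoryDetail elements are direct children of the root, so no './/' is needed here.
for detail in root.findall('doc:vwHistoryDetail', namespaces=nmsp):
    # Look up Value relative to its own record, not the whole document.
    values.append(detail.find('doc:Value', namespaces=nmsp).text)

print(values)
```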

score 0 · Answer 3 · answered Dec 07 '20 at 16:58

See below

import xml.etree.ElementTree as ET
import re

#
xml = '''<ArrayOfVwHistoryDetail xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                                 xmlns="http://tempuri.org/">
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-01T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28671555</Value>
      <ValueDetail>4415</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-02T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28675970</Value>
      <ValueDetail>4279</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-03T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28680249</Value>
      <ValueDetail>3975</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-04T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28684224</Value>
      <ValueDetail>4236</ValueDetail>
   </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''
xml = re.sub(' xmlns="[^"]+"', '', xml, count=1)
root = ET.fromstring(xml)
data = {v.find('DateTime').text: v.find('Value').text for v in root.findall('.//vwHistoryDetail')}
print(data)

output

{'2020-12-01T00:00:00': '28671555', '2020-12-02T00:00:00': '28675970', '2020-12-03T00:00:00': '28680249', '2020-12-04T00:00:00': '28684224'}

[Running regex on XML](https://stackoverflow.com/a/1732454/1422451)? All compliant DOM libraries should handle default namespaces and not need to remove it from tree. — Parfait, Dec 07 '20 at 17:06
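
If namespace-free tag names are still wanted without running a regex over the raw text, one possible alternative (not proposed in this thread) is to parse first and then strip the namespace from each tag in the tree. A sketch, assuming a two-record excerpt of the question's XML in a hypothetical `sample` string:

```python
import xml.etree.ElementTree as ET

# Two-record excerpt of the question's XML, used only for illustration.
sample = """<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>"""

root = ET.fromstring(sample)

# Tags look like '{uri}name'; keep only the local name after the closing brace.
for elem in root.iter():
    if isinstance(elem.tag, str) and elem.tag.startswith('{'):
        elem.tag = elem.tag.split('}', 1)[1]

# Plain, unqualified paths now work on the modified tree.
data = {v.find('DateTime').text: v.find('Value').text
        for v in root.findall('.//vwHistoryDetail')}

print(data)
```

This discards the namespace information in the parsed tree, so it only suits documents where local names alone are unambiguous.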
