24

I want to import OData XML datafeeds from the Dutch Bureau of Statistics (CBS) into our database. Using lxml and pandas I thought this should be straigtforward. By using OrderDict I want to preserve the order of the columns for readability, but somehow I can't get it right.

from collections import OrderedDict
from lxml import etree
import requests
import pandas as pd


# CBS URLs
base_url = 'http://opendata.cbs.nl/ODataFeed/odata'
datasets = ['/37296ned', '/82245NED']

feed = requests.get(base_url + datasets[1] + '/TypedDataSet')
root = etree.fromstring(feed.content)

# all record entries start at tag m:properties, parse into data dict
data = []
for record in root.iter('{{{}}}properties'.format(root.nsmap['m'])):
    row = OrderedDict()
    for element in record:
        row[element.tag.split('}')[1]] = element.text
    data.append(row)

df = pd.DataFrame.from_dict(data)
df.columns

Inspecting data, the OrderDict is in the right order. But looking at df.head() the columns have been sorted alphabetically with CAPS first?

Help, anyone?

dkapitan
  • 859
  • 2
  • 10
  • 21

2 Answers2

38

Something in your example seems to be inconsistent, as data is a list and no dict, but assuming you really have an OrderedDict:

Try to explicitly specify your column order when you create your DataFrame:

# ... all your data collection
df = pd.DataFrame(data, columns=data.keys())

This should give you your DataFrame with the columns ordered just in exact the way they are in the OrderedDict (via the data.keys() generated list)

chris-sc
  • 1,698
  • 11
  • 21
  • 7
    thanks @chris-sc, this works. The `data` is a list of OrderedDicts, so in fact each item in that list is a record. Slight modification to your solution `df = pd.DataFrame(data, columns=data[0].keys())`. Yet, a bit disappointing that it doesn't infer this automatically, but that may be just me ... – dkapitan Nov 17 '15 at 15:01
  • You can also go with the more concise `columns=data` since iterating over `data` iterates over the keys. – 1'' Aug 17 '17 at 09:26
  • 1
    does not work for me in python 3, unless the to_html function is doing something to order, but I highly doubt that. – user637338 Jun 08 '18 at 17:50
  • @user637338 you could test that easily, right? How does it look in plain python? I guess you are using ipython notebooks (therefore to_html) - Could you try that in a console and see if its still not working? – chris-sc Jun 11 '18 at 08:24
2

The above answer doesn't work for me and keep giving me "ValueError: cannot use columns parameter with orient='columns'".

Later I found a solution by doing this below and worked:

df = pd.DataFrame.from_dict (dict_data) [list (dict_data[0].keys())]
Daniel Wu
  • 21
  • 1