2

In Pandas, I have a dataframe, written from a csv. My end goal is to generate an XML schema from that CSV, because each of the items in the CSV correspond to a schema variable. The only solution (that I could think of) would be to read each item from that dataframe so that it generates a text file, with each value in the dataframe surrounded by a string.

TableName  Variable      Interpretation   Col4   Col5
CRASH      CRASH_ID      integer          1     
CRASH      SER_NO        range            0
CRASH      SER_NO        code             99999
CRASH      CRASH_MO_NO   code             1      January
CRASH      CRASH_MO_NO   code             2      February

Which would generate a text file that results in something along the lines of (using the first row as an example):

<table = "CRASH">
<name = "CRASH_ID">
<type = "integer">
<value = "1">

Where <table = >, <name = >, are all strings. They don't have to be formatted that way specifically (although that would be nice)-- I just need a faster way to generate this schema than typing it all out by hand from a CSV file.

It seems like the best way to do that would be to read through each row and generate a string while writing it to the output file. I've looked at the .iterrows() method, but that doesn't let me concatenate strings and tuples. I've also looked at some posts from other users, but their focus seems to be more on calculating things within dataframes, or changing the data itself, rather than generating a string from each row.

My current code is below. I understand that pandas is based off Numpy arrays, and that running "for i in df" loops is not an efficient method, but I am not really sure where to start.

EDIT: Some of the rows might need to loop through to display a certain way. For instance, the schema has multiple value codes that have strings attached:

<values>
<value code = "01">January</value>
<value code = "02">February</value>
<value code = "03">March</value>
</values>

I am thinking maybe I could group the values by "interpretation"? And then, if they have the "code" interpretation, I could do some kind of iteration through the group so that it displayed all the codes.

Here is my current code, for reference. I have updated it to reflect Randy's excellent suggestion below. I have also edited the above post to reflect some updated concerns.

import pandas as pd

text_file = open(r'oregon_output.txt', 'w')

df = pd.read_csv(r'oregon_2013_var_list.csv')

#selects only CRASH variables
crash = df['Col1'] == 'CRASH'
df_crash = df[crash]

#value which will be populated with code values from codebook
code_fill = " " 

#replaces NaN values in dataframe wih code_fill
df_crash.fillna(code_fill, inplace = True)

for row_id, row in df.iterrows():
    print '<variable>'
    for k, v in row.iterkv():
        if v is not None:
            print '<{} = "{}">'.format(k, v)
    print '</variable>'
    print
Community
  • 1
  • 1
ale19
  • 1,327
  • 7
  • 23
  • 38
  • 2
    duplicate of this? http://stackoverflow.com/questions/18574108/how-do-convert-a-pandas-dataframe-to-xml – JohnE Aug 20 '15 at 19:44
  • Sort of. However, my schema has slightly different needs-- I added a little more on to my question to expand upon it. Also, I do not fully understand the answers that were given to that question, and I do not want to necro-bump that thread since it was posted two years ago...Thank you for your response, though, it was still helpful! – ale19 Aug 24 '15 at 17:00

2 Answers2

0

It probably does make sense to just iterate through it and spit out individual rows:

for row_id, row in df.iterrows():
    for k, v in row.iterkv():
        if v is not None:
            print '<{} = "{}">'.format(k, v)
    print

enter image description here

Randy
  • 14,349
  • 2
  • 36
  • 42
  • Thank you for your comment. Unfortunately, I am not entirely sure I understand the .iterkv() method, and because it seems like it uses arguments that you arbitrarily chose (i.e. k,v), I can't google it. Could you expand a little more on what that method is, and why you chose it? – ale19 Aug 24 '15 at 16:58
  • k, v stands for each key value pair. – buzyjess Oct 17 '17 at 05:05
0

There is a to_dict() method you may want to consider in this case:

In [178]:
df.columns = ['table','name','type','value']
[["<%s='%s'>"%(k,v) for k,v in D.items()] for D in df.to_dict('records')]

Out[178]:
[["<table='CRASH'>", "<type='integer'>", "<name='CRASH_ID'>", "<value='1.0'>"],
 ["<table='CRASH'>", "<type='range'>", "<name='SER_NO'>", "<value='0.0'>"],
 ["<table='CRASH'>", "<type='code'>", "<name='SER_NO'>", "<value='99999.0'>"],
 ["<table='CRASH'>", "<type='string'>", "<name='CRASH_DT'>", "<value='nan'>"]]
CT Zhu
  • 52,648
  • 17
  • 120
  • 133