Reading a CSV-file with uncomplete headers

Question

I have not used the csv module in python before, but it seems like a great tool to use for my needs.

The problem I am facing is that my csv-file looks a bit strange. (it was hard to notice at first)

See the example below for what I mean:

A|B|C|D|x|F|G|x|H
1||2||3|4||5|6
1||2||3|4||5|6
1||2||3|4||5|6
1||2||3|4||5|6

There are some indexes (such as column-x) above which should be read as Dx and Gx. The D-column and G-column are then empty columns.

So what I need is that when csv is processing the header it should create the headers like Dx and Gx and neglect the D and G columns and take the value from the x-column instead.

(I tried to read the file as is with the following code first and post process it, but it is not possible since the column name x is always the same, and csv.DictReader seems to discard repetitive column-names)

with open('myCSV.log') as csvFile:
    reader = csv.DictReader(csvFile, dialect='pipes')
    for row in reader:
        print row

Can this be achieved with a dialect, or some other fancy way of configuring the csv-module or do I have to parse this myself?

root · Accepted Answer · 2013-01-17T10:12:42.783

1

Use pandas:

import pandas as pd
print pd.read_csv('test.csv',sep='|')

out:

   A   B  C   D  x  F   G  x.1  H
0  1 NaN  2 NaN  3  4 NaN    5  6
1  1 NaN  2 NaN  3  4 NaN    5  6
2  1 NaN  2 NaN  3  4 NaN    5  6
3  1 NaN  2 NaN  3  4 NaN    5  6

and if you need you can convert it to a dict using to_dict() method:

{'A': {0: 1, 1: 1, 2: 1, 3: 1},
 'B': {0: nan, 1: nan, 2: nan, 3: nan},
 'C': {0: 2, 1: 2, 2: 2, 3: 2},
 'D': {0: nan, 1: nan, 2: nan, 3: nan},
 'F': {0: 4, 1: 4, 2: 4, 3: 4},
 'G': {0: nan, 1: nan, 2: nan, 3: nan},
 'H': {0: 6, 1: 6, 2: 6, 3: 6},
 'x': {0: 3, 1: 3, 2: 3, 3: 3},
 'x.1': {0: 5, 1: 5, 2: 5, 3: 5}}

EDIT: If you need certain names for columns you can do this:

import pandas as pd
df = pd.read_csv('test.csv',sep='|')
df.columns = [df.columns[index-1]+'x' if 'x' in name 
              else name for index,name in enumerate(df.columns)]
print df

   A   B  C   D  Dx  F   G  Gx  H
0  1 NaN  2 NaN   3  4 NaN   5  6
1  1 NaN  2 NaN   3  4 NaN   5  6
2  1 NaN  2 NaN   3  4 NaN   5  6
3  1 NaN  2 NaN   3  4 NaN   5  6

If you want to lose the empty cols:

print df.dropna(axis=1,how='all')

   A  C  Dx  F  Gx  H
0  1  2   3  4   5  6
1  1  2   3  4   5  6
2  1  2   3  4   5  6
3  1  2   3  4   5  6

edited Jan 17 '13 at 10:12

answered Jan 17 '13 at 09:28

root

76,608
25
108
120

thank, pandas is new to me. However looking at your example, it seems like something I can not use. The first x column is named x, the second one is x.1 and I suppose that this continues for remaining x, meaning that I have lost the original name of the columns and replaced with something else. – theAlse Jan 17 '13 at 09:35
python magic, could you rewrite the online as an ugly for-loop (and if else ) so I can understand this better. – theAlse Jan 17 '13 at 10:30
I am accepting your answer! it should be the way to go, but I choose actually to parse the file myself in the end. – theAlse Jan 18 '13 at 14:18
@theAlse -- well, maybe it will useful for someone in the future :) – root Jan 18 '13 at 14:21

score 0 · Answer 2 · answered Jan 17 '13 at 09:27

I think reading the file and post-processing it is indeed the way to go. Just don't use DictReader:

with open('myCSV.log') as csvFile:
    for row in csv.reader(csvFile, delimiter='|'):
        print row


['A', 'B', 'C', 'D', 'x', 'F', 'G', 'x', 'H']
['1', '', '2', '', '3', '4', '', '5', '6']
['1', '', '2', '', '3', '4', '', '5', '6']
['1', '', '2', '', '3', '4', '', '5', '6']
['1', '', '2', '', '3', '4', '', '5', '6']

Gijs · Answer 3 · 2013-01-17T09:55:46.353

0

You have nine headers and nine columns, and I don't really see a problem with the .csv except for the doubled columnheaders. You can override the fieldnames in the DictReader, to rename these. Also, if you want to rename the column 'D' to 'Dx', it can be done with this parameter. .

import csv
from pprint import pprint
with open('testfile.csv') as csvFile:
reader = csv.DictReader(csvFile, fieldnames = ['A','B','C','meh','Dx','F','moe','Gx','H'], delimiter = '|'  )
for row in reader:
        pprint(row)

Using dialects here doesn't really seem necessary here! This is my testoutput, where the first line is now the header.

{'A': 'A',
 'B': 'B',
 'C': 'C',
 'Dx': 'x',
 'F': 'F',
 'Gx': 'x',
 'H': 'H',
 'meh': 'D',
 'moe': 'G'}
{'A': '1',
 'B': '',
 'C': '2',
 'Dx': '3',
 'F': '4',
 'Gx': '5',
 'H': '6',
 'meh': '',
 'moe': ''}
 ....

edited Jan 17 '13 at 09:55

answered Jan 17 '13 at 09:36

Gijs

10,346
5
27
38

this was just an example, the real file is dynamic with many columns (over 100 columns). What I need is something much more dynamic! – theAlse Jan 17 '13 at 09:39
But then you can dynamically pass the fieldnames parameter! – Gijs Jan 17 '13 at 09:41
I don´t even know the exact name of the columns. I just know the name of the x-columns. – theAlse Jan 17 '13 at 09:45
Then you will have to first read the header, generate a suitable fieldnames list, and then parse the file, I guess. – Gijs Jan 17 '13 at 09:58

score 0 · Answer 4 · answered Jan 17 '13 at 10:38

from StringIO import StringIO
import csv

with open('test.csv') as f:
    lines = f.read().split('\n')
    line_0 = lines[0].split('|')
    lines[0] = '|'.join(line_0[i-1] if 'x' in n else n for i,n in 
                                               enumerate(line_0))

    reader = csv.DictReader(StringIO('\n'.join(lines)),delimiter = '|' )
    for row in reader:
        print row

out:

{'A': '1', 'C': '2', 'B': '', 'D': '3', 'G': '5', 'F': '4', 'H': '6'}
{'A': '1', 'C': '2', 'B': '', 'D': '3', 'G': '5', 'F': '4', 'H': '6'}
{'A': '1', 'C': '2', 'B': '', 'D': '3', 'G': '5', 'F': '4', 'H': '6'}
{'A': '1', 'C': '2', 'B': '', 'D': '3', 'G': '5', 'F': '4', 'H': '6'}

Reading a CSV-file with uncomplete headers

4 Answers4