
I have a data set that looks like this (at most 5 columns, but sometimes fewer):

1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
....

I am trying to use pandas' `read_table` to read this into a 5-column data frame. I would like to read it in without additional massaging.

If I try

import pandas as pd
my_cols = ['A', 'B', 'C', 'D', 'E']
my_df = pd.read_table(path, sep=',', header=None, names=my_cols)

I get an error: "column names have 5 fields, data has 3 fields".

Is there any way to make pandas fill in NaN for the missing columns while reading the data?

Jackie Shephard

3 Answers


One way that seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):

>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> import pandas as pd
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN

Note that this approach requires you to give names to the columns you want, though. It's not as general as some other ways, but it works well enough when it applies.

DSM
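For a self-contained run of the above (a sketch that first writes the sample data from the question out to `ragged.csv`):

import pandas as pd

# Recreate the ragged sample file from the question.
with open('ragged.csv', 'w') as f:
    f.write('1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n')

my_cols = ['A', 'B', 'C', 'D', 'E']
# The python engine pads short rows with NaN instead of raising an error.
df = pd.read_csv('ragged.csv', names=my_cols, engine='python')
print(df)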
• Thank you! This worked; `engine='python'` seems to be the key. Adding this argument makes both `read_table` and `read_csv` work. – Jackie Shephard Mar 06 '13 at 16:34
• This seems pretty warty to me. Adding a GitHub issue: https://github.com/pydata/pandas/issues/2981 – Wes McKinney Mar 06 '13 at 19:53
• What fixed it for me was `names=my_cols`, where `my_cols` was at least as long as the line with the most fields. If the maximum number of fields isn't known in advance, you can extract it dynamically by reading the file beforehand: `with open('my.csv') as f: num_cols = max(len(line.split(',')) for line in f); f.seek(0); df = pd.read_csv(f, names=range(num_cols))` (expanded in the sketch after these comments). The downside is that the file is read twice. – EliadL Nov 08 '18 at 09:53
• With pandas version 0.23.4, `pd.read_csv(file, names=my_cols)` works even if `len(my_cols)` is less than the number of fields on one or more lines; the extra fields just get discarded. – Luca Mar 18 '19 at 17:31
• With pandas 0.25.3 it fails if some row is longer than `my_cols`. – Gena Kukartsev Feb 20 '20 at 18:47
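EliadL's two-pass trick above, written out for readability (a sketch; `my.csv` stands in for whatever file you are reading, as in the comment):

import pandas as pd

with open('my.csv') as f:
    # First pass: find the widest row so every field gets a column.
    num_cols = max(len(line.split(',')) for line in f)
    f.seek(0)  # rewind so read_csv sees the file from the start
    # Second pass: integer column names 0..num_cols-1; short rows get NaN.
    df = pd.read_csv(f, header=None, names=range(num_cols))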

I'd also be interested to know whether this is possible; from the docs it doesn't seem to be. What you could do instead is read the file line by line and concatenate each reading to a DataFrame:

import pandas as pd

df = pd.DataFrame()

with open(filepath, 'r') as f:
    for line in f:
        # Split the line into fields and append it as a one-row frame.
        row = pd.DataFrame([tuple(line.strip().split(','))])
        df = pd.concat([df, row], ignore_index=True)

It works, but not in the most elegant way, I guess...

herrfz
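A less wasteful variant of the same loop (a sketch, reusing `filepath` from the answer above): `pd.concat` inside the loop copies the entire frame on every iteration, so collecting the rows in a plain list and building the DataFrame once at the end is linear rather than quadratic.

import pandas as pd

rows = []
with open(filepath, 'r') as f:
    for line in f:
        rows.append(line.strip().split(','))

# pandas pads rows of unequal length with NaN when building the frame.
df = pd.DataFrame(rows)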

OK, I'm not sure how efficient this is, but here is what I have done. I would love to hear if there is a better way to do this. Thanks!

from pandas import DataFrame

list_of_dicts = []
labels = ['A', 'B', 'C', 'D', 'E']
with open(path) as f:
    for line in f:
        line = line.rstrip('\n')
        # Short rows simply omit the later labels from the dict;
        # pandas fills those columns with NaN.
        list_of_dicts.append(dict(zip(labels, line.split(','))))
frame = DataFrame(list_of_dicts)
Jackie Shephard
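One caveat with the `dict(zip(...))` approach: `zip` stops at the shorter of its two arguments, so a line with more than five fields would be silently truncated rather than raising an error. A sketch of a guard, if that matters (reusing `path` from the question):

from pandas import DataFrame

labels = ['A', 'B', 'C', 'D', 'E']
list_of_dicts = []
with open(path) as f:
    for line in f:
        fields = line.rstrip('\n').split(',')
        # dict(zip(...)) would silently drop fields beyond the label list.
        if len(fields) > len(labels):
            raise ValueError('row wider than label list: %r' % (fields,))
        list_of_dicts.append(dict(zip(labels, fields)))
frame = DataFrame(list_of_dicts)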