Overriding 'default' number of fields when reading csv

Question

The following code:

import pandas as pd

from StringIO import StringIO
data = StringIO("""a,b,c
               1,2,3
               4,5,6
               6,7,8,9
               1,2,5
               3,4,5""")
pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False)

produces this output:

Skipping line 4: expected 3 fields, saw 4

   a  b  c
0  1  2  3
1  4  5  6
2  1  2  5
3  3  4  5

That is, third line is rejected because it contains four (and not the expected three) values. This csv datafile is considered to be malformed.

What if I wanted instead a different behavior, i.e. not skipping lines having more fields than expected, but keeping their values by using a larger dataframe.

In the given example this would be the behavior ('UNK' is just an example, might be any other string):

   a  b  c UNK
0  1  2  3 nan
1  4  5  6 nan
2  6  7  8  9
3  1  2  5 nan 
4  3  4  5 nan

This is just an example in which there is only one additional value, what about an arbitrary (and a priori unknown) number of fields? Is this obtainable by some way through pandas read_csv?

Please note: I can do this by using csv.reader, I am just trying to switch now to pandas.

Any help/hints is appreciated.

You haven't specified all columns in `StringIO` constructor. Change `StringIO("""a,b,c....""")` to `StringIO("""a,b,c,ukk...""")` — Sohaib Farooqi, Jan 29 '18 at 10:33
No I intentionally did this. The idea is to work with a 'malformed' csv file in which some lines have nr of fields greater than the nr of fields in the header — Lorenzo, Jan 29 '18 at 10:34
maybe a duplicate of https://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python — Lorenzo, Jan 29 '18 at 10:46

score 0 · Answer 1 · answered Jan 29 '18 at 10:24

0

Looks like you need the names argument when reading a csv

import pandas as pd

from StringIO import StringIO
data = StringIO("""a,b,c
               1,2,3
               4,5,6
               6,7,8,9
               1,2,5
               3,4,5""")
df = pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False, names = ["a", "b", "c", "UNK"])

print(df)

Output:

                  a  b  c  UNK
0                 a  b  c  NaN
1                 1  2  3  NaN
2                 4  5  6  NaN
3                 6  7  8  9.0
4                 1  2  5  NaN
5                 3  4  5  NaN

answered Jan 29 '18 at 10:24

Rakesh

81,458
17
76
113

The import should be `from io import StringIO` – Sohaib Farooqi Jan 29 '18 at 10:26
I acknowledge that your answer works in this case. But what about the case in which I do not know how many columns are there? Shall I edit the question to explicitly add this case? – Lorenzo Jan 29 '18 at 10:30

score 0 · Accepted Answer · answered Jan 29 '18 at 15:01

Supposing that Afile.csv contains :

a,b,c#Incomplete Header
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5,,8

The following function yields a DataFrame containing all fields:

def readRawValuesFromCSV(file1, separator=',', commentMark='#'):
    df = pd.DataFrame()

    with open(file1, 'r') as f:
        for line in f:
            b = line.strip().split(commentMark)
            if len(b[0])>0:
                lineList = tuple(b[0].strip().split(separator))
                df = pd.concat( [df, pd.DataFrame([lineList])], ignore_index=True )
    return df

You can test it with this code:

file1 = 'Afile.csv'
# Read all values of a (maybe malformed) CSV file
df = readRawValuesFromCSV (file1, ',', '#')

That yields:

df
   0  1  2    3    4
0  a  b  c  NaN  NaN
1  1  2  3  NaN  NaN
2  4  5  6  NaN  NaN
3  6  7  8    9  NaN
4  1  2  5  NaN  NaN
5  3  4  5         8

I am indebted with herrfz for his answer in Handling Variable Number of Columns with Pandas - Python. The present question might be a generalization of the other.

Overriding 'default' number of fields when reading csv

2 Answers2