1

The following code:

import pandas as pd

from StringIO import StringIO
data = StringIO("""a,b,c
               1,2,3
               4,5,6
               6,7,8,9
               1,2,5
               3,4,5""")
pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False)

produces this output:

Skipping line 4: expected 3 fields, saw 4

   a  b  c
0  1  2  3
1  4  5  6
2  1  2  5
3  3  4  5

That is, third line is rejected because it contains four (and not the expected three) values. This csv datafile is considered to be malformed.

What if I wanted instead a different behavior, i.e. not skipping lines having more fields than expected, but keeping their values by using a larger dataframe.

In the given example this would be the behavior ('UNK' is just an example, might be any other string):

   a  b  c UNK
0  1  2  3 nan
1  4  5  6 nan
2  6  7  8  9
3  1  2  5 nan 
4  3  4  5 nan

This is just an example in which there is only one additional value, what about an arbitrary (and a priori unknown) number of fields? Is this obtainable by some way through pandas read_csv?

Please note: I can do this by using csv.reader, I am just trying to switch now to pandas.

Any help/hints is appreciated.

Lorenzo
  • 162
  • 13

2 Answers2

0

Looks like you need the names argument when reading a csv

import pandas as pd

from StringIO import StringIO
data = StringIO("""a,b,c
               1,2,3
               4,5,6
               6,7,8,9
               1,2,5
               3,4,5""")
df = pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False, names = ["a", "b", "c", "UNK"])

print(df)

Output:

                  a  b  c  UNK
0                 a  b  c  NaN
1                 1  2  3  NaN
2                 4  5  6  NaN
3                 6  7  8  9.0
4                 1  2  5  NaN
5                 3  4  5  NaN
Rakesh
  • 81,458
  • 17
  • 76
  • 113
  • The import should be `from io import StringIO` – Sohaib Farooqi Jan 29 '18 at 10:26
  • I acknowledge that your answer works in this case. But what about the case in which I do not know how many columns are there? Shall I edit the question to explicitly add this case? – Lorenzo Jan 29 '18 at 10:30
0

Supposing that Afile.csv contains :

a,b,c#Incomplete Header
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5,,8

The following function yields a DataFrame containing all fields:

def readRawValuesFromCSV(file1, separator=',', commentMark='#'):
    df = pd.DataFrame()

    with open(file1, 'r') as f:
        for line in f:
            b = line.strip().split(commentMark)
            if len(b[0])>0:
                lineList = tuple(b[0].strip().split(separator))
                df = pd.concat( [df, pd.DataFrame([lineList])], ignore_index=True )
    return df

You can test it with this code:

file1 = 'Afile.csv'
# Read all values of a (maybe malformed) CSV file
df = readRawValuesFromCSV (file1, ',', '#')

That yields:

df
   0  1  2    3    4
0  a  b  c  NaN  NaN
1  1  2  3  NaN  NaN
2  4  5  6  NaN  NaN
3  6  7  8    9  NaN
4  1  2  5  NaN  NaN
5  3  4  5         8

I am indebted with herrfz for his answer in Handling Variable Number of Columns with Pandas - Python. The present question might be a generalization of the other.

Lorenzo
  • 162
  • 13