    import glob
    import pandas as pd

    files = glob.glob("Data/*.csv")
    df = pd.concat((pd.read_csv(f) for f in files))
    print(df)

I get an error that says: "ParserError: Error tokenizing data. C error: Expected 39 fields in line 273, saw 40". Then, as per this question: import csv with different number of columns per row using Pandas, I tried passing in the names of the columns and using StringIO and BytesIO, but then I got errors like "TypeError: initial_value must be str or None, not list" or "TypeError: a bytes-like object is required, not 'list'". I am looking at over 20 csv files.
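To narrow down which file actually trips the parser (a minimal diagnostic sketch, assuming the same Data/*.csv layout as above), each file can be read on its own first:

    import glob
    import pandas as pd

    # Try each file individually so the ParserError names the offending file
    for f in glob.glob("Data/*.csv"):
        try:
            pd.read_csv(f)
        except pd.errors.ParserError as e:
            print(f, "->", e)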

Michael
  • Have you tried this answer from the link you mentioned? https://stackoverflow.com/a/52890095/8805842 It should work. – NoobVB May 21 '22 at 17:53
  • It expects a string, not a list – Michael May 21 '22 at 18:01
  • 1
    I guess, sharing the example file is the only solution... chop it a bit to smaller size (or remove any private data and hope pandaNinjas will be interested – NoobVB May 21 '22 at 18:05
  • @NoobVB, here's the example data https://www.ncdc.noaa.gov/stormevents/listevents.jsp?eventType=%28C%29+Tornado&beginDate_mm=12&beginDate_dd=01&beginDate_yyyy=1999&endDate_mm=12&endDate_dd=31&endDate_yyyy=1999&hailfilter=0.00&tornfilter=0&windfilter=000&sort=DT&submitbutton=Search&statefips=-999%2CALL , I am downloading csv's from 1990 to 1999 – Michael May 21 '22 at 18:16
  • I have managed to find year 1999 only... – NoobVB May 21 '22 at 18:28
  • Can you narrow down to **which particular file/date it fails?** It is a particular thing from _that_ file. Maybe you can upload that one. Year 1999 imports fine for me. – Zaero Divide May 21 '22 at 20:30
  • Year 1999 by itself works fine for me too, but I have downloaded 10 years of data. For example, year 1990 has different columns. I can't tell exactly which year it fails on because I have many csv files. – Michael May 21 '22 at 21:38

2 Answers


It looks like you have not tried all the solutions: the link you shared actually contains an answer, https://stackoverflow.com/a/57824142/8805842. If you inspect the last row/last column cell in your .csv file, you will see why you get the error.

Solution (a simple copy/paste from the link in your question) with two more lines added at the end to remove the unwanted/empty columns:

    import pandas as pd

    ### Loop the data lines
    with open("storm_data_search_results.csv", 'r') as temp_f:
        # get the number of columns in each line
        col_count = [len(l.split(",")) for l in temp_f.readlines()]

    ### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
    column_names = [i for i in range(0, max(col_count))]

    ### Read csv
    df = pd.read_csv("storm_data_search_results.csv", header=None, delimiter=",", names=column_names)

    # my addition
    df.columns = df.iloc[0]   # create headers from the first row
    df = df.iloc[1:, 0:39]    # drop that header row and keep the named columns only
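Since the question involves over 20 files, the same logic can be wrapped in a helper and fed to pd.concat. This is only a sketch: read_ragged_csv is a made-up name, and the 39-column cut is taken from the error message in the question:

    import glob
    import pandas as pd

    def read_ragged_csv(path):
        # Count the fields on every line so read_csv gets enough column names
        with open(path, 'r') as f:
            col_count = [len(line.split(",")) for line in f]
        df = pd.read_csv(path, header=None, delimiter=",",
                         names=range(max(col_count)))
        df.columns = df.iloc[0]   # first row holds the real headers
        return df.iloc[1:, 0:39]  # drop that row, keep the named columns

    files = glob.glob("Data/*.csv")
    df = pd.concat((read_ragged_csv(f) for f in files), ignore_index=True)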

Update: OMG, be careful... the data they give in the .csv is actually not structured properly... just scroll all the way down. If you can use any other source, use it, unless you do not need the "comments" fields and can drop them.

NoobVB

Assuming that the problem comes from the multiline text fields, which can easily get messed up, you can remove them using a regex: re.subn(r'(".*?")', "_______________", xx, xx.count('"'), re.DOTALL)
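For instance, a self-contained check of that substitution (the input string and placeholder here are arbitrary):

    import re

    raw = 'a,b,"multi\nline, comment",c\n'
    # re.DOTALL lets .*? match across the embedded newline, so the quoted
    # field (commas, line breaks and all) collapses to a one-line placeholder
    cleaned, n = re.subn(r'(".*?")', "__", raw, flags=re.DOTALL)
    print(cleaned)  # a,b,__,c
    print(n)        # 1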

Also, assuming constant headers across all the files, you can process everything as text and then parse it once:


    import glob
    import re
    from io import StringIO

    import pandas as pd

    files = glob.glob("Data/*.csv")

    # Read the headers from the first file
    headers = open(files[0]).read().split('\n', 1)[0].split(',')

    # Read all files and strip their header lines
    xx = [open(ff).read().split('\n', 1)[1] for ff in files]

    # Remove the multiline "comments" fields
    dd = [re.sub(r'(".*?")', "__", x, count=x.count('"'), flags=re.DOTALL) for x in xx]

    # Load everything as one CSV
    df = pd.read_csv(StringIO(''.join(dd)), names=headers)
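Note that the placeholder discards the quoted comment text entirely; that is only acceptable if, as the update in the first answer suggests, you do not actually need those fields.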
Zaero Divide