    import glob
    import pandas as pd

    files = glob.glob("Data/*.csv")
    df = pd.concat((pd.read_csv(f) for f in files))
    print(df)

I get an error that says: "ParserError: Error tokenizing data. C error: Expected 39 fields in line 273, saw 40". Then, as per this question: import csv with different number of columns per row using Pandas, I tried passing in the names of the columns and using StringIO and BytesIO, but then I got errors like "TypeError: initial_value must be str or None, not list" or "TypeError: a bytes-like object is required, not 'list'". I am looking at over 20 csv files.
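To narrow down which file actually trips the parser (a minimal diagnostic sketch, assuming the same Data/*.csv layout as above), each file can be read on its own first:

    import glob
    import pandas as pd

    # Try each file individually so the ParserError names the offending file
    for f in glob.glob("Data/*.csv"):
        try:
            pd.read_csv(f)
        except pd.errors.ParserError as e:
            print(f, "->", e)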

Michael
  • Have you tried this answer from the link you mentioned? https://stackoverflow.com/a/52890095/8805842 It should work. – NoobVB May 21 '22 at 17:53
  • It expects a string, not a list – Michael May 21 '22 at 18:01
  • 1
    I guess, sharing the example file is the only solution... chop it a bit to smaller size (or remove any private data and hope pandaNinjas will be interested – NoobVB May 21 '22 at 18:05
  • @NoobVB, here's the example data https://www.ncdc.noaa.gov/stormevents/listevents.jsp?eventType=%28C%29+Tornado&beginDate_mm=12&beginDate_dd=01&beginDate_yyyy=1999&endDate_mm=12&endDate_dd=31&endDate_yyyy=1999&hailfilter=0.00&tornfilter=0&windfilter=000&sort=DT&submitbutton=Search&statefips=-999%2CALL , I am downloading csv's from 1990 to 1999 – Michael May 21 '22 at 18:16
  • I have managed to find year 1999 only... – NoobVB May 21 '22 at 18:28
  • Can you narrow down to **which particular file/date it fails?** It is a particular thing from _that_ file. Maybe you can upload that one. Year 1999 imports fine for me. – Zaero Divide May 21 '22 at 20:30
  • Year 1999 by itself works fine for me too, but I have downloaded 10 years of data. For example, year 1990 has different columns. I can't tell exactly which year it fails on because I have many csv files. – Michael May 21 '22 at 21:38

2 Answers


It looks like you have not tried all the solutions: the link you shared actually contains an answer, https://stackoverflow.com/a/57824142/8805842. If you inspect the last row/last column cell in your .csv file, you will see why you get the error.

Solution (a simple copy/paste from the link in your question) with two more lines added at the end to remove the unwanted/empty columns:

    import pandas as pd

    ### Loop the data lines
    with open("storm_data_search_results.csv", 'r') as temp_f:
        # get the number of columns in each line
        col_count = [len(l.split(",")) for l in temp_f.readlines()]

    ### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
    column_names = [i for i in range(0, max(col_count))]

    ### Read csv
    df = pd.read_csv("storm_data_search_results.csv", header=None, delimiter=",", names=column_names)

    # my addition
    df.columns = df.iloc[0]   # create headers from the first row
    df = df.iloc[1:, 0:39]    # drop that header row and keep the named columns only
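Since the question involves over 20 files, the same logic can be wrapped in a helper and fed to pd.concat. This is only a sketch: read_ragged_csv is a made-up name, and the 39-column cut is taken from the error message in the question:

    import glob
    import pandas as pd

    def read_ragged_csv(path):
        # Count the fields on every line so read_csv gets enough column names
        with open(path, 'r') as f:
            col_count = [len(line.split(",")) for line in f]
        df = pd.read_csv(path, header=None, delimiter=",",
                         names=range(max(col_count)))
        df.columns = df.iloc[0]   # first row holds the real headers
        return df.iloc[1:, 0:39]  # drop that row, keep the named columns

    files = glob.glob("Data/*.csv")
    df = pd.concat((read_ragged_csv(f) for f in files), ignore_index=True)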

Update: OMG, be careful... the data they give in the .csv is actually not structured properly... just scroll all the way down. If you can use any other source, use it, unless you do not need the "comments" fields and can drop them.

NoobVB

Assuming that the problem comes from the multiline text fields, which can easily get messed up, you can remove them using a regex: re.subn(r'(".*?")', "_______________", xx, xx.count('"'), re.DOTALL)
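For instance, a self-contained check of that substitution (the input string and placeholder here are arbitrary):

    import re

    raw = 'a,b,"multi\nline, comment",c\n'
    # re.DOTALL lets .*? match across the embedded newline, so the quoted
    # field (commas, line breaks and all) collapses to a one-line placeholder
    cleaned, n = re.subn(r'(".*?")', "__", raw, flags=re.DOTALL)
    print(cleaned)  # a,b,__,c
    print(n)        # 1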

Also, assuming constant headers across all the files, you can process everything as text and then parse it once:


    import glob
    import re
    from io import StringIO

    import pandas as pd

    files = glob.glob("Data/*.csv")

    # Read the headers from the first file
    headers = open(files[0]).read().split('\n', 1)[0].split(',')

    # Read all files and strip their header lines
    xx = [open(ff).read().split('\n', 1)[1] for ff in files]

    # Remove the multiline "comments" fields
    dd = [re.sub(r'(".*?")', "__", x, count=x.count('"'), flags=re.DOTALL) for x in xx]

    # Load everything as one CSV
    df = pd.read_csv(StringIO(''.join(dd)), names=headers)
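Note that the placeholder discards the quoted comment text entirely; that is only acceptable if, as the update in the first answer suggests, you do not actually need those fields.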
Zaero Divide