pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file

Question

I have 50+ large CSV files, each more than 10 MB, with more than 25 columns and more than 50K rows.

All of them have the same headers, and I am trying to merge them into one CSV with the header written only once.

Option One: code that works for small CSV files (25+ columns, but file sizes in KB).

import pandas as pd
import glob

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))

full_df = pd.concat(df_list)

full_df.to_csv('output.csv')

But the above code does not work for the larger files and gives the error.

Error:

Traceback (most recent call last):
  File "merge_large.py", line 6, in <module>
    all_files = glob.glob("*.csv", encoding='utf8', engine='python')     
TypeError: glob() got an unexpected keyword argument 'encoding'
lakshmi@lakshmi-HP-15-Notebook-PC:~/Desktop/Twitter_Lat_lon/nasik_rain/rain_2$ python merge_large.py 
Traceback (most recent call last):
  File "merge_large.py", line 10, in <module>
    df = pd.read_csv(file_,index_col=None, header=0)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
  File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Code: 25+ columns, but file size more than 10 MB

Option: Two Option: Three

Option: Four

import pandas as pd
import glob

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))

full_df = pd.concat(df_list)

full_df.to_csv('output.csv')

Error:

Traceback (most recent call last):
  File "merge_large.py", line 6, in <module>
    allFiles = glob.glob("*.csv", sep=None)
TypeError: glob() got an unexpected keyword argument 'sep'

I have searched extensively but I am not able to find a solution to concatenate large csv files with same headers into one file.

Edit:

Code:

import dask.dataframe as dd  

ddf = dd.read_csv('*.csv')

ddf.to_csv('master.csv',index=False)

Error:

Traceback (most recent call last):
  File "merge_csv_dask.py", line 5, in <module>
    ddf.to_csv('master.csv',index=False)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 792, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io.py", line 762, in to_csv
    compute(*values)
  File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 179, in compute
    results = get(dsk, keys, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 58, in get
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 481, in get_async
    raise(remote_exception(res, tb))
dask.async.ValueError: could not convert string to float: {u'type': u'Point', u'coordinates': [4.34279, 50.8443]}

Traceback
---------
  File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 263, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 245, in _execute_task
    return func(*args2)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 49, in bytes_read_csv
    coerce_dtypes(df, dtypes)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 73, in coerce_dtypes
    df[c] = df[c].astype(dtypes[c])
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2950, in astype
    raise_on_error=raise_on_error, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2938, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2890, in apply
    applied = getattr(b, f)(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 434, in astype
    values=values, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 477, in _astype
    values = com._astype_nansafe(values.ravel(), dtype, copy=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/common.py", line 1920, in _astype_nansafe
    return arr.astype(dtype)

Linwoodc3 · Accepted Answer · 2016-08-26T15:31:32.537

If I understand your problem, you have large csv files with the same structure that you want to merge into one big CSV file.

My suggestion is to use dask from Continuum Analytics to handle this job. You can merge your files but also perform out-of-core computations and analysis of the data just like pandas.

### make sure you include the [complete] tag
pip install dask[complete]

Solution Using Your Sample Data from DropBox

First, check versions of dask. For me, dask = 0.11.0 and pandas = 0.18.1

import dask
import pandas as pd
print (dask.__version__)
print (pd.__version__)

Here's the code to read in ALL your csvs. I had no errors using your DropBox example data.

import glob

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = glob.glob('/Users/linwood/Downloads/stack_bundle/rio*.csv')

'''
The key to getting around the CParse error was using sep=None
Came from this post
http://stackoverflow.com/questions/37505577/cparsererror-error-tokenizing-data
'''

# custom reader function; sep=None makes pandas sniff the delimiter, which needs the Python parsing engine
def reader(filename):
    return pd.read_csv(filename, sep=None, engine='python')

# build list of delayed pandas csv reads; then read in as dask dataframe

dfs = [delayed(reader)(fn) for fn in filenames]
df = dd.from_delayed(dfs)
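
To see what sep=None does in isolation, here is a minimal sketch on made-up in-memory data (the column names and values are hypothetical, not taken from the DropBox files). With sep=None, pandas cannot use the C parser, so it falls back to the Python engine and sniffs the delimiter from the data.

```python
import io

import pandas as pd

# Hypothetical sample: semicolon-delimited text standing in for a real file.
raw = "tweetID;tweetFavoriteCt\n1;5\n2;0\n"

# sep=None asks pandas to detect the delimiter; engine='python' is stated
# explicitly because the C engine cannot auto-detect separators.
sniffed = pd.read_csv(io.StringIO(raw), sep=None, engine="python")

print(list(sniffed.columns))
print(len(sniffed))
```

Sniffing only guesses the delimiter; it does not repair rows that are genuinely malformed, so it is worth checking the result before trusting it.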


'''
This is the final step.  The .compute() code below turns the 
dask dataframe into a single pandas dataframe with all your
files merged. If you don't need to write the merged file to
disk, I'd skip this step and do all the analysis in 
dask. Get a subset of the data you want and save that.  
'''
df = df.reset_index().compute()
df.to_csv('./test.csv')

The rest of this is extra stuff. It assumes df is still the dask dataframe, so run it before the .compute() step above.

# print the count of values in each column; perfect data would have the same count
# you have dirty data as the counts will show

print (df.count().compute())
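
A related check, sketched here with only the standard library, is to list the rows whose field count differs from the header, since such rows are a common cause of tokenizing errors. The helper name find_ragged_rows and the sample text are hypothetical; real files may need a different delimiter or quoting setup.

```python
import csv
import io


def find_ragged_rows(handle, delimiter=","):
    """Return (line_number, field_count) for rows that do not match the header width."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if row and len(row) != expected:
            # reader.line_num is the physical line on which this row ended
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample with one row that has an extra field.
sample = io.StringIO("id,text,favs\n1,hello,3\n2,oops,extra,7\n3,fine,0\n")
ragged = find_ragged_rows(sample)
print(ragged)
```

For a file on disk, the same function can be called with open(path, newline="") as the handle.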

The next step is some pandas-like analysis. Here I first "clean" your data in the 'tweetFavoriteCt' column. Not all of the values are integers, so I replace strings with 0 and convert everything else to an integer. With the integer conversion done, I show a simple analytic: filtering the entire dataframe to only the rows where the favorite count is greater than 3.

# function to convert values to integers and replace non-numeric strings with 0; sample analytics in dask dataframe
# you can come up with your own; this is just an example
def conversion(value):
    try:
        return int(value)
    except (ValueError, TypeError, OverflowError):
        return 0

# apply the function to the column, create a new column of cleaned data
# meta tells dask the name and dtype of the result; conversion() returns integers
clean = df['tweetFavoriteCt'].apply(conversion, meta=('tweetFavoriteCt', 'int64'))

# set new column equal to our cleaning code above; your data is dirty :-(
df['cleanedFavoriteCt'] = clean
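
For comparison, once the data fits in memory the same cleaning can be sketched in plain pandas with pd.to_numeric. The sample values below are made up. Note one difference from conversion(): a string such as '3.0' becomes 3 here, whereas int('3.0') raises and conversion() would return 0.

```python
import pandas as pd

# Hypothetical dirty column: numbers stored as strings, plus junk and a missing value.
favs = pd.Series(["5", "abc", None, "7000"], dtype=object, name="tweetFavoriteCt")

# errors='coerce' turns anything unparseable into NaN; then NaN -> 0 and cast to int.
cleaned = pd.to_numeric(favs, errors="coerce").fillna(0).astype(int)

# keep only the entries with more than 3 favorites
popular = cleaned[cleaned > 3]

print(cleaned.tolist())
print(popular.tolist())
```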

The last bit of code shows dask analysis, how to write the merged file to disk, and how to load it into pandas. Be warned: if you have tons of CSVs, the .compute() call below loads the entire merged data into memory.

# retrieve the 50 tweets with the highest favorite count
print(df.nlargest(50, ['cleanedFavoriteCt']).compute())

# only show me the tweets that have been favorited more than 3 times
# TweetID 763525237166268416, is VERRRRY popular....7000+ favorites
print(df[df.cleanedFavoriteCt > 3].compute())

'''
This is the final step.  The .compute() code below turns the 
dask dataframe into a single pandas dataframe with all your
files merged. If you don't need to write the merged file to
disk, I'd skip this step and do all the analysis in 
dask. Get a subset of the data you want and save that.  
'''
df = df.reset_index().compute()
df.to_csv('./test.csv')

Now, if you want to switch to pandas for the merged csv file:

import pandas as pd
dff = pd.read_csv('./test.csv')

Let me know if this works.

Stop here

ARCHIVE: Previous solution; a good example of using dask to merge CSVs

The first step is making sure you have dask installed. There are install instructions for dask on the documentation page, but the pip install dask[complete] command shown above should work.

With dask installed it's easy to read in the files.

Some housekeeping first. Assume we have a directory with csvs where the filenames are my18.csv, my19.csv, my20.csv, etc. Name standardization and single directory location are key. This works if you put your csv files in one directory and serialize the names in some way.

In steps:

Import dask and read all the csv files in using a wildcard. This merges all csvs into one single dask.dataframe object. You can do pandas-like operations immediately after this step if you want.

import dask.dataframe as dd  
ddf = dd.read_csv('./daskTest/my*.csv')
ddf.describe().compute()

Write merged dataframe file to disk in the same directory as original files and name it master.csv

ddf.to_csv('./daskTest/master.csv',index=False)

Optionally, read master.csv, which is much bigger, into a dask.dataframe object for computations. This can also be done after step one above; dask can perform pandas-like operations on the staged files. This is a way to do "big data" in Python.

# reads in the merged file as one BIG out-of-core dataframe; can perform functions like pandas
newddf = dd.read_csv('./daskTest/master.csv')

# check the length; this is now the length of all merged files. In this example, 50,000 rows times 11 = 550,000 rows.
len(newddf)

# perform pandas-like summary stats on entire dataframe
newddf.describe().compute()

Hopefully this helps answer your question. In three steps, you read in all the files, merge to single dataframe, and write that massive dataframe to disk with only one header and all your rows.
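
If all that is needed is the merged file, with no analysis, the same idea can be sketched without pandas or dask by streaming rows with the standard csv module. This is a minimal sketch, not the approach used above; merge_csvs is a hypothetical helper, and it assumes every input has the same header and delimiter.

```python
import csv
import os
import tempfile


def merge_csvs(paths, out_path):
    """Concatenate CSV files row by row, writing the header only once."""
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                first = next(reader, None)
                if first is None:
                    continue  # skip empty files
                if header is None:
                    header = first
                    writer.writerow(header)
                elif first != header:
                    raise ValueError("header mismatch in %s" % path)
                # copy the remaining rows one at a time
                writer.writerows(reader)


# Small demo on two hypothetical files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, rows in enumerate(["id,favs\n1,3\n", "id,favs\n2,7\n"]):
        part = os.path.join(tmp, "part%d.csv" % i)
        with open(part, "w", newline="") as f:
            f.write(rows)
        parts.append(part)
    merged_path = os.path.join(tmp, "merged.csv")
    merge_csvs(sorted(parts), merged_path)
    with open(merged_path, newline="") as f:
        merged_rows = list(csv.reader(f))

print(merged_rows)
```

When calling this on a real directory with glob, write the output somewhere the glob pattern does not match, or the merged file may be picked up as an input on the next run.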

Thank you so much for the detail explanation. This is a lot informative. Let me try the code and will let you know if I have any doubts. Thanks again :) — Sitz Blogz, Aug 04 '16 at 14:11
Apologies for late reply. I am getting error and I have included the code and error in the edit section. Please could you check. — Sitz Blogz, Aug 18 '16 at 19:39
Do you have some sample data; I see you have a geojson point, but do you have any sample data? — Linwoodc3, Aug 19 '16 at 23:36
Please check the link, for the one of the datasets and yes they have one column with json geo coordinates https://www.dropbox.com/s/xggx8xy9gm9ujpp/rio2016_18_8.csv?dl=0 — Sitz Blogz, Aug 20 '16 at 08:42
Will look; a few minutes ago, I figured out how to use dask to clean merge and clean multiple JSON files from the Twitter API..so that's good news for figuring out your solution. Will look at your data since you used csv instead of json. In the meantime, here's my pipeline for json and merging: http://stackoverflow.com/questions/38760864/how-do-you-transpose-a-dask-dataframe-convert-columns-to-rows-to-approach-tidy — Linwoodc3, Aug 20 '16 at 15:14
@SitzBlogz, take a look at the new edited solution above. See if you can at least perform pandas like operations. — Linwoodc3, Aug 20 '16 at 21:42
I am still having same kind of error .. I am posting the entire bundle of code, error and couple of input files. The input files range from few kbs to mbs so few of them I have put in dropbox link https://www.dropbox.com/s/gxspyzvfq2nzgv0/stack_bundle.zip?dl=0 — Sitz Blogz, Aug 21 '16 at 08:16
Solution should work now; I used the example data from your DropBox and had no problems reading in all the csv files and, working with it in dask, and writing a master merged csv back to disk. Let me know if this worked. — Linwoodc3, Aug 24 '16 at 01:42
