I have large csv files with size more than 10 mb each and about 50+ such files. These inputs have more than 25 columns and more than 50K rows.
All these have same headers and I am trying to merge them into one csv with headers to be mentioned only one time.
Option: One Code: Working for small sized csv -- 25+ columns but size of the file in kbs.
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
But the above code does not work for the larger files and gives the error.
Error:
Traceback (most recent call last):
File "merge_large.py", line 6, in <module>
all_files = glob.glob("*.csv", encoding='utf8', engine='python')
TypeError: glob() got an unexpected keyword argument 'encoding'
lakshmi@lakshmi-HP-15-Notebook-PC:~/Desktop/Twitter_Lat_lon/nasik_rain/rain_2$ python merge_large.py
Traceback (most recent call last):
File "merge_large.py", line 10, in <module>
df = pd.read_csv(file_,index_col=None, header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Code: Columns 25+ but size of the file more than 10mb
Option: Four
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
Error:
Traceback (most recent call last):
File "merge_large.py", line 6, in <module>
allFiles = glob.glob("*.csv", sep=None)
TypeError: glob() got an unexpected keyword argument 'sep'
I have searched extensively but I am not able to find a solution to concatenate large csv files with same headers into one file.
Edit:
Code:
import dask.dataframe as dd
ddf = dd.read_csv('*.csv')
ddf.to_csv('master.csv',index=False)
Error:
Traceback (most recent call last):
File "merge_csv_dask.py", line 5, in <module>
ddf.to_csv('master.csv',index=False)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 792, in to_csv
return to_csv(self, filename, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io.py", line 762, in to_csv
compute(*values)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 58, in get
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 481, in get_async
raise(remote_exception(res, tb))
dask.async.ValueError: could not convert string to float: {u'type': u'Point', u'coordinates': [4.34279, 50.8443]}
Traceback
---------
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 49, in bytes_read_csv
coerce_dtypes(df, dtypes)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 73, in coerce_dtypes
df[c] = df[c].astype(dtypes[c])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2950, in astype
raise_on_error=raise_on_error, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2938, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2890, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 434, in astype
values=values, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 477, in _astype
values = com._astype_nansafe(values.ravel(), dtype, copy=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/common.py", line 1920, in _astype_nansafe
return arr.astype(dtype
)