
I have three CSV files of tweets, each with ~5M tweets. The following code for concatenating them fails with a low-memory error. My machine has 32 GB of memory; how can I give pandas more memory for this task?

import pandas as pd

df1 = pd.read_csv('tweets.csv')
df2 = pd.read_csv('tweets2.csv')
df3 = pd.read_csv('tweets3.csv')

frames = [df1, df2, df3]
result = pd.concat(frames)

result.to_csv('tweets_combined.csv')

The error is:

$ python concantenate_dataframes.py 
sys:1: DtypeWarning: Columns (0,1,2,3,4,5,6,8,9,10,11,12,13,14,19,22,23,24) have mixed types.Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
  File "concantenate_dataframes.py", line 19, in <module>
    df2 = pd.read_csv('tweets2.csv')
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read

UPDATE: I tried the suggestions in the answer and still get an error:

$ python concantenate_dataframes.py 
Traceback (most recent call last):
  File "concantenate_dataframes.py", line 18, in <module>
    df1 = pd.read_csv('tweets.csv', low_memory=False, error_bad_lines=False)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 943, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

      File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
      File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
      File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
      File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
    pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I am running the code on Ubuntu 20.04.
Mona Jalal
  • Did you check the memory usage when running the program? – deadshot Jun 19 '20 at 05:09
  • So I have 32 GB of memory; I expect to be able to give pandas more of it, but I'm not sure how. I'll check now. – Mona Jalal Jun 19 '20 at 05:10
  • Have you tried to do as stated in the error message? I.e., have you used the `low_memory=False` option? From the error message I get the impression that it is the csv file that is strange, rather than a "real" out-of-memory error. – JohanL Jun 19 '20 at 05:11
  • Use dask to process the CSVs, then convert them to a dataframe (see the sketch after these comments). https://stackoverflow.com/questions/38757713/pandas-io-common-cparsererror-error-tokenizing-data-c-error-buffer-overflow-c – Info5ek Jun 19 '20 at 05:14
  • Looks like you are appending rows; you can use a simple file write to combine all the files into one. – deadshot Jun 19 '20 at 05:15
  • @JohanL please check the updated question – Mona Jalal Jun 19 '20 at 05:27
  • So, with the changed command line you ran into a problem already with the first file being loaded? It is very hard to say what the issue is without the actual csv files. Perhaps you can identify the issue by cutting away parts of the files in a text editor (i.e. some kind of interval halving) until you find your issue(s)? – JohanL Jun 19 '20 at 05:36
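Following up on the dask suggestion in the comments, a minimal sketch, assuming all three files share the same columns and a reasonably recent dask (the glob pattern and `single_file=True` are illustrative choices, not from the original thread):

import dask.dataframe as dd

# Read the three files lazily; pinning dtype=str avoids the mixed-type
# inference problems the C parser warned about
df = dd.read_csv('tweets*.csv', dtype=str)

# Stream the combined frame to a single output file instead of
# materializing ~15M rows in memory at once
df.to_csv('tweets_combined.csv', single_file=True, index=False)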

2 Answers


I think this is a problem with malformed data (some rows in tweets2.csv are not structured properly). For that you can use error_bad_lines=False, and/or try changing the engine from C to Python with engine='python'. For example:

df2 = pd.read_csv('tweets2.csv', error_bad_lines=False)

or:

df2 = pd.read_csv('tweets2.csv', engine='python')

or maybe:

df2 = pd.read_csv('tweets2.csv', engine='python', error_bad_lines=False)

But I recommend identifying those records and repairing them.

And if you want a hacky way to do this, merge the files outside pandas:

1) https://askubuntu.com/questions/941480/how-to-merge-multiple-files-of-the-same-format-into-a-single-file

2) https://askubuntu.com/questions/656039/concatenate-multiple-files-without-header
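A minimal sketch of that file-level merge in Python, assuming each CSV has an identical single-line header: copy the first file whole, then append the rest with their header lines skipped. This bypasses the pandas parser entirely, so any malformed rows pass through untouched.

files = ['tweets.csv', 'tweets2.csv', 'tweets3.csv']

with open('tweets_combined.csv', 'w') as out:
    for i, name in enumerate(files):
        with open(name) as f:
            if i > 0:
                next(f)  # skip the header row of every file after the first
            for line in f:
                out.write(line)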

Prashant Godhani

Specify the dtype option on import, or set low_memory=False, as the warning message suggests.
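A minimal sketch of both options (reading every column as str is a blunt fix; once the schema is known, a real per-column dtype mapping is better):

import pandas as pd

# Option 1: parse the file in one pass so dtypes are inferred once,
# not per chunk (uses more memory during the read)
df1 = pd.read_csv('tweets.csv', low_memory=False)

# Option 2: pin every column to str so no type inference happens at all
df1 = pd.read_csv('tweets.csv', dtype=str)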

deadshot