running df.shape on a 6.5G csv dataframe throws error

Question

How should I handle a situation where something as simple as finding the shape of my CSV DataFrame throws an error?

import pandas as pd

df = pd.read_csv("tweets_withheader.csv")

print(df.shape)

Error is:

Traceback (most recent call last):
  File "explore.py", line 4, in <module>
    df = pd.read_csv("tweets_withheader.csv")
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 25 fields in line 20415302, saw 26

With a little change, I get this other error:

Traceback (most recent call last):
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2891, in _next_iter_line
    return next(self.data)
_csv.Error: line contains NUL

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "explore.py", line 4, in <module>
    df = pd.read_csv("tweets_withheader.csv", engine="python")
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2431, in read
    content = self._get_lines(rows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 3181, in _get_lines
    new_row = self._next_iter_line(row_num=self.pos + rows + 1)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2914, in _next_iter_line
    self._alert_malformed(msg, row_num)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2872, in _alert_malformed
    raise ParserError(msg)
pandas.errors.ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead

So, changing the engine to c gave me the following error:

Traceback (most recent call last):
  File "explore.py", line 4, in <module>
    df = pd.read_csv("tweets_withheader.csv", engine="c")
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 25 fields in line 20415302, saw 26

I have changed it to the following, and it has been running for the past 20 minutes without finishing, for something as simple as df.shape. How can I accelerate this? I have 12 cores and 32G of memory.

import pandas as pd

df = pd.read_csv("tweets_withheader.csv", engine="c", error_bad_lines=False)

print(df.shape)

The first 10 lines of the CSV file:

$ head -10 tweets_withheader.csv 
,coordinates,created_at,favorite_count,favorited,tweet_id,lang,quote_count,reply_count,retweet_count,retweeted,text,timestamp_ms,user_id,user_description,user_followers_count,user_favorite_count,user_following_count,user_friends_count,user_location,user_screenname,user_statuscount,user_profile_image,user_name,user_verified
0,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884419588097,en,0,0,0,False,"Minister of Climate Change visits Dubai’s Waterfront Market
#wamnews
",1568144935122,2789527352,The Official Account for Emirates News Agency - WAM / English,27961,1,,2,UAE,,50437,http://pbs.twimg.com/profile_images/1079742896746782722/DSl4mVFS_normal.jpg,WAM News / English,True
1,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884889321474,en,0,0,0,False,"RT @NASAClimate: While the Sun can influence Earth’s climate, the warming seen over the last few decades is too large to be caused by chang…",1568144935234,749609111390674944,,10,36,,40,,,13,http://pbs.twimg.com/profile_images/1170446717386416128/WgLEF4P4_normal.jpg,嘎呗叽,False
2,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510885094846465,en,0,0,0,False,"RT @pocockdavid: Saturday was #ThreatenedSpeciesDay - the anniversary of the death of the last known Thylacine.

Australia has one of the h…",1568144935283,2800740344,PhD student @SFU @E2ocean studying the invasion ecology of zebra mussels. #freshwatermussels (he/him)  ️‍,233,841,,815,xwməθkwəy̓əm territory,,1589,http://pbs.twimg.com/profile_images/1097371698851065856/mcFt5BFu_normal.png,Steven Brownlee,False
3,,Tue Sep 10 19:48:55 +0000 2019,0,False,1171510884797075456,en,0,0,0,False,"RT @CNN: This stadium has been transformed into a forest. The installation, inspired by a dystopian drawing from decades ago, is intended t…",1568144935212,793517851109883904,"I don't really like your Tweets 
Doctor of Veterinary Medicine

Is the CSV reading part taking the time, or the call to `df.shape`? [Maybe try reading it in chunks](https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas), though the traceback shows there may be some data issues which pandas isn't able to parse — anky, Jun 22 '20 at 03:54
What I am trying to say is something as simple as df.shape is taking forever. I am not sure about your answer as I am running it with command prompt not something like Jupyter. Perhaps I could find the answer with timeit but I haven't done so — Mona Jalal, Jun 22 '20 at 03:56
The last piece of code is working now (not sure if I should go with it), but it has now been running for around 40 min — Mona Jalal, Jun 22 '20 at 03:57
there's an issue with reading the data, make it ignore errors or do something with them — Derek Eden, Jun 22 '20 at 04:56
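
The chunked reading suggested in the comments above could look roughly like the sketch below. It is only an illustration under assumptions: `csv_shape` is a hypothetical helper name, the chunk size is arbitrary, and the small in-memory sample stands in for the real file. It counts rows and columns one chunk at a time, so the full DataFrame is never held in memory. Whether this is faster on the real 6.5G file has not been measured here.

```python
import io

import pandas as pd


def csv_shape(source, chunksize=100_000, **read_csv_kwargs):
    """Return (n_rows, n_cols) by streaming the CSV in chunks.

    Hypothetical helper: only one chunk is kept in memory at a time.
    Extra keyword arguments are forwarded to pd.read_csv, for example
    on_bad_lines="skip" (pandas >= 1.3) or error_bad_lines=False
    (older pandas, as used in the question).
    """
    n_rows = 0
    n_cols = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        n_rows += len(chunk)
        n_cols = chunk.shape[1]
    return n_rows, n_cols


# Tiny stand-in for the real file, including a quoted field with a newline.
sample = io.StringIO(
    'id,text,lang\n'
    '0,"line one\nline two",en\n'
    '1,"plain",en\n'
    '2,"a, b",fr\n'
)
print(csv_shape(sample, chunksize=2))
```

For the real file, the call would be something like `csv_shape("tweets_withheader.csv", index_col=0)`, with the bad-line option added if malformed rows should be skipped rather than raise.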

Christian Eslabon · Answer 1 · 2020-06-22T05:39:04.640

1

Try using dask.

import csv

import dask.dataframe as dd

# csv.QUOTE_NONE needs the csv module; the line terminator must be a single character
df = dd.read_csv("tweets_withheader.csv", quoting=csv.QUOTE_NONE, header=None, lineterminator="\n")
df = df.compute()
print(df.shape)

Source:

edited Jun 22 '20 at 05:39

answered Jun 22 '20 at 03:49

Christian Eslabon


I am using modin.pandas and am getting distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting error – Mona Jalal Jun 22 '20 at 03:53
I got this error using your code ParserError: Error tokenizing data. C error: EOF inside string starting at row 559 – Mona Jalal Jun 22 '20 at 03:55
Thanks Ian, with your new update I get this other error ParserError: Error tokenizing data. C error: Expected 25 fields in line 5, saw 26 complete log here: https://pastebin.com/raw/776SCY08 – Mona Jalal Jun 22 '20 at 04:03
setting the header to None still throws the same error plus I need to later on access columns based on their header column names ParserError: Error tokenizing data. C error: Expected 25 fields in line 5, saw 26 for line of code df= dd.read_csv("tweets_withheader.csv", quoting=csv.QUOTE_NONE, header=None) – Mona Jalal Jun 22 '20 at 04:13
yeah I already tried it because I had it also in my code and unfortunately no chance https://pastebin.com/raw/83u46uSE it also doesn't print the shape after showing all these warnings (I didn't copy everything) – Mona Jalal Jun 22 '20 at 04:27
please check the last part of my post. Seems the problem might have been caused due to newline char in tweets texts – Mona Jalal Jun 22 '20 at 04:32
With your last edit I get this error ParserError: Error tokenizing data. C error: Expected 25 fields in line 5, saw 26 – Mona Jalal Jun 22 '20 at 15:24
Can you show what line 4, 5 and line 18 of your csv looks like so we can compare? Please run the 3 codes below – Christian Eslabon Jun 22 '20 at 16:19
pd.read_csv("tweets_withheader.csv",skiprows = 3, nrows = 1) – Christian Eslabon Jun 22 '20 at 16:20
pd.read_csv("tweets_withheader.csv",skiprows = 4, nrows = 1) – Christian Eslabon Jun 22 '20 at 16:20
pd.read_csv("tweets_withheader.csv",skiprows = 17, nrows = 1) – Christian Eslabon Jun 22 '20 at 16:21
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216444/discussion-between-mona-jalal-and-ian). – Mona Jalal Jun 22 '20 at 16:27
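
Since the thread points to newlines inside tweet text and to NUL bytes as likely culprits, one way to locate the offending records before handing the file to pandas is to scan it with Python's standard `csv` module. The sketch below is an assumption-laden illustration, not the method used in the thread: `find_bad_records` is a hypothetical helper, and the in-memory sample stands in for the real file. It strips NUL bytes on the fly, lets the `csv` reader treat quoted newlines as part of a field, and reports records whose field count differs from the expected one.

```python
import csv
import io


def find_bad_records(lines, expected_fields, limit=5):
    """Return (total_records, bad), where bad lists (line_number, field_count).

    Hypothetical helper. `lines` is any iterable of text lines that keep
    their line endings (e.g. a file opened with newline=""). NUL bytes are
    removed before parsing. total_records includes the header row, and
    line_number is the physical line on which the bad record ends.
    """
    cleaned = (line.replace("\x00", "") for line in lines)
    reader = csv.reader(cleaned)
    total = 0
    bad = []
    for total, row in enumerate(reader, start=1):
        if len(row) != expected_fields and len(bad) < limit:
            bad.append((reader.line_num, len(row)))
    return total, bad


# Tiny stand-in for the real file: one multi-line tweet, one record with
# an extra field, and one record containing a NUL byte.
sample = io.StringIO(
    'id,text,lang\n'
    '0,"first line\nsecond line",en\n'
    '1,"ok",en,extra\n'
    '2,"nul\x00byte",fr\n'
)
total, bad = find_bad_records(sample, expected_fields=3)
print(total, bad)
```

For the real file, this would be called on `open("tweets_withheader.csv", newline="", encoding="utf-8")` with `expected_fields=25`; the UTF-8 encoding is an assumption about how the file was written.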
