pd.read_csv fails after converting the timezone

Question

So after I converted the UTC timezone in the Time column of my dataframe and saved it to a new csv file, I decided to draw a time plot of frequency of tweets. My time plot was initially working when timezone was UTC but after being converted to Eastern, it gives me the error below. How should I fix it?

import pandas as pd
import matplotlib.pyplot as plt


time_interval = pd.offsets.Second(10)

fig, ax = plt.subplots(figsize=(6, 3.5))

ax = (
    pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'])
          .resample(time_interval, on='Time')['ID']
          .count()
          .plot.line(ax=ax)
)

plt.show()

And the error is:

/scratch/sjn/anaconda/bin/python /scratch2/debate_tweets/temporal_analysis.py
Traceback (most recent call last):
  File "/scratch2/debate_tweets/temporal_analysis.py", line 18, in <module>
    pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'])
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
  File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
  File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
  File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
  File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.


Process finished with exit code 1

converted_timezone_tweets.csv look like this:

,Candidate,ID,Time,Username,Tweet
0,Clinton,788948653016842240,2016-10-19 23:43:11-04:00,Tamayo_castle,Hillary Clinton dresses as Christian Bale at the debate via /r/pics 
1,Clinton,788948666501464064,2016-10-19 23:43:14-04:00,ThinkCenter1968,"It's like I told my kids, a reason U don't want 2 vote 4 Hillary is U want the inheritance I'm leaving U, Right? They changed their minds!"
2,Clinton,788948673594097664,2016-10-19 23:43:16-04:00,21stCenRevolt,When hearing about Saudi Arabia murdering people for being gay. Hillary laughed with glee. She disgusting and disgraceful. #debatenight
3,Both,788948662881751040,2016-10-19 23:43:13-04:00,mikeywan,MEGYN IS A PAID HILLARY WHORE #TrumpPence2016 #TrumpTrain 
4,Both,788948675313696769,2016-10-19 23:43:16-04:00,erwoti,Can't wait to hear @realDonaldTrump call that Nasty Woman (Hillary Clinton) - Madam President  #debatenight #ChrisWallace
5,Clinton,788948671756955650,2016-10-19 23:43:15-04:00,isaac_urner,"The Clinton campaign already has  redirecting to their site. That's what a real campaign looks like.
#badhombres2016"

Same code works for valid_tweets.csv and creates a plot like below: valid_tweets.csv lines look like:

Candidate,ID,Time,Username,Tweet
Clinton,788948653016842240,2016-10-20 03:43:11+00:00,Tamayo_castle,Hillary Clinton dresses as Christian Bale at the debate via /r/pics
Clinton,788948666501464064,2016-10-20 03:43:14+00:00,ThinkCenter1968,"It's like I told my kids, a reason U don't want 2 vote 4 Hillary is U want the inheritance I'm leaving U, Right? They changed their minds!"
Clinton,788948673594097664,2016-10-20 03:43:16+00:00,21stCenRevolt,When hearing about Saudi Arabia murdering people for being gay. Hillary laughed with glee. She disgusting and disgraceful. #debatenight
Both,788948662881751040,2016-10-20 03:43:13+00:00,mikeywan,MEGYN IS A PAID HILLARY WHORE #TrumpPence2016 #TrumpTrain 
Both,788948675313696769,2016-10-20 03:43:16+00:00,erwoti,Can't wait to hear @realDonaldTrump call that Nasty Woman (Hillary Clinton) - Madam President  #debatenight #ChrisWallace
Clinton,788948671756955650,2016-10-20 03:43:15+00:00,isaac_urner,"The Clinton campaign already has redirecting to their site. That's what a real campaign looks like.
#badhombres2016"

Update: in my first file I have:

import pandas as pd
import matplotlib.pyplot as plt

#2016-10-20 03:43:11+00:00
tweets_df = pd.read_csv('valid_tweets.csv')

tweets_df['Time'] = pd.Index(pd.to_datetime(tweets_df['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern')

tweets_df.to_csv('converted_timezone_tweets.csv', index=False)

In my second file I have:

import pandas as pd
import matplotlib.pyplot as plt


time_interval = pd.offsets.Second(10)

fig, ax = plt.subplots(figsize=(6, 3.5))

ax = (
    pd.read_csv('converted_timezone_tweets.csv', engine='python', parse_dates=['Time'])
          .resample(time_interval, on='Time')['ID']
          .count()
          .plot.line(ax=ax)
)

plt.show()

After using the engine='python' as in one of the answers, I get this error:

/scratch/sjn/anaconda/bin/python /scratch2/debate_tweets/temporal_analysis.py
Traceback (most recent call last):
  File "/scratch2/debate_tweets/temporal_analysis.py", line 11, in <module>
    .resample(time_interval, on='Time')['ID']
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 4729, in resample
    base=base, key=on, level=level)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/resample.py", line 969, in resample
    return tg._get_resampler(obj, kind=kind)
  File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/resample.py", line 1091, in _get_resampler
    "but got an instance of %r" % type(ax).__name__)
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

Process finished with exit code 1

I did a vimdiff of the first 5 lines of each csv and this is what I get:

Let `df = pd.read_csv('converted_timezone_tweets.csv')`. Then assign `mask = pd.isnull(pd.to_datetime(df['Time'], errors='coerce'))`. What does `df.loc[mask, 'Time']` return? This should show us if there are any malformed date strings in the Time field of the CSV. — unutbu, Jan 23 '18 at 02:52
@unutbu running https://pastebin.com/WVvfxFYe I get https://pastebin.com/R7n6EX43 and running https://pastebin.com/Qq1Qs8rQ gives me https://paste.linux.community/view/b1e2a6ff please suggest possible fixes — Mona Jalal, Jan 24 '18 at 17:42

score 1 · Answer 1 · answered Jan 22 '18 at 20:27

It seems like the error is with using the C engine to parse the csv. I'm not knowledgeable enough to know why that might be, but a possible workaround to to force the df.read_csv() bit to use the python engine by passing the engine = 'python' argument. As per the Pandas documentation, pd.read_csv() defaults to using the C engine for speed. Given that your error is hinting at a problem with the C engine, that might be a good place to start. so, try pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'], engine = 'python') There was also something on GitHub hinting towards similar problems and fixes

your answer is partially correct because I was able to read and print the df however in this specific case I get this new error when I use engine='python' when using pd.read_csv method https://paste.linux.community/view/708d55f5 — Mona Jalal, Jan 22 '18 at 20:34

unutbu · Answer 2 · 2018-01-25T14:44:17.173

Per the comment, this code

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
mask = pd.isnull(pd.to_datetime(df1['Time'], errors='coerce'))
print(df1.loc[mask, 'Time'])

prints

9941      None
27457     None
27458     None
...

this implies there are a number of entries in converted_timezone_tweets.csv whose Time field is the string 'None'.

You might want to go back and investigate what these values were in your original CSV:

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
mask = pd.isnull(pd.to_datetime(df1['Time'], errors='coerce'))
tweets_df = pd.read_csv('valid_tweets.csv')
print(tweets_df.loc[mask, 'Time'])

If there is no Time data for these tweets perhaps the most sensible thing to do is throw them away since we can't classify what time interval they belong to. You could use df1 = df1.loc[mask, :] to remove the offending rows:

import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
df1['Time'] = pd.to_datetime(df1['Time'], errors='coerce')
mask = pd.notnull(df1['Time'])
df1 = df1.loc[mask, :]
df1 = df1.set_index('Time')
counts = df1.resample('10S')['ID'].count()

fig, ax = plt.subplots(figsize=(6, 3.5))
counts.plot.line(ax=ax)
plt.show()

To avoid parsing errors, we call pd.read_csv (above) without setting the parse_dates parameter. So pd.read_csv returns a DataFrame whose Time column contains date strings:

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
#    ID                       Time
# 0   5  2016-10-19 23:43:00-04:00
# 1   5  2016-10-19 23:43:05-04:00
# 2   5  2016-10-19 23:43:10-04:00
# 3   5  2016-10-19 23:43:15-04:00
# ...

We then use pd.to_datetime to parse the date strings into datetimes. pd.to_datetime parses the date strings by converting them to UTC while taking timezone offsets into account. The resulting datetimes are naive -- no timezone information is attached. This behavior is derived from the underlying NumPy datetime64[ns] data type used by Pandas to represent datetimes.

Therefore, to make the datetimes once again timezone-aware, you would need to call tz_localize/tz_convert again:

df1['Time'] = pd.Index(df1['Time']).tz_localize('UTC').tz_convert('US/Eastern')

But this also shows there was nothing gained by calling tz_convert the first time and storing the result in converted_timezone_tweets.csv the first time.

So a better solution (which does not require calling tz_convert after loading converted_timezone_tweets.csv) is to write converted_timezone_tweets.csv without the timezone offset. You can do that by dropping the timezone offset by calling tz_localize(None):

df1['Time'] = pd.Index(pd.to_datetime(df1['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern').tz_localize(None)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

N = 10
df = pd.DataFrame({'Time':pd.date_range('2016-10-20 03:43:00', periods=N, freq='5S'), 'ID':np.random.randint(N)})
df1 = df.copy()

df1['Time'] = pd.Index(pd.to_datetime(df1['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern').tz_localize(None)
df1.to_csv('converted_timezone_tweets.csv', index=False)

df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
df1['Time'] = pd.to_datetime(df1['Time'], errors='coerce')
mask = pd.notnull(df1['Time'])
df1 = df1.loc[mask, :]

df = df.set_index('Time')
df1 = df1.set_index('Time')
counts1 = df1.resample('10S')['ID'].count()
counts = df.resample('10S')['ID'].count()

fig, ax = plt.subplots(figsize=(6, 3.5), nrows=2)
counts.plot.line(ax=ax[0])
counts1.plot.line(ax=ax[1])
plt.show()

Note that it might be more appealing to store all time-related data in UTC rather than with respect to some other local timezone. That way, if you have many CSV files you do not have to keep track of which timezone the time data is relative to. From this point of view, it would be preferrable to keep valid_tweets.csv, drop converted_timezone_tweets.csv, and do the conversion to US/Eastern only when necessary:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('valid_tweets.csv')
df['Time'] = pd.to_datetime(df['Time'], errors='coerce')
mask = pd.notnull(df['Time'])
df = df.loc[mask, :]
df['Time'] = pd.Index(df['Time']).tz_localize('UTC').tz_convert('US/Eastern')

df = df.set_index('Time')
counts = df.resample('10S')['ID'].count()

fig, ax = plt.subplots(figsize=(6, 3.5))
counts.plot.line(ax=ax)
plt.show()

thanks a lot for looking into this. So, the code works except it doesn't use the converted times as in converted_timezone_tweets.csv and is using the same time as in valid_tweets.csv. Do you know why is that? https://imgur.com/a/mHlni — Mona Jalal, Jan 25 '18 at 00:07
like the Time items are all like 2016-10-19 21:58:13-04:00, and are 20, 21, 22, 23 ish not its UTC equivalent — Mona Jalal, Jan 25 '18 at 00:09
The problem arises because the `converted_timezone_tweets.csv` file saves the timezone offset. When the date strings are parsed, the date strings are converted to datetimes in UTC but with the timezone offset taken into account. If you want to load a CSV without having to convert from UTC to US/Eastern, then I think you would need to drop the timezone offsets when saving converted_timezone_tweets.csv. I've modified the post above to show what I mean. — unutbu, Jan 25 '18 at 14:25
so the last you have doesn't seem to work correctly in the sense of the time range it is showing on the time axis please see https://imgur.com/a/ElsPu it should show something like 20-23 or so not 19 and over 2-3 minutes — Mona Jalal, Jan 25 '18 at 20:07
You didn't explain how you obtained that result. It looks like the times have been converted twice so the tweets in hour 3 have been converted to hour 23, then hour 19. If you start with a CSV with times that start around `2016-10-20 03:43:11+00:00` then the last code block I posted will produce a DataFrame (and plot) with times that start around `2016-10-19 23:43:11-04:00`. (Make sure you start with `valid_tweets.csv`, not `converted_timezone_tweets.csv`.) — unutbu, Jan 25 '18 at 20:37
can you please have a look at this question (somewhat related) the same original date I made this one out of https://askubuntu.com/questions/1022089/searching-for-created-at-followed-in-the-next-line-by-retweeted-in-lots-of — Mona Jalal, Apr 05 '18 at 01:07

pd.read_csv fails after converting the timezone

2 Answers2