Problem: df.assign
of Pandas
creates NaN
values when it's not supposed to when working with dates, and I want to know why.
I have a pandas dataframe of tweets. Here are the columns:
id_str object
coordinates object
created_at object ***
text object
user object
favorite_count int64
retweet_count int64
username object
clean_text object
The time of the tweet is in the column created_at
. Even though it's an object
, it's formatted properly, like Mon Oct 16 23:58:55 +0000 2017
.
Now, I want to make a new column created_date
that will only contain the month, day, year. This ended up being my solution, inspired by this question and this question.
df = df.assign(created_date = pd.Series([x.date() for x in pd.to_datetime(df['created_at'])]))
# Bonus question: how effective is this code in runtime or memory? If not, why?
The pd.Series
constructor is there as I thought it was necessary, but I'm leaving it here in case that is what's causing the problems.
I later found out that 2764 out of the 29984 tweets had NaN
values instead of the correct output (e.g. 2017-10-16
of dtype 'O'
), and I thought that the Series
constructor was to blame, but apparently not; the problem was actually in the df.assign
.
I used the Counter
from collections
for checking.
I first checked if the Series
I created was correct.
# pd.Series([x.date() for x in pd.to_datetime(df['created_at'])])
Counter({datetime.date(2017, 10, 9): 8165,
datetime.date(2017, 10, 10): 5898,
datetime.date(2017, 10, 11): 3104,
datetime.date(2017, 10, 12): 2067,
datetime.date(2017, 10, 13): 1647,
datetime.date(2017, 10, 14): 2750,
datetime.date(2017, 10, 15): 2778,
datetime.date(2017, 10, 16): 3575})
Nothing wrong there, the total number of tweets is correct, too. So, I rechecked the newly assigned column, and there's the problem.
# Counter(df['created_date'])
Counter({datetime.date(2017, 10, 16): 3240,
datetime.date(2017, 10, 15): 2413,
datetime.date(2017, 10, 14): 2431,
datetime.date(2017, 10, 13): 1369,
datetime.date(2017, 10, 12): 1680,
datetime.date(2017, 10, 11): 2736,
datetime.date(2017, 10, 10): 5409,
datetime.date(2017, 10, 9): 7942,
nan: 2764})
I've fixed the problem by removing the pd.Series
constructor, but now I want to know why this is happening, since I can't find anything similar. I apologize in advance if the only problem is that I wasn't thinking hard enough.