1

Problem: df.assign of Pandas creates NaN values when it's not supposed to when working with dates, and I want to know why.

I have a pandas dataframe of tweets. Here are the columns:

id_str            object
coordinates       object
created_at        object   ***
text              object
user              object
favorite_count     int64
retweet_count      int64
username          object
clean_text        object

The time of the tweet is in the column created_at. Even though it's an object, it's formatted properly, like Mon Oct 16 23:58:55 +0000 2017.

Now, I want to make a new column created_date that will only contain the month, day, year. This ended up being my solution, inspired by this question and this question.

df = df.assign(created_date = pd.Series([x.date() for x in pd.to_datetime(df['created_at'])]))
# Bonus question: how effective is this code in runtime or memory? If not, why?

The pd.Series constructor is there as I thought it was necessary, but I'm leaving it here in case that is what's causing the problems.

I later found out that 2764 out of the 29984 tweets had NaN values instead of the correct output (e.g. 2017-10-16 of dtype 'O'), and I thought that the Series constructor was to blame, but apparently not; the problem was actually in the df.assign.

I used the Counter from collections for checking.

I first checked if the Series I created was correct.

# pd.Series([x.date() for x in pd.to_datetime(df['created_at'])])
Counter({datetime.date(2017, 10, 9): 8165,
        datetime.date(2017, 10, 10): 5898,
        datetime.date(2017, 10, 11): 3104,
        datetime.date(2017, 10, 12): 2067,
        datetime.date(2017, 10, 13): 1647,
        datetime.date(2017, 10, 14): 2750,
        datetime.date(2017, 10, 15): 2778,
        datetime.date(2017, 10, 16): 3575})

Nothing wrong there, the total number of tweets is correct, too. So, I rechecked the newly assigned column, and there's the problem.

# Counter(df['created_date'])
Counter({datetime.date(2017, 10, 16): 3240,
         datetime.date(2017, 10, 15): 2413,
         datetime.date(2017, 10, 14): 2431,
         datetime.date(2017, 10, 13): 1369,
         datetime.date(2017, 10, 12): 1680,
         datetime.date(2017, 10, 11): 2736,
         datetime.date(2017, 10, 10): 5409,
         datetime.date(2017, 10, 9): 7942,
         nan: 2764})

I've fixed the problem by removing the pd.Series constructor, but now I want to know why this is happening, since I can't find anything similar. I apologize in advance if the only problem is that I wasn't thinking hard enough.

dTanMan
  • 137
  • 7
  • That date format is insane. Did you explicitly specify the format or let it get inferred? I wouldn't be surprised if some ambiguity in the format created the problem. – Paul Nov 23 '17 at 04:21
  • Also, I suspect you know this, but calling Series on a list comprehension over a series is not efficient. Use the `.dt` attribute to access the series datetime functions if necessary (though you can fix this particular problem earlier in the pipeline). – Paul Nov 23 '17 at 04:24

0 Answers0