Updating Each Row with Minutes Since First Row

Question

I have a file with a million tweets. The first tweet occurred 2013-04-15 20:17:18 UTC. I want to add a minsSince column that holds, for each tweet row, the number of minutes elapsed since that first tweet.

I have found help with datetime here, and converting time here, but when I put the two together I don't get the right times. It could be something with the UTC string at the end of each published_at value.

The error it throws is:

tweets['minsSince'] = tweets.apply(timesince,axis=1)
...
TypeError: ('string indices must be integers, not str', u'occurred at index 0')

Thanks for any help.

#Import stuff
from datetime import datetime
import time
import pandas as pd
from pandas import DataFrame

#Read the csv file
tweets = pd.read_csv('BostonTWEETS.csv')
tweets.head()

#The first tweet's published_at time
starttime = datetime (2013, 04, 15, 20, 17, 18)

#Run through the document and calculate the minutes since the first tweet
def timesince(row):
    minsSince = int()
    tweetTime = row['published_at']
    ts = time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(tweetTime['published_at'], '%Y-%m-%d %H:%M:%S %UTC'))
    timediff = (tweetTime - starttime)
    minsSince.append("timediff")
    return ",".join(minsSince)

tweets['minsSince'] = tweets.apply(timesince,axis=1)

df = DataFrame(tweets)

print(df)

Sample csv file of first 5 rows.

I have provided a sample csv file in the description above. Thanks – user3230654 Oct 04 '16 at 03:58

Ajay Pal · Accepted Answer · 2016-10-06T02:29:53.567

#Import stuff
from datetime import datetime
import time
import pandas as pd
from pandas import DataFrame

#Read the csv file
tweets = pd.read_csv('sample.csv')
tweets.head()

#The first tweet's published_at time
starttime = tweets.published_at.values[0]
starttime = datetime.strptime(starttime, '%Y-%m-%d %H:%M:%S UTC')

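#Run through the document and calculate the minutes since the first tweet
def timesince(row):
    #row is a single published_at string, because map() passes one value at a time
    ts = datetime.strptime(row, '%Y-%m-%d %H:%M:%S UTC')
    timediff = ts - starttime
    #divmod returns (whole minutes, leftover seconds); keep only the minutes
    timediff = divmod(timediff.days * 86400 + timediff.seconds, 60)
    return timediff[0]

#Map the function over the published_at column to build the new column
tweets['minSince'] = tweets.published_at.map(timesince)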

df = DataFrame(tweets)

print(df)

I hope this is what you are looking for.
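As an aside, the same result can probably be had without a per-row function by letting pandas parse the whole column at once. The following is only a sketch under assumptions: the small DataFrame is made-up sample data standing in for the CSV, and the column name minsSince is taken from the question rather than from the code above.

```python
import pandas as pd

# Hypothetical sample data standing in for the published_at column of the CSV.
tweets = pd.DataFrame({
    'published_at': [
        '2013-04-15 20:17:18 UTC',
        '2013-04-15 20:18:30 UTC',
        '2013-04-15 21:17:18 UTC',
        '2013-04-16 20:17:18 UTC',
    ]
})

# Parse the whole column at once; the trailing " UTC" is matched as literal text.
published = pd.to_datetime(tweets['published_at'], format='%Y-%m-%d %H:%M:%S UTC')

# Subtracting the first timestamp gives a Series of Timedelta values.
elapsed = published - published.iloc[0]

# Whole minutes since the first row (total seconds floor-divided by 60).
tweets['minsSince'] = (elapsed.dt.total_seconds() // 60).astype(int)

print(tweets)
```

On a million rows this should be noticeably faster than mapping a Python function over the column, since the parsing and subtraction happen in bulk.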

edited Oct 06 '16 at 02:29

answered Oct 04 '16 at 03:20


I get back the error `AttributeError: 'DataFrame' object has no attribute 'minsSince'` – user3230654 Oct 04 '16 at 03:39
Your csv does not have minsSince as a header, so use tweets.published_at instead. That is the column you are working on. – Ajay Pal Oct 04 '16 at 04:12
Thanks, this works perfectly. I really appreciate the help. – user3230654 Oct 13 '16 at 02:36
I have a follow-up question. I now want the answer in seconds since the first tweet instead of minutes. I tried `timediff = (timediff.days * 86400 + timediff.seconds)` but I get an `int object has no attribute` error. – user3230654 Nov 29 '16 at 03:22
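On the seconds follow-up: the full traceback is not shown, so this is a guess, but the error is consistent with the function still ending in `return timediff[0]` after timediff has become a plain int, which cannot be indexed. A minimal sketch of a seconds version, where the name secs_since and the hard-coded start time are assumptions made for illustration:

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S UTC'

# Assumed start time: the first tweet's timestamp given in the question.
starttime = datetime.strptime('2013-04-15 20:17:18 UTC', FMT)

def secs_since(value):
    # value is one published_at string, as passed in by Series.map
    ts = datetime.strptime(value, FMT)
    timediff = ts - starttime
    # total_seconds() already folds days into seconds; return it directly,
    # with no [0] index, because the result is a number and not a tuple.
    return int(timediff.total_seconds())

print(secs_since('2013-04-15 20:18:30 UTC'))
```

It would be applied the same way as the minutes version, for example `tweets['secsSince'] = tweets.published_at.map(secs_since)`, where secsSince is again a hypothetical column name.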
