1

I've looked through a bunch of similar questions, but I cannot figure out how to actually apply the principles to my own case. I'm therefore trying to figure out a simple example I can work from - basically I need the idiots' guide before I can look at more complex examples

Consider a dataframe that contains a list of names and times, and a known start time. I then want to update the dataframe with the finish time, which is calculated from starttime + Time

import pandas as pd
import datetime

df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
                  "Time":[3, 6,1, 7, 23,3,4]})
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')

I know that for each case I can calculate the finish time using

finishtime = starttine + datetime.datetime.timedelta(minutes = df.iloc[0,1])

what I can't figure out is how to use this while iterating over the df rows and updating a third column in the dataframe with the output.

I tried

df["FinishTime"] = np.nan

for row in df.itertuples():
    df.at[row,"FinishTime"] = starttine + datetime.datetime.timedelta(minutes = row.Time)

but it gave a lot of errors I couldn't unravel. How am I meant to do this?

I am aware that the advice to iterating over a dataframe is don't - I'm not committed to iterating, I just need some way to calculate that final column and add it to the dataframe. My real data is about 200k lines.

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Esme_
  • 1,360
  • 3
  • 18
  • 30

1 Answers1

2

Use pd.to_timedelta()

import datetime
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
    "Time":[3, 6,1, 7, 23,3,4]})
df.Time = pd.to_timedelta(df.Time, unit='m')
# df = df.assign(FinishTime = df.Time + starttime)
df['FinishTime'] = df.Time + starttime # as pointed out by Trenton McKinney, .assign() is only one way to create new columns
# creating with df['new_col'] has the benefit of not having to copy the full df

print(df)

Output

      Name     Time          FinishTime
0     Kate 00:03:00 2020-02-04 00:03:00
1    Sarah 00:06:00 2020-02-04 00:06:00
2  Isabell 00:01:00 2020-02-04 00:01:00
3   Connie 00:07:00 2020-02-04 00:07:00
4     Elsa 00:23:00 2020-02-04 00:23:00
5     Anne 00:03:00 2020-02-04 00:03:00
6      Lin 00:04:00 2020-02-04 00:04:00

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html

Avoid looping in pandas at all cost

Maybe not at all cost, but pandas takes advantage of C implementations to improve performance by several orders of magnitude. There are many (many) common functions already implemented for our convenience.

Here is a great stackoverflow conversation about this very topic.

RichieV
  • 5,103
  • 2
  • 11
  • 24
  • Thanks! @RichieV Since I need that time information for other calcualtions, is it best to duplicate that column, convert one, calculate the FinishTime and then drop the unneeded column? Or is there a better option? I suspect I duplicate data way more than I need to and trying to improve my coding practices. – Esme_ Aug 02 '20 at 03:13
  • There are cases when having a duplicate column will increase already large memory usage and avoiding duplicates is the best option. But it is ok to create new columns as needed when you are starting to explore your data and still don't know for sure which form and shape will be of more use. Try to make it work first, then when you dominate it you can start optimizing. – RichieV Aug 02 '20 at 03:19