
I am trying to run a data cleanup script in Python. I have a base class with a method called cleanData(). Depending on the dataset returned, there are a number of date fields, all of which end in _DT but can start with anything (such as SCHEDULED_START_DT, SERVICE_DISRUPTION_START_DT, etc.). This code will support hundreds of reports, so instead of overriding the object and method for each report, I would like to dynamically run a function on every field that ends in _DT, and only call the parent method when a report needs additional, unique cleanup. All this code does is convert a UTC epoch timestamp into a readable local-time value. Here are a data sample and the code I have:

sample data

ID         SCHEDULED_START_DT               SERVICE_DISRUPTION_START_DT
0          1597669200                       1597712400
1          1597667496                       None

code snippets

import datetime as dt
import pandas as pd

d = {'ID': [0, 1], 
     'SCHEDULED_START_DT': [1597669200, 1597667496], 
     'SERVICE_DISRUPTION_START_DT' : [1597712400, None]
    }
df = pd.DataFrame(data=d)
df['SCHEDULED_START_DT'] = df['SCHEDULED_START_DT'].apply(lambda x: dt.datetime.fromtimestamp(x) if pd.notnull(x) else x)
df['SERVICE_DISRUPTION_START_DT'] = df['SERVICE_DISRUPTION_START_DT'].apply(lambda x: dt.datetime.fromtimestamp(x) if pd.notnull(x) else x)

output from code

ID  SCHEDULED_START_DT          SERVICE_DISRUPTION_START_DT
0   2020-08-17 08:00:00         2020-08-17 20:00:00
1   2020-08-17 07:31:36         NaT

I think there is a way to dynamically apply the function on all fields ending in _DT without looping and logic constructs. I have seen some problems that are kind of like this, but I can't figure out how to do it.
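To make the intent concrete, the base-class structure described above might look something like the following sketch. Only cleanData() comes from the question; the class names, the example subclass, and the loop body (reusing the question's own per-value conversion) are illustrative assumptions:

```python
import datetime as dt
import pandas as pd

class BaseReport:
    """Shared cleanup; report classes only override when they need extra steps."""

    def cleanData(self, df):
        # Convert every column ending in _DT from epoch seconds to a datetime
        for name in df.columns:
            if name.endswith('_DT'):
                df[name] = df[name].apply(
                    lambda x: dt.datetime.fromtimestamp(x) if pd.notnull(x) else x)
        return df

class DisruptionReport(BaseReport):
    """Hypothetical report with extra cleanup of its own."""

    def cleanData(self, df):
        df = super().cleanData(df)  # shared _DT handling first
        # report-specific cleanup would go here
        return df

df = pd.DataFrame({'ID': [0, 1],
                   'SCHEDULED_START_DT': [1597669200, 1597667496],
                   'SERVICE_DISRUPTION_START_DT': [1597712400, None]})
print(DisruptionReport().cleanData(df))
```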

Thank you in advance for any help.

Pete

2 Answers


list(df.columns) will return the column names.

Loop over it and do the thing:

for name in df.columns:
    if name.endswith('_DT'):
        # your logic goes here, e.g. the conversion from the question:
        df[name] = df[name].apply(lambda x: dt.datetime.fromtimestamp(x) if pd.notnull(x) else x)
balderman
  • Thank you for the reply, I will try this and I am sure it will work. I am not an experienced Python programmer, so I thought there would be a more efficient way? Or does Python just do this type of loop under the covers? – thePetester Aug 17 '20 at 16:09
  • @thePetester if you use exclusively pandas functions, the numpy implementations do the loops in C, making them much faster – RichieV Aug 17 '20 at 16:35
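Building on the last comment, the loop over column names can stay while the per-value lambda is replaced by one vectorized pd.to_datetime call per column. A minimal sketch using the question's data (note this parses the epochs as naive UTC; converting to a local zone is a separate step):

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1],
                   'SCHEDULED_START_DT': [1597669200, 1597667496],
                   'SERVICE_DISRUPTION_START_DT': [1597712400, None]})

for name in df.columns:
    if name.endswith('_DT'):
        # One vectorized call per column; NaN/None become NaT
        df[name] = pd.to_datetime(df[name], unit='s', errors='coerce')

print(df)
```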

You can combine filter (to get all columns that end with '_DT') with update (to bring those changes back). In this dummy example I'll just add 20.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (3, 5)),
                  columns=['DT', 'foo', 'foo_DT', 'bar_DT', 'start_DT_end'])

df.update(df.filter(regex='_DT$')+20)

print(df)
   DT  foo  foo_DT  bar_DT  start_DT_end
0   9    3      23      21             6
1   9    4      26      27             7
2   5    7      26      26             4

In your case we can do away with the slow element-wise Series.apply in favor of DataFrame.apply(axis=0), which calls the function once per column Series instead of once per value. But we need to be careful, because pandas is inconsistent with the datetime module. To get your exact output, I need to deal with time zones and subtract the difference.

Also there is an issue with update when it tries to overwrite int columns with datetime64[ns]. So we'll instead concat the results back.

from datetime import datetime
import pandas as pd

my_date = datetime(2020, 2, 17)
# Seconds of disagreement between the local and UTC interpretations
offs = datetime.timestamp(my_date) - pd.to_datetime(my_date).timestamp()
# 18000.0

# Turn all the _DT columns into datetimes
df1 = (df.filter(regex='_DT$').apply(pd.to_datetime, errors='coerce', unit='s')
       - pd.Timedelta(offs, unit='s'))

# Join the columns we didn't modify with those we did
df = pd.concat([df[df.columns.difference(df1.columns)], df1], axis=1)

print(df)
#   ID  SCHEDULED_START_DT SERVICE_DISRUPTION_START_DT
#0   0 2020-08-17 08:00:00         2020-08-17 20:00:00
#1   1 2020-08-17 07:31:36                         NaT
ALollz
  • how do you implement OP's function with this filter? I tried to avoid looping by building a list comprehension with `.endswith`, but same as your answer, it raises `ValueError: The truth value of a Series is ambiguous.` – RichieV Aug 17 '20 at 16:19
  • I get the same error, here is my code: _df.update(df.filter(regex='_DT$').apply(lambda x : dt.datetime.fromtimestamp(x) if pd.notnull(x) else x))_ – thePetester Aug 17 '20 at 16:25
  • according to the [docs](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html), passing a df to `pd.to_datetime` will assume the columns are `year, month, date` or other components... – RichieV Aug 17 '20 at 16:33
  • @RichieV @thePetester it's updated with a few nuances of how to handle your exact case. Turns out `update` has issues, so `filter` + `concat` in that case. – ALollz Aug 17 '20 at 16:46
  • The only thing this does not work for is daylight saving time. If I pass in the value 1595514757, the timestamp returned is 2020-07-23 08:32:37 when it should be 2020-07-23 09:32:37 – thePetester Aug 17 '20 at 18:42
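The daylight-saving mismatch in the last comment comes from subtracting a single fixed offset; converting through timezone-aware values applies the correct offset per date. A sketch of that approach, assuming the local zone is America/Chicago (the zone matches the outputs shown but is an assumption; substitute your own):

```python
import pandas as pd

s = pd.Series([1595514757, 1597669200, None], name='SCHEDULED_START_DT')

# Parse as UTC, convert to a real zone (DST-aware), then drop the tz info
local = (pd.to_datetime(s, unit='s', utc=True, errors='coerce')
         .dt.tz_convert('America/Chicago')
         .dt.tz_localize(None))
print(local)
```

The July value lands in CDT (UTC-5) and the February-style fixed offset is never used, so both summer and winter timestamps come out right.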