I am working on a machine learning (value prediction) task, and one preprocessing step is taking a very long time. I have a CSV file with around 640,000 rows, and I need to subtract the dates of consecutive rows to calculate the time duration between them. The CSV file looks as attached. For example, 2011-08-17 to 2011-08-19 is 2 days, and I would like to write 2 to the "duration" column. I've used Python's datetime to do this, and it is very slow.
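To make the per-row work concrete, this is essentially what I compute for each pair of consecutive dates (using the example above):

from datetime import datetime

# Example from above: 2011-08-17 to 2011-08-19 is 2 days
diff = datetime.strptime('2011-08-19', '%Y-%m-%d') - datetime.strptime('2011-08-17', '%Y-%m-%d')
print(diff.days)  # 2

My current code: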
import pandas as pd
from datetime import datetime

data = pd.read_csv(f'{proj_dir}/raw data/measures.csv', encoding="cp1252")
file = data[['ID', 'date', 'value1', 'value2', 'duration']]

def time_subtraction(date, prev_date):
    # Parse both date strings and return the difference in whole days
    diff = datetime.strptime(date, '%Y-%m-%d') - datetime.strptime(prev_date, '%Y-%m-%d')
    return diff.days

def calculate_time_duration(dataframe, set_0_indices):
    for i in range(dataframe.shape[0]):
        # For each patient, set the duration of the first measurement to 0
        if i in set_0_indices.values:
            dataframe.iloc[i, 4] = 0  # beginning of this patient's records
        else:
            # Otherwise: days elapsed since the previous row's date
            dataframe.iloc[i, 4] = time_subtraction(date=dataframe.iloc[i, 1],
                                                    prev_date=dataframe.iloc[i - 1, 1])
    return dataframe

# I am running on Google Colab. This line takes very long.
result = calculate_time_duration(dataframe=file, set_0_indices=set_time_0_indices)
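One direction I have been wondering about, but have not tested, is replacing the loop with a vectorized pandas version along these lines (a rough sketch; it assumes the rows are already sorted by patient ID and then by date, so the first row of each ID is that patient's first measurement):

import pandas as pd

# Untested sketch: compute per-patient day differences without a Python loop
file['date'] = pd.to_datetime(file['date'], format='%Y-%m-%d')
file['duration'] = (
    file.groupby('ID')['date']   # restart the difference at each patient's first row
        .diff()                  # NaT for each patient's first measurement
        .dt.days
        .fillna(0)               # first measurement of each patient -> 0
        .astype(int)
)

I am not sure whether this is actually equivalent to my loop, or whether it is the right way to think about the problem.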
I wonder how I can accelerate this process. Does using a GPU help? I have access to a remote GPU, but I don't know whether GPUs help with data preprocessing. Also, in what scenarios do GPUs actually make things faster? Thanks in advance!