I'm processing over 1 million patent applications and need to fix the dates, among other cleanup tasks I'll work on later. I'm reading the file into a Pandas DataFrame, then running the following function:
import numpy as np
import pandas as pd

def date_change():
    global apps  # reassigned below, since join/drop return new DataFrames
    new_dates = {'m/y': []}
    for i, row in apps.iterrows():
        try:
            # Dates look like 'm/d/yy'; keep the month and prefix the year with '19'.
            d = row['date'].rsplit('/')
            new_dates['m/y'].append('{}/19{}'.format(d[0], d[2]))
        except Exception as e:
            # d may be unbound if the split itself failed, so only print i, e, and row
            print('{} {}\n{}'.format(i, e, row))
            new_dates['m/y'].append(np.nan)
    # join() and drop() don't modify in place; assign the results back
    apps = apps.join(pd.DataFrame(new_dates))
    apps = apps.drop('date', axis=1)
Is there a quicker way to do this? Is Pandas even the right library for a dataset this large? I've been told PySpark is good for big data, but how much would it actually improve the speed?
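For comparison, here's a rough vectorized sketch I'm considering instead of iterrows() (assuming apps['date'] holds 'm/d/yy' strings; the .str accessor yields NaN for non-string values, so bad rows become NaN without needing a try/except):

    # Split the whole column at once rather than looping row by row
    parts = apps['date'].str.split('/')
    apps['m/y'] = parts.str[0] + '/19' + parts.str[2]
    apps = apps.drop('date', axis=1)

My understanding is that this avoids the per-row Python overhead of iterrows(), though the .str methods still operate on object arrays under the hood, so I'm not sure how close it gets to truly "fast".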