I'm going through over 1 million patent applications and have to fix the dates, in addition to other things that I will work on later. I'm reading the file into a Pandas data frame, then running the following function:

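import numpy as np
import pandas as pd

def date_change(apps):
        # Build the reformatted dates row by row (this loop is the slow part).
        new_dates = {'m/y': []}
        for i, row in apps.iterrows():
                try:
                        d = row['date'].split('/')
                        new_dates['m/y'].append('{}/19{}'.format(d[0], d[2]))
                except Exception as e:
                        # Log the offending row and keep the column aligned.
                        print('{}   {}\n{}'.format(i, e, row))
                        new_dates['m/y'].append(np.nan)
        # join() and drop() return new frames, so the result has to be kept.
        apps = apps.join(pd.DataFrame(new_dates, index=apps.index))
        return apps.drop(columns='date')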

Is there a quicker way of executing this? Is Pandas even the correct library to be using with a dataset this large? I've been told PySpark is good for big data, but how much will it improve the speed?

Táwros

1 Answer

It looks like you are storing the dates as strings rather than datetime objects. I'd suggest something like:

df['date'] = pd.to_datetime(df['date'])

You don't need to iterate at all, since that function operates on the whole column. After that, you might want to check the following answer, which uses dt.strftime to format the column appropriately.
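As a rough sketch of how the two steps fit together (the sample values, the `formatted` column name, and the `%m/%d/%y` input format are assumptions, since the actual data was not shown):

```python
import pandas as pd

# Hypothetical sample data; the real frame would come from the patent file.
df = pd.DataFrame({'date': ['12/5/75', '3/17/88', 'not a date']})

# Parse the whole column in one call; values that do not match become NaT.
parsed = pd.to_datetime(df['date'], format='%m/%d/%y', errors='coerce')

# Format back to strings with a four-digit year.
df['formatted'] = parsed.dt.strftime('%m/%d/%Y')
print(df)
```

Note that `strftime` zero-pads the month and day, and that `%y` has to guess the century for two-digit years, so it is worth spot-checking the parsed values.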

If you could show input and expected output, I could add the full solution here.

Besides, 1 million rows should typically be manageable for pandas (depending on the number of columns, of course).

Quickbeam2k1
  • Never used to_datetime before, thanks for bringing it to my attention! Input: 2/21/44, Output: 2/21/1944. For this there are only 6 columns, but that's good to know. – Táwros Dec 15 '19 at 07:24
  • Update: Worked well, but it interpreted a bunch of the values as being from the 2000s instead of the 1900s (all data is from the 20th century). – Táwros Dec 15 '19 at 07:42
  • You might want to check [this](https://stackoverflow.com/questions/32888124/pandas-out-of-bounds-nanosecond-timestamp-after-offset-rollforward-plus-adding-a), but I'm unsure whether it will help in a straightforward way. – Quickbeam2k1 Dec 15 '19 at 20:50
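
On the century problem raised in the comments: `%y` follows the usual convention of mapping two-digit years 69-99 to 19xx and 00-68 to 20xx, which would explain why a value like 2/21/44 lands in 2044. Two possible workarounds are sketched below, assuming every date really is from 1900-1999; the sample values and the `as_text` / `as_datetime` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; all years are assumed to fall in 1900-1999.
df = pd.DataFrame({'date': ['2/21/44', '11/3/75', None]})

# Option 1: stay with strings and prefix the two-digit year with "19".
df['as_text'] = df['date'].str.replace(r'/(\d{2})$', r'/19\1', regex=True)

# Option 2: parse, then shift anything that landed in the 2000s back a century.
parsed = pd.to_datetime(df['date'], format='%m/%d/%y', errors='coerce')
in_2000s = parsed.dt.year >= 2000
parsed.loc[in_2000s] = parsed.loc[in_2000s] - pd.DateOffset(years=100)
df['as_datetime'] = parsed
print(df)
```

Option 1 reproduces the 2/21/1944 output format from the comment above without any parsing; Option 2 is more useful if real datetime values are needed later.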