
I have a pandas DataFrame loaded from a CSV file, and its index consists of dates.

df = pd.read_csv('data.csv', index_col=0, parse_dates=True)  
df.index
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14',
               ...
               '2018-06-18', '2018-06-19', '2018-06-20', '2018-06-21',
               '2018-06-22', '2018-06-25', '2018-06-26', '2018-06-27',
               '2018-06-28', '2018-06-29'],
              dtype='datetime64[ns]', name='Date', length=2216, freq=None)

I need to calculate values from each row's positional index (0, 1, 2, ...), but df.index returns a DatetimeIndex. How can I get the raw positional index of each row as a Series?

Expect:

df.raw_index  # return a Series [0, 1, 2, 3, ...]


df['result'] = (df.raw_index + 1) ** 2  ## [1, 4, 9, 16, ...]

I can use pd.Series(range(0, df.shape[0])) to build a Series from a range, but I suspect that is not efficient.
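As a minimal sketch of the computation described above, the positions can be built with `np.arange(len(df))` and used directly in the arithmetic. The small DataFrame and its `value` column here are hypothetical stand-ins for the CSV data. A NumPy array is assigned to a column by position, so the DatetimeIndex plays no part in the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV data: a few dates as the index.
df = pd.DataFrame(
    {"value": [10.0, 11.0, 12.0, 13.0]},
    index=pd.DatetimeIndex(
        ["2010-01-01", "2010-01-04", "2010-01-05", "2010-01-06"], name="Date"
    ),
)

# Positions 0..n-1 as a NumPy array, independent of the DatetimeIndex.
pos = np.arange(len(df))

# Arrays are assigned positionally, so no index alignment takes place.
df["result"] = (pos + 1) ** 2
```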

Xaree Lee
  • if you reset the index you get a range index. `df = df.reset_index()` – anky Mar 20 '21 at 05:32
  • @anky Is it possible to create that Series efficiently, without allocating new memory for the indexes? I assume a pandas DataFrame keeps its own raw indexes (maybe) and could just return a pointer to that array/list/series? – Xaree Lee Mar 20 '21 at 05:38
  • I have not come across such a thing yet :) I think pandas is quite efficient in its own right, but depending on what you want to do next you can create an array: `np.arange(len(df))` or `np.arange(1,len(df)+1)**2` – anky Mar 20 '21 at 05:41

1 Answer


Thanks to @anky's comment and this answer, I compared the performance:

%timeit df.reset_index().index
549 µs ± 8.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.Series(range(0, df.shape[0]))
81 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.arange(df.shape[0])
3.15 µs ± 27 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.arange(len(df))
2.76 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.arange(len(df.index))
2.51 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


# df.index is very fast, but it returns the DatetimeIndex, not the positional indexes.
%timeit df.index
127 ns ± 0.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I'm still looking for a more efficient way to get the raw positional index Series for a datetime-indexed DataFrame.
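One candidate worth trying, offered as an untimed sketch rather than a measured result, is `pd.RangeIndex`. It stores only start, stop and step, so no array of positions exists until the values are materialized. The small DataFrame and its `value` column below are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical small frame standing in for the CSV data.
df = pd.DataFrame(
    {"value": [1.5, 2.5, 3.5]},
    index=pd.date_range("2010-01-01", periods=3, freq="D", name="Date"),
)

# RangeIndex keeps only start/stop/step; positions are not stored up front.
positions = pd.RangeIndex(len(df))

# Materialize to a NumPy array only when the arithmetic needs the values.
df["result"] = (positions.to_numpy() + 1) ** 2
```

Whether this beats `np.arange(len(df))` in practice would need the same `%timeit` comparison as above, since the array is still allocated once the values are used.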

Xaree Lee