
RE: Add missing dates to pandas dataframe (previously asked question)

import pandas as pd  
import numpy as np

idx = pd.date_range('09-01-2013', '09-30-2013')  

df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])

df.index = pd.DatetimeIndex(df.index)  # question (1)

df = df.reindex(idx, fill_value=np.nan)  
print(df)

In the above script, what does the command marked as question (1) do? If you leave this command out of the script, the df is still reindexed, but the data from the original df is not retained. Since the DatetimeIndex command makes no reference to the df data, why is the data from the starting df lost?

Dick Eshelman

1 Answer


Short answer: df.index = pd.DatetimeIndex(df.index) converts the string index of df to a DatetimeIndex.


You have to distinguish between different types of indexes. In

df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])

you have an index containing strings. When using

df.index = pd.DatetimeIndex(df.index)

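you convert this standard index of strings to an index of datetimes (a DatetimeIndex). The values in these two types of index are therefore completely different: the string "09-02-2013" and the Timestamp for 2 September 2013 denote the same day, but they are not the same index label.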
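To make the difference concrete, here is a small sketch that looks at one label from each index. The variable names (labels, string_index, datetime_index) are only for illustration and are not part of the original script.

```python
import pandas as pd

labels = ["09-02-2013", "09-03-2013", "09-06-2013", "09-07-2013"]
df = pd.DataFrame({"Events": [2, 10, 5, 1]}, index=labels)

string_index = df.index                      # labels are plain Python strings
datetime_index = pd.DatetimeIndex(df.index)  # labels parsed into Timestamps

print(type(string_index[0]))    # a str
print(type(datetime_index[0]))  # a pandas Timestamp
print(datetime_index[0])        # parsed month-first: 2 September 2013
```

The column data is untouched by the conversion; only the labels attached to the rows change type.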

Now you reindex with

idx = pd.date_range('09-01-2013', '09-30-2013')
df = df.reindex(idx)

where idx is also an index with datetimes (a DatetimeIndex). If you reindex the original df, which has a string index, no index values match, so none of the original column values are retained and every row is filled with NaN. If you reindex the second df (after converting its index to a DatetimeIndex), the four dates match entries in idx, so the column values at those index labels are retained.
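Both cases can be compared side by side. This is a minimal sketch built from the script in the question; the names lost, converted and kept are made up for the comparison.

```python
import pandas as pd

idx = pd.date_range("09-01-2013", "09-30-2013")
labels = ["09-02-2013", "09-03-2013", "09-06-2013", "09-07-2013"]
df = pd.DataFrame({"Events": [2, 10, 5, 1]}, index=labels)

# Case 1: string index. No label matches a Timestamp in idx,
# so every row of the result is NaN.
lost = df.reindex(idx)

# Case 2: convert the index first. Four labels now match idx,
# so those four values are carried over.
converted = df.copy()
converted.index = pd.DatetimeIndex(converted.index)
kept = converted.reindex(idx)

print(lost["Events"].notna().sum())  # 0
print(kept["Events"].notna().sum())  # 4
```

In both cases the result has 30 rows, one per day in idx; the only difference is whether the original values find a matching label.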

See also http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.reindex.html

joris