I'm a data science student working with two pandas DataFrames in Python, each indexed by dates (both generated by reading CSV files with pandas.read_csv("filepath", index_col="DATE", parse_dates=True, dayfirst=True)). I want to modify these DataFrames so that each one contains only the rows whose index value appears in both of them; that way, I can compare them directly to look for correlations in the data.
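For reference, here's a minimal stand-in for my loading step (the CSV contents below are placeholders for my real files, which have many more rows and columns, but the read_csv call is the same):

```python
import io
import pandas as pd

# Placeholder CSV contents standing in for my real files.
csv1 = io.StringIO("DATE,price\n31/12/2000,1.0\n31/12/2001,2.0\n")
csv2 = io.StringIO("DATE,rate\n31/12/1999,0.5\n31/12/2000,0.7\n")

# Same call I use on the real files: dates become the index,
# parsed day-first ("31/12/2000" -> 2000-12-31).
df1 = pd.read_csv(csv1, index_col="DATE", parse_dates=True, dayfirst=True)
df2 = pd.read_csv(csv2, index_col="DATE", parse_dates=True, dayfirst=True)
```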
I've spent the last few hours searching documentation and SO for ways to do this, and at the moment, I've got the following code:
common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
print(df1.index.values)
print(df2.index.values)
df1 = df1.take(common_dates)
df2 = df2.take(common_dates)
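Here's a small self-contained version showing what I expected this to do, with toy frames standing in for my data (the values are made up, but the index setup matches mine):

```python
import numpy as np
import pandas as pd

# Toy frames standing in for my real data; both share two dates.
df1 = pd.DataFrame({"a": [1, 2, 3]},
                   index=pd.to_datetime(["2000-12-31", "2001-12-31", "2002-12-31"]))
df2 = pd.DataFrame({"b": [4, 5]},
                   index=pd.to_datetime(["2001-12-31", "2002-12-31"]))

# This part works: common_dates ends up holding the two shared
# numpy.datetime64 labels.
common_dates = list(set(df1.index.values).intersection(df2.index.values))
# Passing common_dates to take() is where I hit the error described below.
```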
However, this gives me an index-out-of-bounds error, even though common_dates is built from the items in the index.values arrays. Looking at the output of the print() statements I added while troubleshooting, I see the following for common_dates:
[numpy.datetime64('2000-12-31T00:00:00.000000000'), numpy.datetime64('2001-12-31T00:00:00.000000000'), numpy.datetime64('2004-12-31T00:00:00.000000000'), numpy.datetime64('2003-12-31T00:00:00.000000000'), #and more values
And the following for df1.index.values:
['2000-12-31T00:00:00.000000000' '2001-12-31T00:00:00.000000000'
'2002-12-31T00:00:00.000000000' '2003-12-31T00:00:00.000000000' #and more values
The values for df2.index.values look similar, though they span a longer range of dates:
['1947-12-31T00:00:00.000000000' '1948-12-31T00:00:00.000000000'
#lots of values
'1997-12-31T00:00:00.000000000' '1998-12-31T00:00:00.000000000'
'1999-12-31T00:00:00.000000000' '2000-12-31T00:00:00.000000000'
'2001-12-31T00:00:00.000000000' '2002-12-31T00:00:00.000000000'
#more values
As noted above, calling take() with these values gives an "indices out of bounds" error. Since it looked like there might be some sort of type mismatch, I tried converting common_dates to strings with list(map(str, common_dates)), but that gives an "invalid literal for int() with base 10: '2000-12-31T00:00:00.000000000'" error instead; I've similarly tried converting the values to int and to numpy.datetime64, but both of those give the same "index out of bounds" error as before.
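One thing I noticed while poking at the type mismatch (this is just a guess at what's going on): coercing one of these numpy.datetime64 values to int yields a huge nanoseconds-since-epoch number, so if take() is somehow interpreting my date labels as integer row positions, that would line up with the out-of-bounds error:

```python
import numpy as np

# One of the values from common_dates.
d = np.datetime64("2000-12-31T00:00:00.000000000")
as_int = int(d)  # nanoseconds since the Unix epoch
print(as_int)    # 978220800000000000 -- far larger than any row position
```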
I've also tried an alternate approach using df1.iterrows():
droplist = []
for date, row in df1.iterrows():
    if date not in common_dates:
        droplist.append(date)
df1 = df1.drop(droplist)
I also tried a version of this that compares each row's date directly to the values in df2.index.values. Both versions simply drop every row from the table, rather than only the non-matching ones.
What am I doing wrong here? Am I simply taking the wrong approach, or is there something I'm missing?