1

So, I'm a data science student working with some data in Python Pandas, and I have two dataframes whose indices are dates (each generated by reading CSV files with pandas.readcsv("filepath", index_col="DATE", parse_dates=True, dayfirst=True)). What I want to do is to modify these dataframes so that they each only contain rows whose index value is shared by both of them - that way, I can directly compare them to look for correlations in the data.

I've spent the last few hours searching documentation and SO for ways to do this, and at the moment, I've got the following code:

common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
print(normalized_house_index_data.index.values)
df1= df1.take(common_dates)
df2= df2.take(common_dates)

However, this is giving me an index out of bounds error, even though common_dates should be constructed from the items in the index.values array. When I look at the output of the print() statements I added in as part of my troubleshooting, I see the following for common_dates:

[numpy.datetime64('2000-12-31T00:00:00.000000000'), numpy.datetime64('2001-12-31T00:00:00.000000000'), numpy.datetime64('2004-12-31T00:00:00.000000000'), numpy.datetime64('2003-12-31T00:00:00.000000000'), #and more values

And the following for df1.index.values:

['2000-12-31T00:00:00.000000000' '2001-12-31T00:00:00.000000000'
 '2002-12-31T00:00:00.000000000' '2003-12-31T00:00:00.000000000' #and more values

The values for df2.index.values look similar to df1.

['1947-12-31T00:00:00.000000000' '1948-12-31T00:00:00.000000000'
#lots of values
 '1997-12-31T00:00:00.000000000' '1998-12-31T00:00:00.000000000'
 '1999-12-31T00:00:00.000000000' '2000-12-31T00:00:00.000000000'
 '2001-12-31T00:00:00.000000000' '2002-12-31T00:00:00.000000000'
#more values

This gives an "indices out of bounds" error. I've tried using list(map(str, common_dates) to convert common_dates to strings, since it looks like there might be some sort of type mismatch, but this gives an "invalid literal for int() with base 10: '2000-12-31T00:00:00.000000000'" error instead; I've tried to similarly convert them to int or numpy.datetime64, but these both give "index out of bounds" errors.

I've also tried an alternate approach using df1.iterrows():

droplist = []
for date, value in df1.iterrows():
    if date not in common_dates:
        droplist.append(date)
df1= df1.drop(droplist)

I also tried a version of this comparing each row's date directly to the values of df2.index.values. Both of these simply result in all rows being dropped from the table, rather than only the non-matching rows being dropped.

What am I doing wrong, here? Am I simply taking the wrong approach to this, or is there something I'm missing?

nick012000
  • 159
  • 2
  • 12
  • I think you need `df = pd.merge(df1, df2, left_index=True, right_index=True)` and then `print (df.index)` are common dates. – jezrael Mar 26 '20 at 06:11
  • I don't want to merge the dataframes; I want to keep them separate so that I can compare them more easily later on. – nick012000 Mar 26 '20 at 06:12
  • hmm, so then problem is in question missing [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) – jezrael Mar 26 '20 at 06:13
  • @jezrael How so? I've given the means by which I created my dataframes, my code manipulating them, and my debugging outputs. What else do you want? – nick012000 Mar 26 '20 at 06:16
  • I cannot run code for verify `index out of bounds error`, so how I can test your solution if correct? – jezrael Mar 26 '20 at 06:17
  • I see no data, so impossible test. – jezrael Mar 26 '20 at 06:18
  • Data isn't relevant; the only things I'm working with are the indices? And I gave you the outputs of my debugging that includes their values, as well as the function that I used to generate them? – nick012000 Mar 26 '20 at 06:19
  • OK, I think data for test `print (df2.index)` values, there are only `print (df1.index)` values. From them I can create `common datetimes` and test it. – jezrael Mar 26 '20 at 06:21
  • I need 5,6 values for testing, [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – jezrael Mar 26 '20 at 06:22
  • Do you want me to post chunks of the CSVs or links to the open data site I got them from to begin with? I don't know how to build an identical DataFrame from scratch because I don't know what types it used to create the dates in the first place. I just invoked pandas.readcsv() with the parameters given and used the dataframes it returned. – nick012000 Mar 26 '20 at 06:37
  • Testing, now it is OK for me. – jezrael Mar 26 '20 at 06:38

1 Answers1

1

I think here is problem with take, for me working DataFrame.loc for selecting by common indices:

a = pd.DatetimeIndex(['2000-12-31T00:00:00.000000000',
                      '2001-12-31T00:00:00.000000000',
                      '2002-12-31T00:00:00.000000000', 
                      '2003-12-31T00:00:00.000000000'])

b = pd.DatetimeIndex(['1947-12-31T00:00:00.000000000',
                      '1948-12-31T00:00:00.000000000',
                      '1997-12-31T00:00:00.000000000',
                      '1998-12-31T00:00:00.000000000',
                      '1999-12-31T00:00:00.000000000',
                      '2000-12-31T00:00:00.000000000',
                      '2001-12-31T00:00:00.000000000',
                      '2002-12-31T00:00:00.000000000'])

df1 = pd.DataFrame(index=a)
df2 = pd.DataFrame(index=b)

common_dates = list(set(df1.index.values).intersection(df2.index.values))
print(common_dates)
[numpy.datetime64('2000-12-31T00:00:00.000000000'), 
 numpy.datetime64('2001-12-31T00:00:00.000000000'), 
 numpy.datetime64('2002-12-31T00:00:00.000000000')]

Also is possible use Index.intersection for common indices:

common_dates = df1.index.intersection(df2.index)
print(common_dates)
DatetimeIndex(['2000-12-31', '2001-12-31', '2002-12-31'], 
              dtype='datetime64[ns]', freq='A-DEC')

df1= df1.loc[common_dates]
df2= df2.loc[common_dates]
print (df1)
Empty DataFrame
Columns: []
Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]

print (df2)
Empty DataFrame
Columns: []
Index: [2000-12-31 00:00:00, 2001-12-31 00:00:00, 2002-12-31 00:00:00]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252