2

I have two dataframes that are contain market daily end of day data. They are supposed to contain identical starting dates and ending dates and number of rows, but when I print the len of each, one is bigger by one than the other:

DF1
            close
date              
2008-01-01   45.92
2008-01-02   45.16
2008-01-03   45.33
2008-01-04   42.09
2008-01-07   46.98
...
[2870 rows x 1 columns]

DF2

             close
date              
2008-01-01   60.48
2008-01-02   59.71
2008-01-03   58.43
2008-01-04   56.64
2008-01-07   56.98
...
[2871 rows x 1 columns]

How can I show which row either:

  • has a duplicate row,
  • or has an extra date

so that I can delete the [probable] weekend/holiday date row that is in DF2 but not in DF1?

I have tried things like:

df1 = df1.drop_duplicates(subset='date', keep='first')
df2 = df1.drop_duplicates(subset='date', keep='first')

but can't get it to work [ValueError: not enough values to unpack (expected 2, got 0)].

Extra:

How do I remove weekend dates from a dataframe?

Ivan
  • 7,448
  • 14
  • 69
  • 134

1 Answers1

2

May using .loc

DF2=DF2.loc[DF1.index]

If check index different between DF1 and DF2

DF2.index.difference(DF1.index)

Check whether DF2 have duplicate index

DF2[DF2.index.duplicated(keep=False)]

Check the weekends

df.index.weekday_name.isin(['Sunday','Saturday'])

Fix your code

df1 = df1.reset_index().drop_duplicates(subset='date', keep='first').reset_index('date')
df2 = df2.reset_index().drop_duplicates(subset='date', keep='first').reset_index('date')

Also for this I recommend duplicated

df2 =df2 [df2.index.duplicated()]

About the business

def B_day(date):
    return bool(len(pd.bdate_range(date, date)))

df.index.map(B_day)
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thank you that worked. It turns out there is a weekend date in there. If you feel up to it, see edited post for "extra". – Ivan Jan 07 '19 at 02:55
  • @Ivan you mean the error ? I think I just fixed it . – BENY Jan 07 '19 at 02:56
  • the problem is that I don't necessarily want to drop _any_ date. First, I want to see if the data contains a row with a date that is a weekend date. If that solves, the problem, then probably stop. If not, then go on to remove duplicate dates. Data cleaning is hard... – Ivan Jan 07 '19 at 02:59
  • @Ivan if that still dose not address your problem , you may want to check B-day https://stackoverflow.com/questions/13019719/get-business-days-between-start-and-end-date-using-pandas – BENY Jan 07 '19 at 03:04
  • 1
    @Ivan you can also create a date_range to find any dates that fall outside your date range and also those that are not a business day: `df2[~df2.index.isin(pd.date_range(start, end, freq='B'))]` – It_is_Chris Jan 07 '19 at 03:06
  • Thanks Chris that is useful. I am actually surprised no one has built a library for market data "cleaning". – Ivan Jan 07 '19 at 18:54