I am trying to remove rows in my dataframe that are similar. There has been some mistakes in the data collection of my file, so I have this:
Dates Last Price Relative Share Price Momentum RSI 30 Day Relative 3 Month Eqty/Index Relative 1 Month Eqty/Index Sales/Diluted Sh Revenue Growth Year over Year
1/31/2018 3881.0 -2.132 51.4152 4.526 -0.989 5.7376 -32.4057 0.6103 8.723 ... 1.3726 2.0628 0.9059 16.7236 2.6494 2.7217 26.2718 9.9759 17.553 23.475
2/28/2018 3883.0 3.251 51.4332 10.254 4.225 5.7376 -32.4057 0.6103 8.803 ... 1.3726 2.0852 0.8181 16.7322 2.6507 2.7231 26.2718 9.9759 13.771 23.045
*3/1/2018* 3883.0 3.251 51.4332 10.254 4.225 8.8678 4.7481 -14.9557 8.803 ... 1.0180 2.0852 0.8181 16.7322 2.6507 2.7231 15.5694 9.1429 13.771 23.045
*3/30/2018* 3700.0 5.646 49.6923 0.773 -2.346 8.8678 4.7481 -14.9557 8.388 ... 1.0180 1.9431 0.8499 17.2796 2.4121 2.5267 15.5694 9.1429 15.880 22.033
4/30/2018 4281.0 6.475 54.7253 10.663 8.728 8.8678 4.7481 -14.9557 10.599 ... 1.0180 2.1033 1.1068 19.9930 2.7909 2.9234 15.5694 9.1429 28.096 21.213
5/31/2018 4215.0 13.367 54.0894 2.241 -3.708 8.8678
The data is supposed to be monthly, but for some reason there are a few spots in the data where there are two values in the same month.
I want this:
Dates Last Price Relative Share Price Momentum RSI 30 Day Relative 3 Month Eqty/Index Relative 1 Month Eqty/Index Sales/Diluted Sh Revenue Growth Year over Year
1/31/2018 3881.0 -2.132 51.4152 4.526 -0.989 5.7376 -32.4057 0.6103 8.723 ... 1.3726 2.0628 0.9059 16.7236 2.6494 2.7217 26.2718 9.9759 17.553 23.475
2/28/2018 3883.0 3.251 51.4332 10.254 4.225 5.7376 -32.4057 0.6103 8.803 ... 1.3726 2.0852 0.8181 16.7322 2.6507 2.7231 26.2718 9.9759 13.771 23.045
3/30/2018 3883.0 3.251 51.4332 10.254 4.225 8.8678 4.7481 -14.9557 8.803 ... 1.0180 2.0852 0.8181 16.7322 2.6507 2.7231 15.5694 9.1429 13.771 23.045
4/30/2018 4281.0 6.475 54.7253 10.663 8.728 8.8678 4.7481 -14.9557 10.599 ... 1.0180 2.1033 1.1068 19.9930 2.7909 2.9234 15.5694 9.1429 28.096 21.213
5/31/2018 4215.0 13.367 54.0894 2.241 -3.708 8.8678
I am guessing I should use something df.drop_duplicates
combined with df.loc
.
I need to make a code saying; if "month" in df['Dates'] is the same two rows in a row, delete one of them (it doesn't really matter which one tbh).
EDIT2: Since no one seems to know the answer, I changed my dataframe again:
Month Day Year Price names Variable Variable Variable
1 31.0 1990.0 1.2143 AAPL 47.0287 -24.3754 3.5821
2 28.0 1990.0 1.2143 AAPL 47.0287 -19.8995 -0.8467 36.713 39.377
3 31.0 1990.0 1.4375 AAPL 49.7818 18.7056 15.5790 0.3787 14.7951 40.891 42.742
4 29.0 1990.0 1.4063 AAPL 49.4099 15.2067 0.5290 0.3787 ... 0.0371 0.7548 3.1297 14.7951 35.632 39.694
4 30.0 1990.0 1.4732 AAPL 50.2341 11.4693 -4.0632 0.3787 ... 0.0371 0.7459 3.2787 14.7951 32.273 37.271
5 31.0 1990.0 1.5982 AAPL 51.7520
Now with this format, hopefully it will be easier for someone to work it out. I want to delete row if df['Month'] = df['Year'] = df['names'].
My thoughts are that one could do something like this: Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError
I tried doing this with no luck:
df = df.drop(df[(df.Month == df.Year) & (df.Month == df.names)].index)
EDIT2: I was able to do this:
df[~df.duplicated(['Month', 'Year', 'Name'], keep=False)]
Which completely removes the rows which have duplicate months, however it doesn't keep one row, it just deletes both, which isn't exactly what I want. Maybe someone can tweak this such that one of the rows remain?
I appreciate all help!