
I am trying to remove rows in my dataframe that are similar. There have been some mistakes in the data collection of my file, so I have this:

Dates   Last Price  Relative Share Price Momentum   RSI 30 Day  Relative 3 Month Eqty/Index     Relative 1 Month Eqty/Index     Sales/Diluted Sh    Revenue Growth Year over Year
1/31/2018   3881.0  -2.132  51.4152     4.526   -0.989  5.7376  -32.4057    0.6103  8.723   ...     1.3726  2.0628  0.9059  16.7236     2.6494  2.7217  26.2718     9.9759  17.553  23.475
2/28/2018   3883.0  3.251   51.4332     10.254  4.225   5.7376  -32.4057    0.6103  8.803   ...     1.3726  2.0852  0.8181  16.7322     2.6507  2.7231  26.2718     9.9759  13.771  23.045
*3/1/2018*  3883.0  3.251   51.4332     10.254  4.225   8.8678  4.7481  -14.9557    8.803   ...     1.0180  2.0852  0.8181  16.7322     2.6507  2.7231  15.5694     9.1429  13.771  23.045
*3/30/2018* 3700.0  5.646   49.6923     0.773   -2.346  8.8678  4.7481  -14.9557    8.388   ...     1.0180  1.9431  0.8499  17.2796     2.4121  2.5267  15.5694     9.1429  15.880  22.033
4/30/2018   4281.0  6.475   54.7253     10.663  8.728   8.8678  4.7481  -14.9557    10.599  ...     1.0180  2.1033  1.1068  19.9930     2.7909  2.9234  15.5694     9.1429  28.096  21.213
5/31/2018   4215.0  13.367  54.0894     2.241   -3.708  8.8678

The data is supposed to be monthly, but for some reason there are a few spots in the data where there are two values in the same month.

I want this:

Dates   Last Price  Relative Share Price Momentum   RSI 30 Day  Relative 3 Month Eqty/Index     Relative 1 Month Eqty/Index     Sales/Diluted Sh    Revenue Growth Year over Year
1/31/2018   3881.0  -2.132  51.4152     4.526   -0.989  5.7376  -32.4057    0.6103  8.723   ...     1.3726  2.0628  0.9059  16.7236     2.6494  2.7217  26.2718     9.9759  17.553  23.475
2/28/2018   3883.0  3.251   51.4332     10.254  4.225   5.7376  -32.4057    0.6103  8.803   ...     1.3726  2.0852  0.8181  16.7322     2.6507  2.7231  26.2718     9.9759  13.771  23.045
3/30/2018   3883.0  3.251   51.4332     10.254  4.225   8.8678  4.7481  -14.9557    8.803   ...     1.0180  2.0852  0.8181  16.7322     2.6507  2.7231  15.5694     9.1429  13.771  23.045
4/30/2018   4281.0  6.475   54.7253     10.663  8.728   8.8678  4.7481  -14.9557    10.599  ...     1.0180  2.1033  1.1068  19.9930     2.7909  2.9234  15.5694     9.1429  28.096  21.213
5/31/2018   4215.0  13.367  54.0894     2.241   -3.708  8.8678

I am guessing I should use something like df.drop_duplicates combined with df.loc. I need code that says: if the month in df['Dates'] is the same for two rows in a row, delete one of them (it doesn't really matter which one, tbh).
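A minimal sketch of that idea (assuming df['Dates'] holds date strings like '1/31/2018'):

import pandas as pd

# Parse the dates and keep a row only when its (year, month) differs
# from the previous row's (year, month).
months = pd.to_datetime(df['Dates'], format='%m/%d/%Y').dt.to_period('M')
df = df[months != months.shift()]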

EDIT: Since no one seems to know the answer, I changed my dataframe again:

  Month Day     Year    Price names     Variable   Variable   Variable
    1   31.0    1990.0  1.2143  AAPL    47.0287     -24.3754    3.5821  
    2   28.0    1990.0  1.2143  AAPL    47.0287     -19.8995    -0.8467     36.713  39.377
    3   31.0    1990.0  1.4375  AAPL    49.7818     18.7056     15.5790     0.3787  14.7951     40.891  42.742
    4   29.0    1990.0  1.4063  AAPL    49.4099     15.2067     0.5290  0.3787  ...     0.0371  0.7548  3.1297  14.7951     35.632  39.694
    4   30.0    1990.0  1.4732  AAPL    50.2341     11.4693     -4.0632     0.3787  ...     0.0371  0.7459  3.2787  14.7951     32.273  37.271
    5   31.0    1990.0  1.5982  AAPL    51.7520

Now with this format, hopefully it will be easier for someone to work it out. I want to delete a row if its df['Month'], df['Year'], and df['names'] are the same as in the previous row.

My thoughts are that one could do something like this question: "Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError"

I tried doing this with no luck:

df = df.drop(df[(df.Month == df.Year) & (df.Month == df.names)].index)
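That drop only removes rows where Month equals Year and Month equals names within the same row, which never happens here. Comparing each row to the previous one needs shift; a sketch, assuming the columns Month, Year and names shown above:

# Drop a row when Month, Year and names all match the previous row.
same_as_prev = (
    (df['Month'] == df['Month'].shift())
    & (df['Year'] == df['Year'].shift())
    & (df['names'] == df['names'].shift())
)
df = df[~same_as_prev]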

EDIT2: I was able to do this:

df[~df.duplicated(['Month', 'Year', 'names'], keep=False)]

This completely removes the rows that have duplicate months; however, it doesn't keep one row, it deletes both, which isn't exactly what I want. Maybe someone can tweak this so that one of the rows remains?
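For reference, keep=False marks every member of a duplicate set, so both rows go; keep='first' marks all but the first, leaving one row behind (note this matches duplicates anywhere in the frame, not only consecutive rows):

# Keep the first row of each (Month, Year, names) duplicate set, drop the rest.
df = df[~df.duplicated(['Month', 'Year', 'names'], keep='first')]
# or, equivalently:
df = df.drop_duplicates(['Month', 'Year', 'names'], keep='first')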

I appreciate all help!

themli
  • Would you mean something along the lines of (1) find the month, (2) if same month and year, (3) take last observation? – ifly6 Jul 11 '19 at 13:56
  • Kind of, it would need to be just two rows in a row, because I have several sections (in my case, several stocks). So there are already several rows with same month and year. – themli Jul 11 '19 at 14:12
  • Alternatively, it could be done as you said, but it would have to be done over a folder of several CSV files (as the file I have now is a merged file of several CSVs). – themli Jul 11 '19 at 14:14

1 Answer


Try querying the data frame with df.query:

df = df.query("(Month != Year) & (Month != names)")

Since the same months appear under different stock names, try grouping the data by name and marking the duplicate rows:

# Mark the rows with duplicate months within a stock-name group:
# diff() is 0 when a month repeats the previous row's month, fillna(1)
# keeps the first row of each group, and clip(0, 1) maps every other step to 1.
df['duplicate_months'] = df.groupby('name')['Month'].diff().fillna(1).clip(0, 1)

# Querying then eliminates the duplicate rows,
# keeping only the rows marked 1.
df = df.query('duplicate_months != 0')
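To illustrate the marker on a toy frame (hypothetical values):

import pandas as pd

toy = pd.DataFrame({'name': ['AAPL'] * 4, 'Month': [2, 3, 4, 4]})  # April appears twice
toy['duplicate_months'] = toy.groupby('name')['Month'].diff().fillna(1).clip(0, 1)
print(toy)
#    name  Month  duplicate_months
# 0  AAPL      2               1.0
# 1  AAPL      3               1.0
# 2  AAPL      4               1.0
# 3  AAPL      4               0.0   <- dropped by the query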
cool king
  • Thank you for the suggestion! But no the problem is still there: https://ibb.co/yPdDS3z (edit, new photo) – themli Jul 17 '19 at 13:03
  • I went through the question again, If the data in the duplicate row doesn't matter you can use drop_duplicates. I'll edit my answer – cool king Jul 18 '19 at 06:14
  • Thanks again! But to clarify, it matters in the sense that I have sections of data (different stocks, see name column) with the same month and year so if and when I run drop_duplicates, I am only left with one stock's data. Because each stock's data has the same time period. The condition needs to be, drop if ['Month'], ['Year'] was the same as the row above. – themli Jul 18 '19 at 09:54
  • I have made an edit to my answer; I am assuming that there is only one level of grouping, i.e. by 'name'. – cool king Jul 20 '19 at 07:50
  • That almost worked! It deletes the duplicates, but for some reason it also deletes the first month (January) of all years except the first year (1990). This happens for all the sections. After running your code I opened the df in Excel, see: https://ibb.co/sQC0q8v But at least now each section is the same length, so it's fine to leave it like this! I'll confirm your answer. Let me know if you figure out that bug, though. – themli Jul 21 '19 at 14:11
  • I had assumed that there was only one grouping level, i.e. 'name', but it seems like you had duplicate values within a 'year' in a 'name'. Replace groupby('name') with groupby(['name', 'year']). This should do the trick. – cool king Jul 22 '19 at 07:16
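For completeness, the fix suggested in that last comment would look like this (a sketch; it assumes the grouping columns are named 'name' and 'Year', matching the code and data above):

# Grouping by both name and year restarts diff() at the start of each year,
# so a January following a December (diff of 1 - 12 = -11, clipped to 0)
# is no longer flagged as a duplicate.
df['duplicate_months'] = df.groupby(['name', 'Year'])['Month'].diff().fillna(1).clip(0, 1)
df = df.query('duplicate_months != 0')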