
I have two pandas data frames A and B. B is a subset of A.

I want to delete a number from A if it's in B. But if a number occurs two times in A and one time in B, then only one occurrence of the number should be deleted from A.

Here are my sample data sets:

df_A                df_B
[Test]              [Test]
1                   1
2                   2
3                   5
2                   5
4
5
5

After the operation I want a new data frame C:

df_C
[Test]
3
2
4

Can you please suggest how to do that?

The suggested duplicate removes all occurrences from A if present in B, not just the first N occurrences.
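To make the requirement concrete: the operation is multiset subtraction that preserves A's order, with B acting as a per-value removal budget. A minimal plain-Python sketch of that semantics (the list names a and b are illustrative, not pandas-specific):

```python
from collections import Counter

a = [1, 2, 3, 2, 4, 5, 5]  # values of df_A
b = [1, 2, 5, 5]           # values of df_B

budget = Counter(b)  # how many occurrences of each value to remove
result = []
for x in a:
    if budget[x] > 0:
        budget[x] -= 1    # consume one removal for this value
    else:
        result.append(x)  # budget exhausted: keep this occurrence

# result == [3, 2, 4]
```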

Stephen Rauch
Elina
    Possible duplicate of [In Pandas, how to delete rows from a Data Frame based on another Data Frame?](https://stackoverflow.com/questions/39880627/in-pandas-how-to-delete-rows-from-a-data-frame-based-on-another-data-frame) – wwii Jan 13 '18 at 16:13
    This is not a dupe, as the question asks to remove only the first N occurrences. – Stephen Rauch Jan 13 '18 at 17:58

3 Answers


If you create intermediate values that pair each value with its occurrence number, you can then use pandas.Series.isin() to build a boolean mask of which rows of the dataframe to exclude:

Code:

from collections import Counter

def occurrences_number(column):
    """ Pair each value with its occurrence number (1st, 2nd, ...) """

    def occurrence_number(value, accumulator):
        """ tuple of value and occurrence number of value """
        accumulator[value] += 1
        return value, accumulator[value]

    occurrences = Counter()
    return column.apply(lambda x: occurrence_number(x, occurrences))

def find_not_in_by_occurrence_number(data, not_in):
    # keep rows whose (value, occurrence) pair does not appear in not_in
    not_in_indices = ~occurrences_number(data).isin(occurrences_number(not_in))
    return data[not_in_indices].reset_index()

Test Code:

import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})
print(dfA)
print(dfB)

dfC = find_not_in_by_occurrence_number(dfA.A, dfB.B).A

print(dfC)

Results:

   A
0  1
1  2
2  3
3  2
4  4
5  5
6  5

   B
0  1
1  2
2  5
3  5

0    3
1    2
2    4
Name: A, dtype: int64
Stephen Rauch

I might do something like this (borrowing SR's setup):

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})

counts = dfA.groupby('A').cumcount()  # occurrence number of each value in A
limits = dfB['B'].value_counts().reindex(dfA.A).fillna(0).values  # removals allowed per value
dfC = dfA.loc[counts >= limits]  # keep occurrences beyond the removal budget

which gives me

In [121]: dfC
Out[121]: 
   A
2  3
3  2
4  4

This works by using groupby to get the number of times a given value in A has been seen before:

In [124]: dfA.groupby('A').cumcount()
Out[124]: 
0    0
1    0
2    0
3    1
4    0
5    0
6    1
dtype: int64

and using value_counts to get the limits which we then reindex to match the counts:

In [139]: dfB['B'].value_counts()
Out[139]: 
5    2
2    1
1    1
Name: B, dtype: int64

In [140]: dfB['B'].value_counts().reindex(dfA.A)
Out[140]: 
A
1    1.0
2    1.0
3    NaN
2    1.0
4    NaN
5    2.0
5    2.0
Name: B, dtype: float64
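The same occurrence-tagging idea can also be written as a left anti-join: give each row its occurrence number in both frames, then keep the rows of A whose (value, occurrence) pair has no match in B. A sketch using merge's indicator flag (not from the original answer; the column is named 'A' in both frames so they can be joined directly):

```python
import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'A': [1, 2, 5, 5]})

# tag each value with its occurrence number so duplicates become distinct keys
dfA['occ'] = dfA.groupby('A').cumcount()
dfB['occ'] = dfB.groupby('A').cumcount()

# left anti-join: keep rows of A whose (value, occurrence) pair is absent from B
merged = dfA.merge(dfB, on=['A', 'occ'], how='left', indicator=True)
dfC = merged.loc[merged['_merge'] == 'left_only', ['A']]

# dfC['A'].tolist() == [3, 2, 4]
```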
DSM

In this problem, you can make use of Counter and drop. For drop, you will need the indexes of the rows you want to discard.

import itertools
from collections import Counter
import pandas as pd
df = pd.DataFrame({'Test': {0: 1, 1: 2, 2: 3, 3: 2, 4: 4, 5: 5, 6: 5}})
df2 = pd.DataFrame({'Test': {0: 1, 1: 2, 2: 5, 3: 5}})
c_df2 = Counter(df2.Test)

indexes_to_remove_2d = [df.index[df['Test'] == k].tolist()[:v] 
                        for k, v in c_df2.items()]
# [[0], [1], [5, 6]]
merged = list(itertools.chain(*indexes_to_remove_2d))
# [0, 1, 5, 6]
df.drop(merged)

    Test
2   3
3   2
4   4

indexes_to_remove_2d generates the indexes that match each value k in the counter via df.index[df['Test'] == k], with [:v] limiting how many indexes we take for that value.

Then we merge these index lists with itertools.chain, and finally drop the rows at those indexes.

Thanks to Stephen Rauch for pointing out the order problem.

Tai