
I have two pandas data frames A and B. B is a subset of A.

I want to delete a number from A if it's in B. But if a number occurs two times in A and one time in B, then only one occurrence of the number should be deleted from A.

Here are my sample data sets:

df_A                df_B
[Test]              [Test]
1                   1
2                   2
3                   5
2                   5
4
5
5

After the operation I want a new data frame C:

df_C
[Test]
3
2
4

Can you please suggest how to do that?

The suggested duplicate removes all occurrences from A if present in B, not just the first N occurrences.
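To make the requirement concrete: the operation is multiset subtraction that preserves A's order, with B acting as a per-value removal budget. A minimal plain-Python sketch of that semantics (the list names a and b are illustrative, not pandas-specific):

```python
from collections import Counter

a = [1, 2, 3, 2, 4, 5, 5]  # values of df_A
b = [1, 2, 5, 5]           # values of df_B

budget = Counter(b)  # how many occurrences of each value to remove
result = []
for x in a:
    if budget[x] > 0:
        budget[x] -= 1    # consume one removal for this value
    else:
        result.append(x)  # budget exhausted: keep this occurrence

# result == [3, 2, 4]
```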

Stephen Rauch
Elina
    Possible duplicate of [In Pandas, how to delete rows from a Data Frame based on another Data Frame?](https://stackoverflow.com/questions/39880627/in-pandas-how-to-delete-rows-from-a-data-frame-based-on-another-data-frame) – wwii Jan 13 '18 at 16:13
    This is not a dupe, as the question asks to remove only the first N occurrences. – Stephen Rauch Jan 13 '18 at 17:58

3 Answers


If you create intermediate values that pair each value with its occurrence number, you can then use pandas.Series.isin() to build a boolean mask of which rows of the dataframe to exclude:

Code:

from collections import Counter

def occurrences_number(column):
    """ Pair each value with its occurrence number (1st, 2nd, ...) """

    def occurrence_number(value, accumulator):
        """ tuple of value and occurrence number of value """
        accumulator[value] += 1
        return value, accumulator[value]

    occurrences = Counter()
    return column.apply(lambda x: occurrence_number(x, occurrences))

def find_not_in_by_occurrence_number(data, not_in):
    # keep rows whose (value, occurrence) pair does not appear in not_in
    not_in_indices = ~occurrences_number(data).isin(occurrences_number(not_in))
    return data[not_in_indices].reset_index()

Test Code:

import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})
print(dfA)
print(dfB)

dfC = find_not_in_by_occurrence_number(dfA.A, dfB.B).A

print(dfC)

Results:

   A
0  1
1  2
2  3
3  2
4  4
5  5
6  5

   B
0  1
1  2
2  5
3  5

0    3
1    2
2    4
Name: A, dtype: int64
Stephen Rauch

I might do something like this (borrowing SR's setup):

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'B': [1, 2, 5, 5]})

counts = dfA.groupby('A').cumcount()  # occurrence number of each value in A
limits = dfB['B'].value_counts().reindex(dfA.A).fillna(0).values  # removals allowed per value
dfC = dfA.loc[counts >= limits]  # keep occurrences beyond the removal budget

which gives me

In [121]: dfC
Out[121]: 
   A
2  3
3  2
4  4

This works by using groupby to get the number of times a given value in A has been seen before:

In [124]: dfA.groupby('A').cumcount()
Out[124]: 
0    0
1    0
2    0
3    1
4    0
5    0
6    1
dtype: int64

and using value_counts to get the limits which we then reindex to match the counts:

In [139]: dfB['B'].value_counts()
Out[139]: 
5    2
2    1
1    1
Name: B, dtype: int64

In [140]: dfB['B'].value_counts().reindex(dfA.A)
Out[140]: 
A
1    1.0
2    1.0
3    NaN
2    1.0
4    NaN
5    2.0
5    2.0
Name: B, dtype: float64
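The same occurrence-tagging idea can also be written as a left anti-join: give each row its occurrence number in both frames, then keep the rows of A whose (value, occurrence) pair has no match in B. A sketch using merge's indicator flag (not from the original answer; the column is named 'A' in both frames so they can be joined directly):

```python
import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3, 2, 4, 5, 5]})
dfB = pd.DataFrame({'A': [1, 2, 5, 5]})

# tag each value with its occurrence number so duplicates become distinct keys
dfA['occ'] = dfA.groupby('A').cumcount()
dfB['occ'] = dfB.groupby('A').cumcount()

# left anti-join: keep rows of A whose (value, occurrence) pair is absent from B
merged = dfA.merge(dfB, on=['A', 'occ'], how='left', indicator=True)
dfC = merged.loc[merged['_merge'] == 'left_only', ['A']]

# dfC['A'].tolist() == [3, 2, 4]
```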
DSM

In this problem, you can make use of Counter and drop. For drop, you will need the indexes of the rows you want to discard.

import itertools
from collections import Counter
import pandas as pd
df = pd.DataFrame({'Test': {0: 1, 1: 2, 2: 3, 3: 2, 4: 4, 5: 5, 6: 5}})
df2 = pd.DataFrame({'Test': {0: 1, 1: 2, 2: 5, 3: 5}})
c_df2 = Counter(df2.Test)

indexes_to_remove_2d = [df.index[df['Test'] == k].tolist()[:v] 
                        for k, v in c_df2.items()]
# [[0], [1], [5, 6]]
merged = list(itertools.chain(*indexes_to_remove_2d))
# [0, 1, 5, 6]
df.drop(merged)

    Test
2   3
3   2
4   4

indexes_to_remove_2d generates the indexes that match each value k in the counter via df.index[df['Test'] == k], with [:v] limiting how many indexes we take for that value.

Then we merge these index lists with itertools.chain, and finally drop the rows at those indexes.

Thanks to Stephen Rauch for pointing out the order problem.

Tai