Delete rows with overlapping items

Question

I have a data frame that looks like this:

Customer 1  Customer 2  Customer 3
A           B           C
B           C           D
C           D           E
D           E           F
E           F           G

Customers come into a store continuously. I want each row to hold the first 3 customers who come into the store within an hour. Because customers keep arriving, the data keeps taking groups of 3 and making rows from them. I do not want strict hourly bins such as 1-2, 2-3, etc., though.

If customers B and C are already covered in row 1, they should not be counted again in row 2. I want to delete the rows that have overlapping items and keep only the unique ones. So my expected output would be:

Customer 1  Customer 2  Customer 3
A           B           C
D           E           F
G

How can I achieve this? Please help. Thanks.

I wonder why are you using pandas here, regardless what about the G? — รยקคгรђשค, Oct 26 '20 at 05:44
ok so shouldn't there be another row in the required output but then it would repeat. — รยקคгรђשค, Oct 26 '20 at 05:46
does flattened unique values as a list work for you or do you want in the exact same format as in the question? — รยקคгรђשค, Oct 26 '20 at 05:51
For the group of 3 columns, I want unique values as a group. One customer would be counted in only 1 row. How to use flattened unique values? — zsh_18, Oct 26 '20 at 05:53
@zsh-18 I have added an approach to work with the unique values and convert them back to groups of 3. — รยקคгรђשค, Oct 26 '20 at 07:50

score 0 · Answer 1 · answered Oct 26 '20 at 06:45

Here is my take on this. It is NOT complete yet, but it should point you in the right direction. Any edits are welcome.

First, let's set up the data

import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    "Customer 1": ["A", "B", "C", "D", "E"],
    "Customer 2": ["B", "C", "D", "E", "F"],
    "Customer 3": ["C", "D", "E", "F", "G"],
})

Working with NumPy is easier here, so let's get the underlying 2D NumPy array

df_np = df.values

flat = df_np.flatten() # Flattens the 2D array row by row into a 1D array

c = np.unique(flat)[:6] # Removes all duplicates (the result is sorted) and keeps the first 6 items, so the data can be reshaped into rows of 3

Now let's reshape it into rows of 3

c = np.reshape(c, (-1, 3))

You can now rebuild the dataframe

pd.DataFrame(data=c, columns=df.columns)

I couldn't find a way to take care of the G. As I said above, this is not a complete solution, so any edits are welcome.
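One thing to watch with np.unique is that it returns the values sorted, not in the order they first appear. That makes no difference for A to G, but it could regroup customers whose names are not in alphabetical order. A small sketch of the difference, using a made-up `arrivals` array (not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical arrival order, chosen so that it is not alphabetical
arrivals = np.array(["Z", "A", "M", "A", "Z"], dtype=object)

sorted_unique = np.unique(arrivals)    # duplicates removed, values sorted
ordered_unique = pd.unique(arrivals)   # duplicates removed, order of first appearance kept

print(list(sorted_unique))
print(list(ordered_unique))
```

If the arrival order matters, pd.unique is probably the safer choice for the deduplication step.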

รยקคгรђשค · Answer 2 · 2020-10-26T07:53:32.343

Explanation:

First we get all the unique values across the rows. We then group the array of unique values three at a time, as requested, pad the unfilled cells of the last row with the placeholder string "invalid", and convert the result back to a dataframe.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Customer 1" : ["A","B","C","D","E"],
                  "Customer 2" : ["B","C","D","E","F"],
                  "Customer 3" : ["C","D","E","F","G"]})



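# Unique customers in order of first appearance, reading the frame row by row
unique_vals = pd.unique(df[['Customer 1', 'Customer 2', 'Customer 3']].values.ravel())

# Round the length up to the next multiple of 3 (adds nothing if it already is one)
new_shape = unique_vals.size + (-unique_vals.size) % 3

# dtype=object so that longer customer names are not truncated to the length of "invalid"
new_df_source = np.full(new_shape, fill_value="invalid", dtype=object)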

new_df_source.flat[:unique_vals.size] = unique_vals

new_df_source = new_df_source.reshape(-1,3)
output_df = pd.DataFrame(new_df_source, columns=df.columns)

Result:

  Customer 1 Customer 2 Customer 3
0          A          B          C
1          D          E          F
2          G    invalid    invalid

Caveat: the rows in output_df may not appear in the input df at all, because we collect the unique values and regroup them. The relative order of the unique values is still preserved.
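If the output must contain only rows that already exist in the input, a different approach is to walk the rows in order and keep a row only when none of its customers has been seen in an earlier kept row. This is a minimal sketch of that idea, not part of the answer above. The helper name `drop_overlapping_rows` is made up for illustration, and note that the leftover G is dropped rather than padded.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer 1": ["A", "B", "C", "D", "E"],
    "Customer 2": ["B", "C", "D", "E", "F"],
    "Customer 3": ["C", "D", "E", "F", "G"],
})


def drop_overlapping_rows(frame):
    """Keep a row only if it shares no value with any previously kept row."""
    seen = set()   # customers already covered by a kept row
    keep = []      # one boolean per input row
    for row in frame.itertuples(index=False):
        items = set(row)
        if seen.isdisjoint(items):
            keep.append(True)
            seen |= items
        else:
            keep.append(False)
    # Boolean mask selects the kept rows; the index is renumbered from 0
    return frame[keep].reset_index(drop=True)


filtered = drop_overlapping_rows(df)
print(filtered)
```

For the sample data this keeps the rows A, B, C and D, E, F. The row E, F, G is discarded because E and F are already covered, so G never appears in the result.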
