1

I have a data frame which looks like the below:

Customer 1  Customer 2  Customer 3
A               B            C
B               C            D 
C               D            E
D               E            F
E               F            G

There are customers coming to a store continuously. I want to create a row of the first 3 customers coming in the store in an hour. As the customers keep coming in continuously, it keeps taking group 3 and making rows. Though I do not want to form strict hour lining like 1-2, 2-3, etc.

I just want if customer B and C are covered in row 1, they should not be counted in row 2. I want to delete rows that have overlapping items and only keep the unique ones. So my expected output would be:

Customer 1  Customer 2  Customer 3
A               B            C
D               E            F
G               

How to achieve this, please help. Thanks

zsh_18
  • 1,012
  • 1
  • 11
  • 29

2 Answers2

0

Here is my take on this, NOT ready yet but should point you in the right direction any edits are welcome

First, let's set up the data

df = pd.DataFrame(data={
    "Customer 1": ["A", "B","C","D", "E"],
    "Customer 2": [ "B","C","D", "E", "F"],
    "Customer 3": ["C","D", "E", "F", "G"],
})

Working on NumPy will be much better so let's create a variable with a NumPy 2d array

df_np = df.values

df_np.flatten()[:6] # This will flatten the list and will only take the first 6 items to be able to reshape it later

np.unique(df_np) # Removes all duplicates so we will be only left with data shape that can be rebuilt into a DataFrame

Now let's reshape it back to the original shape

np.reshape(c, (-1, 3))

You can now rebuild the dataframe

pd.DataFrame(data=c, columns=df.columns)

I couldn't find a way to take care of the G, and as I've said before not a complete solution so any edits are welcome

Abod
  • 58
  • 1
  • 7
0

Explanation:

First we get all the unique values across rows. Group the unique values numpy array together taking 3 at a time as requested and pad the remaining unfilled columns with invalid values and then convert back to dataframe.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Customer 1" : ["A","B","C","D","E"],
                  "Customer 2" : ["B","C","D","E","F"],
                  "Customer 3" : ["C","D","E","F","G"]})



unique_vals = pd.unique(df[['Customer 1', 'Customer 2', 'Customer 3']].values.ravel('K'))

new_shape = unique_vals.size + 3 - unique_vals.size % 3

new_df_source = np.full(new_shape, fill_value = "invalid")

new_df_source.flat[:unique_vals.size] = unique_vals

new_df_source = new_df_source.reshape(-1,3)
output_df = pd.DataFrame(new_df_source, columns=df.columns)

Result:

  Customer 1 Customer 2 Customer 3
0          A          B          C
1          D          E          F
2          G    invalid    invalid

Caveat: The rows in output_df may not be present at all in input df since we are looking at the unique values and grouping back together, though we are still maintaining relative order of unique values.

รยקคгรђשค
  • 1,919
  • 1
  • 10
  • 18