0

I have a dataframe in which each user can log multiple entries across a few days. It looks a bit like this:

    DATE        USER    TYPE     MORNING   AFTN   NIGHT   FOOD      BVG
01/11/2019        A       Bvg        1       0       1     -       Water
01/11/2019        A       Bvg        1       0       0     -       Juice
01/11/2019        A       Food       0       1       1     Rice     -
01/11/2019        A       Food       1       0       0     Noodle   -
02/11/2019        A       Bvg        1       0       0     -       Coffee
02/11/2019        A       Food       0       0       1     Bread    -
02/11/2019        A       Bvg        0       0       1     -        Tea
01/11/2019        B       Bvg        1       0       0     -        Water     
01/11/2019        B       Bvg        0       1       0     -        Tea
01/11/2019        B       Food       1       0       0     Rice      -

I need to identify all the combinations of Bvg/Food within the same day and same time. The problem is that the TIME data come in multiple binary columns -- I'll need to find a bvg/food combination that both has the value '1' within the sub-group, and identify which time (MORNING/AFTN/NIGHT) does the combination fall onto.

The final dataframe should have the food/bvg combination, and a single time column. For example for the dataset above, the expected outcome would be:

 TIME            COMBINATION     
MORNING         Water, Noodle     
MORNING         Juice, Noodle    
 NIGHT          Water, Rice             
 NIGHT          Tea, Bread
MORNING         Water, Rice                     

The combinations needs to be of only 1 food & 1 beverage

I've tried conflating the TIME column into a new column of a joint string, for example:

DATE        USER    TYPE     MORNING   AFTN   NIGHT   FOOD      BVG     TIME
01/11/2019    A       Bvg        1       0       1     -       Water   MORNING, NIGHT

And grouped the data by date, user, and time -- but I cannot limit the combination to just two.

I'm quite new to python and I'm flummoxed if there's any way to do this, any help or clue would be very appreciated

vins
  • 3
  • 2

1 Answers1

0

You can find all beverage-food combinations per day, per time of day, and user (you don't explicitly mention by user but your expected outcome requires it) by:

  1. Converting the data into long format
  2. Performing a cross join of all food and beverages, joining on day, time of day, and user
  3. Dropping duplicates

The code below returns the desired result, let me know of this helped!

# PREPARATORY STEP: LOAD DATA USING SOLUTION FROM https://stackoverflow.com/a/53692642/8718701
from io import StringIO
import pandas as pd


d = '''
    DATE        USER    TYPE     MORNING   AFTN   NIGHT   FOOD      BVG
01/11/2019        A       Bvg        1       0       1     -       Water
01/11/2019        A       Bvg        1       0       0     -       Juice
01/11/2019        A       Food       0       1       1     Rice     -
01/11/2019        A       Food       1       0       0     Noodle   -
02/11/2019        A       Bvg        1       0       0     -       Coffee
02/11/2019        A       Food       0       0       1     Bread    -
02/11/2019        A       Bvg        0       0       1     -        Tea
01/11/2019        B       Bvg        1       0       0     -        Water     
01/11/2019        B       Bvg        0       1       0     -        Tea
01/11/2019        B       Food       1       0       0     Rice      -
'''

df = pd.read_csv(StringIO(d), sep='\s+')

# STEP 1 - CONVERT DATA INTO LONG FORMAT
df = pd.melt(
    df,
    id_vars=['DATE', 'USER', 'FOOD', 'BVG'],
    value_vars=['MORNING', 'AFTN', 'NIGHT'],
    var_name='TIME_OF_DAY',
    value_name='VALUE'
)

df = df.loc[df['VALUE'] == 1, :]  # remove empty rows
df.drop(columns=['VALUE'], inplace=True)

df = pd.melt(
    df,
    id_vars=['DATE', 'USER', 'TIME_OF_DAY'],
    value_vars=['FOOD', 'BVG'],
    var_name='PRODUCT_TYPE',
    value_name='PRODUCT'
)

df = df.loc[df['PRODUCT'] != '-', :]  # remove empty rows

# STEP 2 - CROSS JOIN OF BVG AND FOOD TO IDENTIFY ALL COMBINATIONS FOR SAME DAY, TIME, AND USER
df = pd.merge(
    df.loc[df['PRODUCT_TYPE']=='BVG', :],
    df.loc[df['PRODUCT_TYPE']=='FOOD', :],
    on=['DATE', 'TIME_OF_DAY', 'USER'])

df['PRODUCT_COMBINATION'] = df.agg('{0[PRODUCT_x]}, {0[PRODUCT_y]}'.format, axis=1)

# STEP 3 - SELECT COLUMNS REQUIRED FOR FINAL OUTPUT AND DROP DUPLICATES
df = df.loc[:, ['DATE', 'TIME_OF_DAY', 'PRODUCT_COMBINATION']]
df.drop_duplicates(inplace=True)

print(df.to_markdown(index=False))

Returns:

| DATE       | TIME_OF_DAY   | PRODUCT_COMBINATION   |
|:-----------|:--------------|:----------------------|
| 01/11/2019 | MORNING       | Water, Noodle         |
| 01/11/2019 | MORNING       | Juice, Noodle         |
| 01/11/2019 | MORNING       | Water, Rice           |
| 01/11/2019 | NIGHT         | Water, Rice           |
| 02/11/2019 | NIGHT         | Tea, Bread            |
Simon David
  • 663
  • 3
  • 13