5

ok, I've been trying this for too long, time to ask for help. I have a dataframe that looks a bit like this:

  person  fruit   quantity    all_fruits
0 p1      grapes  2           [grapes, banana]
1 p1      banana  1           [grapes, banana]
2 p2      apple   4           [apple, banana, peach]
3 p2      banana  4           [apple, banana, peach]
4 p2      peach   2           [apple, banana, peach]
5 p3      grapes  1           [grapes]
6 p4      banana  1           [banana]
7 p5      apple   3           [apple, peach]
8 p5      peach   2           [apple, peach]

Then I have a list of 'fruits of interest':

fruits_of_interest: ['apple', 'banana']

What I need to do is:

  1. create a column for each fruit of interest, and for each person on column 1(person), assign if she has or not that fruit

  2. for each person on column 1, assign the log(1+x) of the quantity of fruit of interest that such person has under the column of that fruit

I'm struggling to make that work! It doesn't help that my actual dataframe is quite large with close to 800k rows, and the list of "fruits of interest" has over 300 "fruits".

For the first part, I used this function and could get all the columns with the booleans for having or not the fruit:

def has_fruit(fruit, row):
        one_string = '\t'.join(row)
        return fruit in one_string
def process_fruits(df, fruits_of_interest):
    for fruit in fruits_of_interest:
        df[fruit] = [has_fruit(fruit, x) for x in df['all_fruits']]
    return df

The second part where I need to assign the values is the part that I can't make work at all! I've tried doing all at once, with this other function but it's not quite doing what it should:

def process_fruits2(df, fruits_of_interest):
    for fruit in fruits_of_interest:
        if [has_fruit(fruit, x) for x in df['all_fruits']]:
            df[fruit] = np.log1p(df.loc[df['fruit'] == fruit].quantity)

    return df

What am I doing wrong and how could I accomplish this?

Adding expected output:

It would be a dataframe sort of like this (like the answer below, but only containing the fruits from the list fruits_of_interest):

person  apple     banana                                        
p1      0.000000  0.693147
p2      1.609438  1.609438
p3      0.000000  0.000000
p4      0.000000  0.693147
p5      1.386294  0.000000

1 Answers1

2

Here is one approach. I created a pivot table with person (rows) vs fruit (columns):

from io import StringIO
import numpy as np
import pandas as pd

# create data frame
data = '''person  fruit   quantity
p1      grapes  2
p1      banana  1
p2      apple   4
p2      banana  4
p2      peach   2
p3      grapes  1
p4      banana  1
p5      apple   3
p5      peach   2
'''
df = pd.read_csv(StringIO(data), sep='\s+', engine='python')

Calculate pivot table and log (1 + x):

# create summary table: person x fruit
df = df.pivot_table(index='person', columns='fruit', 
                    values='quantity', aggfunc=sum, fill_value=0)

# compute log(1 + fruit)
print(df, end='\n\n')
print(np.log(1 + df))

fruit   apple  banana  grapes  peach
person                              
p1          0       1       2      0
p2          4       4       0      2
p3          0       0       1      0
p4          0       1       0      0
p5          3       0       0      2

fruit      apple    banana    grapes     peach
person                                        
p1      0.000000  0.693147  1.098612  0.000000
p2      1.609438  1.609438  0.000000  1.098612
p3      0.000000  0.000000  0.693147  0.000000
p4      0.000000  0.693147  0.000000  0.000000
p5      1.386294  0.000000  0.000000  1.098612
jsmart
  • 2,921
  • 1
  • 6
  • 13