ok, I've been trying this for too long, time to ask for help. I have a dataframe that looks a bit like this:
person fruit quantity all_fruits 0 p1 grapes 2 [grapes, banana] 1 p1 banana 1 [grapes, banana] 2 p2 apple 4 [apple, banana, peach] 3 p2 banana 4 [apple, banana, peach] 4 p2 peach 2 [apple, banana, peach] 5 p3 grapes 1 [grapes] 6 p4 banana 1 [banana] 7 p5 apple 3 [apple, peach] 8 p5 peach 2 [apple, peach]
Then I have a list of 'fruits of interest':
fruits_of_interest: ['apple', 'banana']
What I need to do is:
create a column for each fruit of interest, and for each person on column 1(person), assign if she has or not that fruit
for each person on column 1, assign the log(1+x) of the quantity of fruit of interest that such person has under the column of that fruit
I'm struggling to make that work! It doesn't help that my actual dataframe is quite large with close to 800k rows, and the list of "fruits of interest" has over 300 "fruits".
For the first part, I used this function and could get all the columns with the booleans for having or not the fruit:
def has_fruit(fruit, row): one_string = '\t'.join(row) return fruit in one_string
def process_fruits(df, fruits_of_interest): for fruit in fruits_of_interest: df[fruit] = [has_fruit(fruit, x) for x in df['all_fruits']] return df
The second part where I need to assign the values is the part that I can't make work at all! I've tried doing all at once, with this other function but it's not quite doing what it should:
def process_fruits2(df, fruits_of_interest): for fruit in fruits_of_interest: if [has_fruit(fruit, x) for x in df['all_fruits']]: df[fruit] = np.log1p(df.loc[df['fruit'] == fruit].quantity) return df
What am I doing wrong and how could I accomplish this?
Adding expected output:
It would be a dataframe sort of like this (like the answer below, but only containing the fruits from the list fruits_of_interest):
person apple banana p1 0.000000 0.693147 p2 1.609438 1.609438 p3 0.000000 0.000000 p4 0.000000 0.693147 p5 1.386294 0.000000