I have a question which is closely related to this post: Pandas conditional creation of a series/dataframe column

The difference from that question is that I would like to use the value of one column to assign values in MANY other columns. For efficiency reasons, I'd like to avoid writing a for-loop with many if-statements over all entries.

I have a dataset like this:

import pandas as pd
df = pd.DataFrame(columns=['Type', 'Set', 'Q1', 'Q2', 'Q3', 'color', 'number'])
df['Type'] = ['A', 'B', 'B', 'C', 'D', 'E', 'C', 'D']

Which produces:

  Type  Set   Q1   Q2   Q3 color number
0    A  NaN  NaN  NaN  NaN   NaN  NaN
1    B  NaN  NaN  NaN  NaN   NaN  NaN
2    B  NaN  NaN  NaN  NaN   NaN  NaN
3    C  NaN  NaN  NaN  NaN   NaN  NaN
4    D  NaN  NaN  NaN  NaN   NaN  NaN
5    E  NaN  NaN  NaN  NaN   NaN  NaN
6    C  NaN  NaN  NaN  NaN   NaN  NaN
7    D  NaN  NaN  NaN  NaN   NaN  NaN

Based on the information in Type, I want to create values for various other columns.

For example, for Type==A, I'd like several things to happen to the respective rows of the dataframe: df['Set'] = 'Z', df['Q1'] = 0, df['Q2'] = 0, df['Q3'] = random.choice([True, False]), df['color'] = 'green' and df['number'] = call_on_some_function_I_defined(input=df['Q1'])

When Type==B, I'd like other things to happen to those same columns: df['Set'] = 'X', df['Q1'] = random.choice([0, 250, 500, 750, 1000]), etc.

Ideally, I'd like to do something along these lines:

import numpy as np

conditions = [
    (df['Type'] == 'A'),
    (df['Type'] == 'B'),
    (df['Type'] == 'C')] #etc.
# Pseudocode: assignments inside a list are not valid Python; this only shows the intent.
choices_A = [df['Set'] = 'Z', df['Q1'] = 0, df['Q2'] = 0, df['Q3'] = random.choice([True, False]), df['color'] = 'green', df['number'] = call_on_some_function_I_defined(input=df['Q1'])]
choices_B = [df['Set'] = 'X', df['Q1'] = random.choice([0, 250, 500, 750, 1000]), df['Q2'] = random.choice([0, 250, 500, 750, 1000]), df['Q3'] = False, df['color'] = 'red', df['number'] = call_on_some_function_I_defined(input=df['Q2'])]

df = np.select(conditions[0], choices_A, default=0)
df = np.select(conditions[1], choices_B, default=0)

To create output like:

  Type Set   Q1   Q2     Q3  color number
0    A   Z    0    0   True  green     17
1    B   X  750    0  False    red     85
2    B   X  500  250  False    red     93   #etc

While numpy.select with its conditions and choices is perfect for conditional assignment of values of a single dataframe column, I haven't found a neat way to make conditions work for assigning values to multiple dataframe columns.
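For reference, the single-column case that already works can be sketched as follows. This is a minimal example: the 'Z' and 'X' values come from the description above, while the `'unknown'` default is an assumed placeholder for the types not listed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': ['A', 'B', 'B', 'C', 'D', 'E', 'C', 'D']})

# One boolean mask per type, paired with one choice per mask.
conditions = [df['Type'] == 'A', df['Type'] == 'B']
choices = ['Z', 'X']

# 'unknown' is an assumed placeholder for rows matching no condition.
# The default is a string so that it shares a dtype with the choices.
df['Set'] = np.select(conditions, choices, default='unknown')
print(df)
```

This fills exactly one column per call, which is why it does not carry over directly to the multi-column case.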

Lena
  • This is a project, not a question. You will need `if` statements in one form or another; the task will require them nonetheless. I understand that you want a better way, but in my opinion this is the wrong forum. – piRSquared Aug 01 '19 at 14:07
  • @piRSquared, I am confused as to why this question is off-topic for this forum just because it appears to be a project to you. My question is about one or maybe two lines of code, namely whether there is some adjustment I can make to `np.select`, or whether I am missing a function that can apply this sort of conditional statement to multiple columns at once. – Lena Aug 01 '19 at 14:23
  • It is my opinion that your question is asking too much at once. That could be a result of your not providing a [mcve]. Regardless, I find the question confusing and too involved. Rather than downvoting the question or voting to close and walking away, I wanted to give you feedback so that you knew where I was coming from. I'm just one person, so you can ignore me if you like. I'm just trying to be helpful. – piRSquared Aug 01 '19 at 14:28
  • If every row of the same type were the same, you could potentially get some speedup with builtin methods. Since every row is different (due to ``random`` depending on the type), this isn't a vectorisable/parallelisable operation, and you might just as well use a ``for`` loop. Random values are *very* expensive to create, so the loop should not matter by itself. You might have some luck creating dataframes for each type separately, *then* merging them or selecting from them. – MisterMiyagi Aug 01 '19 at 14:28
  • MisterMiyagi, would you say that there is a way of making this work if the dependency on previously calculated variables (such as `df[number]` depending on the `Q2` column) did not exist? – Lena Aug 01 '19 at 14:39
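
The idea from the comments of handling each type separately might be sketched with boolean masks and `DataFrame.loc`, filling one type's rows at a time instead of looping over individual rows. This is only a sketch under assumptions: `compute_number` is a hypothetical stand-in for the user-defined function, the seed is arbitrary, and only types A and B are filled in.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

cols = ['Type', 'Set', 'Q1', 'Q2', 'Q3', 'color', 'number']
df = (pd.DataFrame({'Type': ['A', 'B', 'B', 'C', 'D', 'E', 'C', 'D']})
        .reindex(columns=cols)
        .astype(object))  # object columns can hold mixed value types


def compute_number(q):
    """Hypothetical stand-in for the user-defined function."""
    return q.astype(int) + 1


steps = [0, 250, 500, 750, 1000]

# Type A: constants first, then random and derived columns.
is_a = df['Type'] == 'A'
for col, value in {'Set': 'Z', 'Q1': 0, 'Q2': 0, 'color': 'green'}.items():
    df.loc[is_a, col] = value
df.loc[is_a, 'Q3'] = rng.choice([True, False], size=int(is_a.sum()))
df.loc[is_a, 'number'] = compute_number(df.loc[is_a, 'Q1'])

# Type B: one random draw per matching row, so rows can differ.
is_b = df['Type'] == 'B'
n_b = int(is_b.sum())
for col, value in {'Set': 'X', 'Q3': False, 'color': 'red'}.items():
    df.loc[is_b, col] = value
df.loc[is_b, 'Q1'] = rng.choice(steps, size=n_b)
df.loc[is_b, 'Q2'] = rng.choice(steps, size=n_b)
df.loc[is_b, 'number'] = compute_number(df.loc[is_b, 'Q2'])

print(df)
```

The remaining loop runs over columns rather than rows, and columns that depend on earlier ones (such as `number`) are simply assigned after the columns they read from.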

0 Answers