Suppose the following dataframe:

   c1  c2
0  0   "phrase_x 12"
1  0   "phrase_y 34"
2  1   "phrase_x 56"  
3  1   "phrase_y 78"  

I want to make four new columns, c3-c6, based on conditions from c1 and c2, such that:

  • if c1 == 0 and c2 contains phrase_x, c3 is filled with data from c2, split with a separator string
  • if c1 == 0 and c2 contains phrase_y, c4 is filled with data from c2, split with a separator string
  • if c1 == 1 and c2 contains phrase_x, c5 is filled with data from c2, split with a separator string
  • if c1 == 1 and c2 contains phrase_y, c6 is filled with data from c2, split with a separator string
  • The empty cells may be filled with NaN

Something like this:

   c1  c2             c3    c4    c5    c6
0  0   "phrase_x 12"  12    NaN   NaN   NaN
1  0   "phrase_y 34"  NaN   34    NaN   NaN
2  1   "phrase_x 56"  NaN   NaN   56    NaN
3  1   "phrase_y 78"  NaN   NaN   NaN   78

I have looked at this question, which helped me further a bit, but only if I want to use 1 condition, like this:

df.loc[df['c2'].str.contains("phrase_x"), 'c3'] = df['c2'].str.split('-> ').str[1]
df.loc[df['c2'].str.contains("phrase_y"), 'c4'] = df['c2'].str.split('-> ').str[1]

This code produces

   c1  c2             c3    c4    
0  0   "phrase_x 12"  12    NaN
1  0   "phrase_y 34"  NaN   34 
2  1   "phrase_x 56"  56    NaN
3  1   "phrase_y 78"  NaN   78   

Is there a way to use loc with multiple conditions? It seems to me that a possibility is to first use the above 2 statements, and then copy the data over based on if c1 is 0 or 1, like this:

df.loc[df['c1'] == 1, 'c5'] = df['c3']
df.loc[df['c1'] == 1, 'c6'] = df['c4']

However this seems tedious, and you would also need to delete the data in the previous columns.

I also tried the answer below the one mentioned above, which uses np.where():

df['c3'] = df.where(df['c1'] == 0 & df['c2'].str.contains("phrase_x"), df['c2'].str.split('-> ').str[1], axis = 0)

However this only leads to more errors, e.g.

ValueError: Cannot set a DataFrame with multiple columns to the single column Budget Outlet 1

3 Answers

Everything will be easier if you first split c2 into two columns and then apply the conditions. That makes your code simpler. Here is what I mean:

Data:

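import numpy as np
import pandas as pd

# c2 already split into a label ('x'/'y') and a numeric value
df = pd.DataFrame({'c1': [0, 0, 1, 1],
                   'c2': ['x', 'y', 'x', 'y'],
                   'val': [12, 34, 56, 78]})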

A simple loop with zip:

# one new column per (c1, c2) condition pair
for i, j, k in zip(df['c1'].tolist(), df['c2'].tolist(), [3, 4, 5, 6]):
    df[f'c{k}'] = np.where((df['c1'] == i) & (df['c2'] == j), df['val'], np.nan)

I don't know what your actual data looks like, so I use df['c1'].tolist() and df['c2'].tolist() here. You should not rely on this, though; instead, build the list of (c1, c2) condition pairs you actually want.

This should give you the desired output:

   c1 c2  val    c3    c4    c5    c6
0   0  x   12  12.0   NaN   NaN   NaN
1   0  y   34   NaN  34.0   NaN   NaN
2   1  x   56   NaN   NaN  56.0   NaN
3   1  y   78   NaN   NaN   NaN  78.0
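The answer starts from a frame where c2 is already split, so here is a minimal sketch of one way to get there from the question's sample data. It assumes the label and the value are separated by whitespace, as in the sample (the asker's real data appears to use '-> ' instead), and the names `raw` and `parts` are only illustrative.

```python
import pandas as pd

# Sample frame from the question (assumed layout: "<label> <number>").
raw = pd.DataFrame({'c1': [0, 0, 1, 1],
                    'c2': ['phrase_x 12', 'phrase_y 34',
                           'phrase_x 56', 'phrase_y 78']})

# Split c2 on whitespace; expand=True returns one column per piece.
parts = raw['c2'].str.split(expand=True)

# Rebuild the frame in the shape used above: c1, short label, numeric value.
df = pd.DataFrame({'c1': raw['c1'],
                   'c2': parts[0].str.replace('phrase_', '', regex=False),
                   'val': parts[1].astype(int)})
```

If the real separator is a fixed string such as '-> ', passing it as the first argument to `str.split` should work the same way.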
PTQuoc
  • Could you clarify what you are doing in the for loop? Isn't loopwise iteration over a dataframe generally a bad idea? – MrTomatosoup May 19 '23 at 13:07
  • @MrTomatosoup: this is not iterating over rows. I iterate over your conditions: you have 4 different conditions, and each one creates a new column – PTQuoc May 20 '23 at 08:21

Change your solution to chain both masks with & for bitwise AND:

m1 = df['c2'].str.contains("phrase_x")
m2 = df['c2'].str.contains("phrase_y")

m3 = df['c1'].eq(0)
m4 = df['c1'].eq(1)

s = df['c2'].str.split().str[1]

df.loc[m1 & m3, 'c3'] = s
df.loc[m2 & m3, 'c4'] = s

df.loc[m1 & m4, 'c5'] = s
df.loc[m2 & m4, 'c6'] = s

print(df)
   c1           c2   c3   c4   c5   c6
0   0  phrase_x 12   12  NaN  NaN  NaN
1   0  phrase_y 34  NaN   34  NaN  NaN
2   1  phrase_x 56  NaN  NaN   56  NaN
3   1  phrase_y 78  NaN  NaN  NaN   78
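For comparison, the np.where attempt in the question wrote `df['c1'] == 0 & ...` without parentheses. In Python, `&` binds more tightly than `==`, so each comparison needs its own parentheses unless you use `.eq()` as above. The following is a minimal sketch on the sample frame, assuming whitespace-separated values; the names `has_x`, `both`, `both_eq` and `value` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'c1': [0, 0, 1, 1],
                   'c2': ['phrase_x 12', 'phrase_y 34',
                          'phrase_x 56', 'phrase_y 78']})

has_x = df['c2'].str.contains('phrase_x')

# Parenthesise the comparison: `&` has higher precedence than `==`.
both = (df['c1'] == 0) & has_x

# Equivalent spelling with .eq(), which needs no extra parentheses.
both_eq = df['c1'].eq(0) & has_x

value = df['c2'].str.split().str[1]

# Series.where keeps values where the mask is True and puts NaN elsewhere.
c3 = value.where(both)
```

Here `c3` is a single Series, so it can be assigned to one column directly, unlike the result of calling `where` on the whole DataFrame.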

Another idea is to create tuples for a dynamic solution:

tups = [('phrase_x',0,'c3'),
        ('phrase_y',0,'c4'),
        ('phrase_x',1,'c5'),
        ('phrase_y',1,'c6')]

s = df['c2'].str.split().str[1]

for (c2, c1, cols) in tups:
    df.loc[df['c2'].str.contains(c2) & df['c1'].eq(c1), cols] = s

print(df)
   c1           c2   c3   c4   c5   c6
0   0  phrase_x 12   12  NaN  NaN  NaN
1   0  phrase_y 34  NaN   34  NaN  NaN
2   1  phrase_x 56  NaN  NaN   56  NaN
3   1  phrase_y 78  NaN  NaN  NaN   78

A general solution, if you need to assign each row to its own new column, is:

df1 = df.join(df.assign(cols = [f'c{x+3}' for x in range(len(df))],
               vals = df['c2'].str.split().str[1] )
             .pivot(columns='cols', values='vals'))
print (df1)
   c1           c2   c3   c4   c5   c6
0   0  phrase_x 12   12  NaN  NaN  NaN
1   0  phrase_y 34  NaN   34  NaN  NaN
2   1  phrase_x 56  NaN  NaN   56  NaN
3   1  phrase_y 78  NaN  NaN  NaN   78
jezrael
  • Thanks! I found this the most helpful answer, especially the first part, really clear layout of the conditions. However, I am not quite sure what you mean by chaining both masks with `&` for bitwise `AND`. Could you clarify? – MrTomatosoup May 19 '23 at 13:16
  • @MrTomatosoup - Sure, it means applying one condition and another one together with the AND operator; since we are working with arrays here, `&` is used for bitwise AND. – jezrael May 22 '23 at 05:05

Splitting column c2 into two columns, as suggested in another answer above, is a good idea. Here is improved code you can use.

Generate the unique combinations of the two:

l1 = df['c1'].unique()
l2 = df['c2'].unique()
combin = [(v1, v2) for v1 in l1 for v2 in l2]

Then loop over the combinations as above:

for k, (i, j) in enumerate(combin):
    df[f'c{k+3}'] = np.where((df['c1'] == i) & (df['c2'] == j), df['val'], np.nan)

This should give your expected result.
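Put together, the two steps can be sketched as a self-contained script. It assumes the split frame with a `val` column from the earlier answer. One thing worth knowing: `unique()` returns values in order of first appearance, so which combination ends up in which column depends on the row order of your data.

```python
import numpy as np
import pandas as pd

# Assumed input: c2 already split into a label and a numeric value.
df = pd.DataFrame({'c1': [0, 0, 1, 1],
                   'c2': ['x', 'y', 'x', 'y'],
                   'val': [12, 34, 56, 78]})

# All (c1, c2) pairs, in order of first appearance in each column.
combin = [(v1, v2) for v1 in df['c1'].unique() for v2 in df['c2'].unique()]

# One new column per pair, numbered from c3; the loop runs over the
# 4 conditions, not over the rows of the frame.
for k, (i, j) in enumerate(combin, start=3):
    df[f'c{k}'] = np.where((df['c1'] == i) & (df['c2'] == j),
                           df['val'], np.nan)
```

Each `np.where` call is vectorised over the whole column, so the cost grows with the number of conditions rather than the number of rows.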

PTQuoc
  • Wouldn't loopwise iteration over the full dataframe be a bad idea? (Following https://stackoverflow.com/a/55557758/21420031) – MrTomatosoup May 19 '23 at 13:04