0

I am trying to use Python to find matches to a substring in multiple columns of a dataframe, and copy the entire string, if substring is found, to a new column.

The data strings are extracted from comma-separated strings in another df. So there are varying numbers of strings across each row. The string in column A may or may not be the one I want to copy. If it isn't, the string in column B will be. Some rows include data in columns D and E, but we don't have to use those. (In the real world, these are website urls and I'm trying to gather only the ones from a specific domain, which might be the first one, or the second one on the row. I used simpler strings for the example.) I am trying to use np.where, but I am not getting consistent results, particularly if the correct string is in column A but not repeated in column B. Np.where appears to only apply the "y" and never the "x". I've also tried variations on if/where in loops without good results.

import pandas as pd 

df = pd.DataFrame({"A": ["blue lorry", "yellow cycle", "red car", "blue lorry", "red truck", "red bike", "blue jeep", "yellow skate", "red bus"], "B": ["red train", "red cart", "red car", "red moto",'', "red bike", "red diesel", "red carriage",''], "C": ['','','', "red moto",'', "red bike", "red diesel", "red carriage",''], "D": ['','','', "red moto",'', "red bike", '','','']})

This produces df:

    A               B               C               D
0   blue lorry      red train       
1   yellow cycle    red cart        
2   red car         red car         
3   blue lorry      red moto        red moto        red moto
4   red truck           
5   red bike        red bike        red bike        red bike
6   blue jeep       red diesel      red diesel  
7   yellow skate    red carriage    red carriage    
8   red bus                             

When I run:
df['Red'] = np.where("red" in df['A'], df['A'], df['B'])

It returns:

    A               B               C               D               Red
0   blue lorry      red train                                       red train
1   yellow cycle    red cart                                        red cart
2   red car         red car                                         red car
3   blue lorry      red moto        red moto        red moto        red moto
4   red truck               
5   red bike        red bike        red bike        red bike        red bike
6   blue jeep       red diesel      red diesel                      red diesel
7   yellow skate    red carriage    red carriage                    red carriage
8   red bus                 

The column Red values for lines 4 and 8 are missing, when I expected it to copy the (correct) strings from column A.
I understand the basic structure is: numpy.where(condition, x, y)
I tried to apply code so the condition is to look for "red" and copy the string in column A if "red" is found, or the string in column B if it isn't. But it seems I'm only getting the column B string. Any help is appreciated.

Obviously I'm new here. I gleaned some help for np.where from these topics, but I think there are some differences between using numeric values and strings, and my multiple columns:
np.where Not Working in my Pandas
Efficiently replace values from a column to another column Pandas DataFrame
Update Value in one column, if string in other column contains something in list

  • try using `contains`, like this: `df['Red'] = np.where(df['A'].str.contains('red'), df['A'], df['B'])` – Terry Jun 11 '19 at 18:10
  • Yes! That works as intended. I'm still not sure why it choked on "in", but str.contains did the trick. Thanks! – Pipercollins Jun 11 '19 at 19:13

1 Answers1

1

str.contains works where "in" condition did not. Correct code is:

df['Red'] = np.where(df['A'].str.contains('red'), df['A'], df['B'])   

Thanks to Terry!