I am trying to use Python to find matches to a substring in multiple columns of a dataframe, and copy the entire string, if substring is found, to a new column.
The data strings are extracted from comma-separated strings in another df. So there are varying numbers of strings across each row. The string in column A may or may not be the one I want to copy. If it isn't, the string in column B will be. Some rows include data in columns D and E, but we don't have to use those. (In the real world, these are website urls and I'm trying to gather only the ones from a specific domain, which might be the first one, or the second one on the row. I used simpler strings for the example.) I am trying to use np.where, but I am not getting consistent results, particularly if the correct string is in column A but not repeated in column B. Np.where appears to only apply the "y" and never the "x". I've also tried variations on if/where in loops without good results.
import pandas as pd
df = pd.DataFrame({"A": ["blue lorry", "yellow cycle", "red car", "blue lorry", "red truck", "red bike", "blue jeep", "yellow skate", "red bus"], "B": ["red train", "red cart", "red car", "red moto",'', "red bike", "red diesel", "red carriage",''], "C": ['','','', "red moto",'', "red bike", "red diesel", "red carriage",''], "D": ['','','', "red moto",'', "red bike", '','','']})
This produces df:
A B C D
0 blue lorry red train
1 yellow cycle red cart
2 red car red car
3 blue lorry red moto red moto red moto
4 red truck
5 red bike red bike red bike red bike
6 blue jeep red diesel red diesel
7 yellow skate red carriage red carriage
8 red bus
When I run:
df['Red'] = np.where("red" in df['A'], df['A'], df['B'])
It returns:
A B C D Red
0 blue lorry red train red train
1 yellow cycle red cart red cart
2 red car red car red car
3 blue lorry red moto red moto red moto red moto
4 red truck
5 red bike red bike red bike red bike red bike
6 blue jeep red diesel red diesel red diesel
7 yellow skate red carriage red carriage red carriage
8 red bus
The column Red values for lines 4 and 8 are missing, when I expected it to copy the (correct) strings from column A.
I understand the basic structure is: numpy.where(condition, x, y)
I tried to apply code so the condition is to look for "red" and copy the string in column A if "red" is found, or the string in column B if it isn't. But it seems I'm only getting the column B string. Any help is appreciated.
Obviously I'm new here. I gleaned some help for np.where from these topics, but I think there are some differences between using numeric values and strings, and my multiple columns:
np.where Not Working in my Pandas
Efficiently replace values from a column to another column Pandas DataFrame
Update Value in one column, if string in other column contains something in list