Comparing (Sub)Strings of two columns in a DF

Question

I have a DF as shown below:

DF =
id  token      argument1             argument2 
1   Tza        Tuvia Tza             Moscow  
2   perugia    umbria                perugia    
3   associated the associated press  Nelson

I now want to compare the values of the columns argumentX and token and choose the value for the new column ARG accordingly.

DF =
id  token      argument1             argument2    ARG
1   Tza        Tuvia Tza             Moscow       ARG1
2   perugia    umbria                perugia      ARG2
3   associated the associated press  Nelson       ARG1

Here is what I tried:

conditions = [
(DF["token"] == (DF["Argument1"])),
 DF["token"] == (DF["Argument2"])]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=nan)

This only compares the entire String and matches if they are identical. Constructions such as .isin, .contains or using a helper column such as DF["ARG_cat"] = DF.apply(lambda row: row['token'] in row['argument2'],axis=1) did not work. Any ideas?

jezrael · Accepted Answer · 2018-08-14T09:21:34.117

4

Use str.contains with regex - join all values in token by | for regex OR for check substrings with word boundary:

pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in DF["token"])
conditions = [ DF["argument1"].str.contains(pat), DF["argument2"].str.contains(pat)]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=np.nan)
print (DF)
   id       token            argument1 argument2   ARG
0   1         Tza            Tuvia Tza    Moscow  ARG1
1   2     perugia               umbria   perugia  ARG2
2   3  associated  the associated ress    Nelson  ARG1

EDIT:

If want compare each row:

d = {'id': [1, 2, 3], 
     'token': ["Tza","perugia","israel"], 
     "argument1": ["Tuvia Tza","umbria","Tuvia Tza"], 
     "argument2": ["israel","perugia","israel"]} 
DF = pd.DataFrame(data=d) 
print (DF)
   id    token  argument1 argument2
0   1      Tza  Tuvia Tza    israel
1   2  perugia     umbria   perugia
2   3   israel  Tuvia Tza    israel

conditions = [[x[0] in x[1] for x in zip(DF['token'], DF['argument1'])], 
              [x[0] in x[1] for x in zip(DF['token'], DF['argument2'])]]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=np.nan)
print (DF)
   id    token  argument1 argument2   ARG
0   1      Tza  Tuvia Tza    israel  ARG1
1   2  perugia     umbria   perugia  ARG2
2   3   israel  Tuvia Tza    israel  ARG2

edited Aug 14 '18 at 09:21

answered Aug 14 '18 at 08:08

jezrael

822,522
95
1,334
1,252

This might work but I have to escape some characters because I get the really annoying python regex bug "sre_constants.error: nothing to repeat" as illustrated here https://stackoverflow.com/questions/3675144/regex-error-nothing-to-repeat – Mi. Aug 14 '18 at 08:16
I couldn't test it yet but just thinking of the regex logic, doesn't this always return "ARG1"? – Mi. Aug 14 '18 at 08:52
@ThelMi - Why do you think? For second is matched `perugia`, so `ARG2` – jezrael Aug 14 '18 at 08:53
Yeah in that case but imagine a DF like: d = {'id': [1, 2, 3], 'token': ["Tza","perugia","israel"], "argument1": ["Tuvia Tza","umbria","Tuvia Tza"], "argument2": ["israel","perugia","israel"]} DF = pd.DataFrame(data=d) – Mi. Aug 14 '18 at 08:59
@ThelMi - Not sure, do you expected `ARG1` 3 times? – jezrael Aug 14 '18 at 09:06
Expected: ARG1,ARG2,ARG2. Got: ARG1,ARG2,ARG1 – Mi. Aug 14 '18 at 09:15
@ThelMi - I see it, give me a sec. – jezrael Aug 14 '18 at 09:16

score 2 · Answer 2 · answered Aug 14 '18 at 09:42

Get boolean index

argument_cols = ['argument1', 'argument2']
boolean_idx = DF[argument_cols].apply(
    lambda arg_column: DF['token'].combine(arg_column, lambda token, arg: token in arg)
)

boolean_idx
Out:
id  argument1   argument2
0   True    False
1   False   True
2   True    False

Select values from rows:

selected_vals = DF[argument_cols][boolean_idx]

selected_vals
Out: 
id            argument1  argument2
0             Tuvia Tza        NaN
1                   NaN    perugia
2  the associated press        NaN

Stack selected_vals and get index level which contains argument names (this code will fail if there's more than one column containing True value in a row):

argument_index_level = selected_vals.stack().index.get_level_values(-1)
# Index(['argument1', 'argument2', 'argument1'], dtype='object')

DF['ARG'] = argument_index_level

DF 
Out: 
id             argument1        argument2        token        ARG
0              Tuvia Tza           Moscow          Tza  argument1
1                 umbria          perugia      perugia  argument2
2   the associated press           Nelson   associated  argument1

You can change values in 'ARG' column with apply().

Comparing (Sub)Strings of two columns in a DF

2 Answers2

Linked