3

I have a DF as shown below:

DF =
id  token      argument1             argument2 
1   Tza        Tuvia Tza             Moscow  
2   perugia    umbria                perugia    
3   associated the associated press  Nelson

I now want to compare the values of the columns argumentX and token and choose the value for the new column ARG accordingly.

DF =
id  token      argument1             argument2    ARG
1   Tza        Tuvia Tza             Moscow       ARG1
2   perugia    umbria                perugia      ARG2
3   associated the associated press  Nelson       ARG1

Here is what I tried:

conditions = [
(DF["token"] == (DF["Argument1"])),
 DF["token"] == (DF["Argument2"])]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=nan)

This only compares the entire String and matches if they are identical. Constructions such as .isin, .contains or using a helper column such as DF["ARG_cat"] = DF.apply(lambda row: row['token'] in row['argument2'],axis=1) did not work. Any ideas?

anothernode
  • 5,100
  • 13
  • 43
  • 62
Mi.
  • 510
  • 1
  • 4
  • 20

2 Answers2

4

Use str.contains with regex - join all values in token by | for regex OR for check substrings with word boundary:

pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in DF["token"])
conditions = [ DF["argument1"].str.contains(pat), DF["argument2"].str.contains(pat)]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=np.nan)
print (DF)
   id       token            argument1 argument2   ARG
0   1         Tza            Tuvia Tza    Moscow  ARG1
1   2     perugia               umbria   perugia  ARG2
2   3  associated  the associated ress    Nelson  ARG1

EDIT:

If want compare each row:

d = {'id': [1, 2, 3], 
     'token': ["Tza","perugia","israel"], 
     "argument1": ["Tuvia Tza","umbria","Tuvia Tza"], 
     "argument2": ["israel","perugia","israel"]} 
DF = pd.DataFrame(data=d) 
print (DF)
   id    token  argument1 argument2
0   1      Tza  Tuvia Tza    israel
1   2  perugia     umbria   perugia
2   3   israel  Tuvia Tza    israel

conditions = [[x[0] in x[1] for x in zip(DF['token'], DF['argument1'])], 
              [x[0] in x[1] for x in zip(DF['token'], DF['argument2'])]]

choices = ["ARG1", "ARG2"]

DF["ARG"] = np.select(conditions, choices, default=np.nan)
print (DF)
   id    token  argument1 argument2   ARG
0   1      Tza  Tuvia Tza    israel  ARG1
1   2  perugia     umbria   perugia  ARG2
2   3   israel  Tuvia Tza    israel  ARG2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This might work but I have to escape some characters because I get the really annoying python regex bug "sre_constants.error: nothing to repeat" as illustrated here https://stackoverflow.com/questions/3675144/regex-error-nothing-to-repeat – Mi. Aug 14 '18 at 08:16
  • I couldn't test it yet but just thinking of the regex logic, doesn't this always return "ARG1"? – Mi. Aug 14 '18 at 08:52
  • @ThelMi - Why do you think? For second is matched `perugia`, so `ARG2` – jezrael Aug 14 '18 at 08:53
  • Yeah in that case but imagine a DF like: d = {'id': [1, 2, 3], 'token': ["Tza","perugia","israel"], "argument1": ["Tuvia Tza","umbria","Tuvia Tza"], "argument2": ["israel","perugia","israel"]} DF = pd.DataFrame(data=d) – Mi. Aug 14 '18 at 08:59
  • @ThelMi - Not sure, do you expected `ARG1` 3 times? – jezrael Aug 14 '18 at 09:06
  • Expected: ARG1,ARG2,ARG2. Got: ARG1,ARG2,ARG1 – Mi. Aug 14 '18 at 09:15
  • @ThelMi - I see it, give me a sec. – jezrael Aug 14 '18 at 09:16
2

Get boolean index

argument_cols = ['argument1', 'argument2']
boolean_idx = DF[argument_cols].apply(
    lambda arg_column: DF['token'].combine(arg_column, lambda token, arg: token in arg)
)

boolean_idx
Out:
id  argument1   argument2
0   True    False
1   False   True
2   True    False

Select values from rows:

selected_vals = DF[argument_cols][boolean_idx]

selected_vals
Out: 
id            argument1  argument2
0             Tuvia Tza        NaN
1                   NaN    perugia
2  the associated press        NaN

Stack selected_vals and get index level which contains argument names (this code will fail if there's more than one column containing True value in a row):

argument_index_level = selected_vals.stack().index.get_level_values(-1)
# Index(['argument1', 'argument2', 'argument1'], dtype='object')

DF['ARG'] = argument_index_level

DF 
Out: 
id             argument1        argument2        token        ARG
0              Tuvia Tza           Moscow          Tza  argument1
1                 umbria          perugia      perugia  argument2
2   the associated press           Nelson   associated  argument1

You can change values in 'ARG' column with apply().

Mikhail Stepanov
  • 3,680
  • 3
  • 23
  • 24