
I have a dataframe called df1:

d = {'letter': ['R','V','Q','M','F','K'],
     'info1': ['K:2.3','T:1.3','L:2.4','B:7.4','S:2.3','K:4.4'],
     'info2': ['R:3.2','N:2.1','B:0.3','T:0.9','J:0.003','S:1.223'],
     'info3': ['X:45','V:32.4','H:0.04','M:3.34','T:2.2','T:3.456'],
     'info4': ['A:1.7','Z:1.2345','T:9.5','O:3,2','J:3.334','G:345']}

df1 = pd.DataFrame(d)

df1:

  letter info1  info2   info3   info4
0   R   K:2.3   R:3.2   X:45    A:1.7
1   V   T:1.3   N:2.1   V:32.4  Z:1.2345
2   Q   L:2.4   B:0.3   H:0.04  T:9.5
3   M   B:7.4   T:0.9   M:3.34  O:3,2
4   F   S:2.3   J:0.003 T:2.2   J:3.334
5   K   K:4.4   S:1.223 T:3.456 G:345

I want to partially match the strings in column "letter" with the occurrence of that string on the same row and place the match into a new column. If there is no match on the same row then I just want to put NaN.

Desired output:

    letter   info1   info2   info3   info4     new
0      R     K:2.3   R:3.2   X:45    A:1.7     R:3.2
1      V     T:1.3   N:2.1   V:32.4  Z:1.2345  V:32.4
2      Q     L:2.4   B:0.3   H:0.04  T:9.5     NaN
3      M     B:7.4   T:0.9   M:3.34  O:3,2     M:3.34
4      F     S:2.3   J:0.003 T:2.2   J:3.334   NaN
5      K     K:4.4   S:1.223 T:3.456 G:345     K:4.4 

I initially tried creating a mask but that did not work.

df1['new'] = df1.drop("letter", 1).isin(df1["letter"]).any(1)

Any ideas would be great.

Chip

4 Answers


Stack your data into a long, single-column format so you can act column-on-column. Then subset to the positions where the first character of the value matches the letter in the index.

>>> s = df1.set_index('letter').stack()
>>> s[s.index.get_level_values(0) == s.str[0]]
letter       
R       info2     R:3.2
V       info3    V:32.4
M       info3    M:3.34
K       info1     K:4.4
dtype: object

Map this against the original data frame:

>>> s1 = _  # i.e., the filtered Series from the step immediately above
>>> df1['letter'].map(s1.reset_index(level=1, drop=True))
0     R:3.2
1    V:32.4
2       NaN
3    M:3.34
4       NaN
5     K:4.4
Name: letter, dtype: object

Then assign that to your data frame as df1['new']. Note that map will raise an error if a letter has more than one match across your info# columns, because the mapping Series then has duplicate index values. In my view, pandas' relative willingness to throw errors (especially compared to R or SAS) is really good for avoiding silent data bugs. Regardless, if duplicates are an issue, just drop them first.
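The steps above, combined into one runnable sketch against the sample frame from the question:

```python
import pandas as pd

# Sample frame from the question.
d = {'letter': ['R','V','Q','M','F','K'],
     'info1': ['K:2.3','T:1.3','L:2.4','B:7.4','S:2.3','K:4.4'],
     'info2': ['R:3.2','N:2.1','B:0.3','T:0.9','J:0.003','S:1.223'],
     'info3': ['X:45','V:32.4','H:0.04','M:3.34','T:2.2','T:3.456'],
     'info4': ['A:1.7','Z:1.2345','T:9.5','O:3,2','J:3.334','G:345']}
df1 = pd.DataFrame(d)

# Stack into long format: one (letter, infoN) -> value entry per cell.
s = df1.set_index('letter').stack()

# Keep only entries whose value starts with the row's letter.
s1 = s[s.index.get_level_values(0) == s.str[0]]

# Drop the column level, then map letter -> matched value back onto df1;
# letters with no match come back as NaN.
df1['new'] = df1['letter'].map(s1.reset_index(level=1, drop=True))
```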

ifly6

Here's a one-line solution: mask each row down to the values whose first character matches the letter, then back-fill across the row to pull out the first non-null value:

df1['new'] = df1.apply(lambda row: row.iloc[1:][row.iloc[1:].str.match(row['letter'])], axis=1).bfill(axis=1).iloc[:, 0]

0     R:3.2
1    V:32.4
2       NaN
3    M:3.34
4       NaN
5     K:4.4
smci

Here is another way:

df1['new'] = (df1.loc[:, 'info1':]
                 .where(df1.loc[:, 'info1':]
                           .applymap(lambda x: x[0])
                           .eq(df1['letter'], axis=0))
                 .ffill(axis=1).iloc[:, -1])
rhug123

This is what I would do:

df1['new'] = np.nan
for col in df1.columns.tolist()[1:-1]:  # the info columns; skips 'letter' and the just-created 'new'
    df1.loc[df1[col].str[0] == df1['letter'], 'new'] = df1[col]
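Run against the sample frame from the question, the loop fills the new column as desired. A minimal, self-contained check (note np.nan rather than the removed np.NaN alias):

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df1 = pd.DataFrame({'letter': ['R','V','Q','M','F','K'],
                    'info1': ['K:2.3','T:1.3','L:2.4','B:7.4','S:2.3','K:4.4'],
                    'info2': ['R:3.2','N:2.1','B:0.3','T:0.9','J:0.003','S:1.223'],
                    'info3': ['X:45','V:32.4','H:0.04','M:3.34','T:2.2','T:3.456'],
                    'info4': ['A:1.7','Z:1.2345','T:9.5','O:3,2','J:3.334','G:345']})

# Start with all-NaN, then overwrite wherever a column's first
# character equals the row's letter.
df1['new'] = np.nan
for col in df1.columns.tolist()[1:-1]:  # info1..info4
    df1.loc[df1[col].str[0] == df1['letter'], 'new'] = df1[col]
```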
Aryerez