
I need to search multiple columns for a specific value and, when it is found, return it in a new column.

I got an answer here: Python: find string in multiple columns and return it in new column, yet the line below returns only the first value found:

df['b'] = (df[cols].where(df[cols].stack().str.contains('b')
         .unstack(fill_value=False)).ffill(1).iloc[:,-1])

where cols is

cols = ['col1', 'col2', 'col3', 'col4']

I tried the other answers and they all gave me the error ValueError: cannot reindex from a duplicate axis.

Does anyone have an idea how I can get all the matching values in one cell?

The data set:

ID   col0  col1  col2  col3  col4  col5
1    jack  a/h   t/m   w/n   y/h   56
2    sam   z/n   b/w   null  b/n   93
3    john  b/i   y/d   b/d   null  33

The code I'm using now:

df['b'] = (df[cols].where(df[cols].stack().str.contains('b')
         .unstack(fill_value=False)).ffill(1).iloc[:,-1])

And here is the output I am getting now:

ID   col0  col1  col2  col3  col4  col5  b
1    jack  a/h   t/m   w/n   y/h   56    -
2    sam   z/n   b/w   null  b/n   93    b/w
3    john  b/i   y/d   b/d   null  33    b/i

And actually I want the output to look like the data frame below:

ID   col0  col1  col2  col3  col4  col5  b
1    jack  a/h   t/m   w/n   y/h   56    null
2    sam   z/n   b/w   null  b/n   93    b/w - b/n
3    john  b/i   y/d   b/d   null  33    b/i - b/d
SMO
    please provide input and expected output: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – David Erickson Jul 16 '20 at 06:34

2 Answers

Use DataFrame.filter to select the columns col1-col4, reshape them into a Series with DataFrame.stack, filter the stacked values with Series.str.contains, and finally use Series.groupby on level=0 to aggregate each row's matches with ' - '.join:

s = df.filter(regex=r'col[1-4]').stack()
s = s[s.str.contains('b')].groupby(level=0).agg(' - '.join)
df['b'] = s

Result:

# print(df)

   ID  col0 col1 col2 col3 col4  col5          b
0   1  jack  a/h  t/m  w/n  y/h    56        NaN
1   2   sam  z/n  b/w  NaN  b/n    93  b/w - b/n
2   3  john  b/i  y/d  b/d  NaN    33  b/i - b/d
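To verify, here is a self-contained run on the sample data from the question (the frame is rebuilt by hand from the post, and `na=False` is added as a guard in case a pandas version keeps NaN cells when stacking):

```python
import pandas as pd

# Rebuild the question's sample frame (values copied from the post).
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "col0": ["jack", "sam", "john"],
    "col1": ["a/h", "z/n", "b/i"],
    "col2": ["t/m", "b/w", "y/d"],
    "col3": ["w/n", None, "b/d"],
    "col4": ["y/h", "b/n", None],
    "col5": [56, 93, 33],
})

# Stack col1-col4 into one Series, keep the cells containing 'b',
# then join each row's matches with ' - '.
s = df.filter(regex=r"col[1-4]").stack()
df["b"] = s[s.str.contains("b", na=False)].groupby(level=0).agg(" - ".join)
```

Row 0 has no match and stays NaN; rows 1 and 2 get `b/w - b/n` and `b/i - b/d`, matching the result above.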
Shubham Sharma
  • Hi @Shubham, this code misses half of the records. The current result is ` ID col0 col1 col2 col3 col4 col5 b 0 1 jack a/h t/m w/n y/h 56 NaN 1 2 sam z/n b/w NaN b/n 93 b/w - b/n 2 3 john b/i y/d b/d NaN 33 NaN` – SMO Jul 19 '20 at 09:11
  • @SMO Can you please elaborate, exactly what problem are you facing? – Shubham Sharma Jul 19 '20 at 13:53
  • The result I am getting now is missing info in the new column: it shows an empty cell in column b, row 2. That means even if there is a b in col1-4, we do not always get it in the new column "b" – SMO Jul 20 '20 at 05:02
  • Are you specifically talking about the example you have provided in the question, or some other dataset? Because I've just checked that the code produces proper results on the example you provided. – Shubham Sharma Jul 20 '20 at 05:37
  • My original df has more than 35 columns to be searched for the keyword, and when I tried the code on it, it kept missing some columns although everything seems identical (spelling and spaces). Do you have any idea why? – SMO Jul 20 '20 at 07:01
  • Yes, I did. This is my line: `df.filter(regex=r'Operation/Summary/[0-35]')`; 0-35 is within the name of the operation, not the index. Could that be the problem? – SMO Jul 20 '20 at 07:49
  • Can you elaborate? I could not understand this – SMO Jul 20 '20 at 08:55
  • @SMO We have to use a more advanced regex pattern to match the desired columns, so use `df.filter(regex=r'Operation/Summary/(?:[0-9]|[0-2][0-9]?|3[0-5])$')` instead. – Shubham Sharma Jul 20 '20 at 09:08
  • It worked! Thanks. Do you have any documentation for your approach, or a link that explains it? I need to understand how it worked, and how to change it if the number of columns changes. **Thanks again** – SMO Jul 21 '20 at 08:37
  • @SMO It's just a regex pattern; in case your columns change in the future, you have to update the pattern. Here is [the link which explains the pattern](https://regex101.com/r/h8FKmO/1) that I've used here; take a look at it. – Shubham Sharma Jul 21 '20 at 08:48
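The pattern from this comment thread can be sanity-checked in isolation. The sketch below (column names `Operation/Summary/N` are taken from the discussion) confirms it accepts exactly the suffixes 0 through 35:

```python
import re

# Alternation: a single digit 0-9, two digits 00-29, or 30-35; the '$'
# anchor rejects longer suffixes such as 'Operation/Summary/36'.
pattern = re.compile(r"Operation/Summary/(?:[0-9]|[0-2][0-9]?|3[0-5])$")

matches = [n for n in range(40) if pattern.search(f"Operation/Summary/{n}")]
```

`matches` comes out as 0 through 35, which is why `df.filter(regex=...)` with this pattern picks up all 36 columns, while the original character class `[0-35]` only matched the single characters 0, 1, 2, 3 and 5.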

Here's a way to do it with melt:

t = df.melt(id_vars="ID", value_vars=["col1", "col2", "col3", "col4"]).dropna()

t = t[t.value.str.contains("b")].groupby("ID")["value"].apply(" - ".join)

res = df.merge(t.rename("b"), left_on="ID", right_index=True, how="left")

Merging on the ID values keeps the rows aligned, since the grouped result is indexed by ID rather than by position. The output is:

   ID  col0 col1 col2 col3 col4  col5          b
0   1  jack  a/h  t/m  w/n  y/h    56        NaN
1   2   sam  z/n  b/w  NaN  b/n    93  b/w - b/n
2   3  john  b/i  y/d  b/d  NaN    33  b/i - b/d
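On the sample data from the question, the melt route runs end-to-end like this (the frame is rebuilt by hand for the check):

```python
import pandas as pd

# Rebuild the question's sample frame.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "col0": ["jack", "sam", "john"],
    "col1": ["a/h", "z/n", "b/i"],
    "col2": ["t/m", "b/w", "y/d"],
    "col3": ["w/n", None, "b/d"],
    "col4": ["y/h", "b/n", None],
    "col5": [56, 93, 33],
})

# Long format: one row per (ID, cell), with NaN cells dropped.
t = df.melt(id_vars="ID", value_vars=["col1", "col2", "col3", "col4"]).dropna()
# Keep the cells containing 'b' and join them per ID.
t = t[t.value.str.contains("b")].groupby("ID")["value"].apply(" - ".join)
# Merge on the ID values so the rows line up regardless of index.
res = df.merge(t.rename("b"), left_on="ID", right_index=True, how="left")
```

After the groupby, `t` is indexed by ID (`{2: 'b/w - b/n', 3: 'b/i - b/d'}`), which is why merging on ID, rather than concatenating on the positional index, is what keeps the rows aligned.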
Roy2012