1

I have a column data as follows:

abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|
abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|

I want to search and extract specific keywords within the frame and extract only that data with in the separators

Specific keywords are

enst.35
enst.18
enst.98
enst.63

The expected output is

abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA

I tried this herebut was not working effectively

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53

3 Answers3

1

You can construct a regex pattern using the given keywords then use Series.str.findall to find all occurrences of regex in series:

import re

keywords = ['enst.35','enst.18','enst.98','enst.63']
pattern = '|'.join([rf'[^,]+{re.escape(k)}[^,]+'for k in keywords])
result = df['col'].str.findall(pattern).str.get(0)

#print(result)

0    abc|framex|gtk4|enst.35|pxc|h5g|
1     abc|frbx|hgk4|enst.18|pif|homg|
2        abc|frame|gtk|enst.98|pc|hg|
3                                 NaN
Name: col, dtype: object

You can test the regex pattern here

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
0

You can try in bashscript as follows:

for STRING in enst.35 enst.18 enst.98 enst.63; do
  tr \, \\n < file.txt | grep "$STRING" || echo NA
done
0

With str.extract and str.split(',) to take the last comma separated value:

df['Data2'] = df['Data'].str.extract('(^.*enst.35\|.+?\|.+?\||^.*enst.18\|.+?\|.+?\||^.*enst.98\|.+?\|.+?\||^.*enst.63\|.+?\|.+?\||)', expand=False).str.split(',').str[-1]

You could create a list of keywords and do list comprehension as well per another answer.

David Erickson
  • 16,433
  • 2
  • 19
  • 35