-1

I have a column data as follows:

abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|
abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|

I want to search and extract specific keywords within the frame and extract only that data with in the separators

Specific keywords are

enst.35
enst.18
enst.98
enst.63

The expected output is

abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA

If match is not found fill with NA in the output columns. There can be multiple occurance of id in the same column, but I want to consider only the first occurance. I tried this herebut was not working effectively. Can we do this with bash script

3 Answers3

5

Could you please try following, written and tested in shown samples. Mention all values in variable values_to_be_searched which you want to search in Input_file with , delimiter.

awk -v values_to_be_searched="enst.35,enst.18,enst.98,enst.63" '
BEGIN{
  FS=","
  num=split(values_to_be_searched,array,",")
  for(i=1;i<=num;i++){
    values[array[i]]
  }
}
{
  found=""
  for(i=1;i<=NF;i++){
    for(k in values){
      if(match($i,k)){
        print $i
        found=1
        break
      }
    }
  }
  if(found==""){
    print "NA"
  }
}
'  Input_file

Explanation: Adding detailed explanation for above code.

awk -v values_to_be_searched="enst.35,enst.18,enst.98,enst.63" '  ##Creating variable values_to_be_searched which has all the values to be searched in it.
BEGIN{                                            ##Starting BEGIN section of this code from here.
  FS=","                                          ##Setting field separator as comma here.
  num=split(values_to_be_searched,array,",")      ##Splitting variable values_to_be_searched into an array here with delimiter comma.
  for(i=1;i<=num;i++){                            ##Running a for loop till value of nu here.
    values[array[i]]                              ##Creating array values which has index as value of array which are the keywords to be searched in Input_file.
  }
}
{
  found=""                                        ##Nullifying found here.
  for(i=1;i<=NF;i++){                             ##Running a for loop till NF here.
    for(k in values){                             ##Traversing through values array here.
      if(match($i,k)){                            ##If match of value k found in current field then do following.
        print $i                                  ##Printing current field here, looks like a match of keyword is found in current field.
        found=1                                   ##Setting found as 1 here.
        break                                     ##Using break to come out of loop and save some cycles of for loop here.
      }
    }
  }
  if(found==""){                                  ##Checking condition if found is NOT SET then do following.
    print "NA"                                    ##Printing NA here.
  }
}
' Input_file                                      ##Mentioning Input_file name here.
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
4

since pandas is tagged, You can try with str.split followed by explode and then str.contains + reindex for NaN in missing rows

keywords = ['enst.35','enst.18','enst.98','enst.63']

s = df['Column'].str.split(',').explode()
s[s.str.contains('|'.join(keywords))].reindex(df.index)

0    abc|framex|gtk4|enst.35|pxc|h5g|
1     abc|frbx|hgk4|enst.18|pif|homg|
2        abc|frame|gtk|enst.98|pc|hg|
3                                 NaN
Name: Column, dtype: object

Note: Replace Column in the code with original column name.

anky
  • 74,114
  • 11
  • 41
  • 70
1

Another way:

for STRING in enst.35 enst.18 enst.98 enst.63; do
  tr \, \\n < file.txt | grep "$STRING" || echo NA
done

Output results in:

abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA
mtnezm
  • 1,009
  • 1
  • 7
  • 19