1

I have a dataframe column, where I need to match a list of substrings provided to me. But I need to select only those substrings which occur at least 3 times in my dataframe. Once I get those substrings all I need to do is to find those rows which contain substrings and assign some values corresponding to the matched pattern.

here's what I have currently done.

#dataframe has 9000 rows
reg_ = {'a':0.03,'b':0.05 ... 'ak':0.004} #approx 2000 patterns
for reg_pat in reg_:
    count = 0
    eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
    if (eliter >= 3):
        training[reg_pat] = 0
        training.loc[training['concat'].str.contains(reg_pat), reg_pat] = reg_[reg_pat]       
        muCOls += 1

Is there somehow I can do the filter and assignemnt as a single expression.

Please note that I have asked a sub-question fo the same as a separate question. There I have asked only for finding the counts. Here I am curious to know if both the operations i.e. finding the counts and then using it to assign values can be combined.

Finding the count of a set of substrings in pandas dataframe

The dataframe column looks like this

  training['concat']

  0 svAxu$paxArWAn
  1 xvAxaSa$varRANi
  2 AxAna$xurbale
  3 go$BakwAH
  4 viXi$Bexena
  5 nIwi$kuSalaM
  6 lafkA$upamam
  7 yaSas$lipsoH
  8 kaSa$AGAwam
  9 hewumaw$uwwaram
  10 varRa$pUgAn

Please let me know If I need to delete this question and just modify the old question. I think these are 2 sepearte problems, and hence 2 separate questions.

Community
  • 1
  • 1
Amrith Krishna
  • 2,768
  • 3
  • 31
  • 65

1 Answers1

2

I use apply and numpy.where:

reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}

elites = dict()
for reg_pat in reg_:
  mask_eliter = training['concat'].apply(lambda x: reg_pat in x)
  if  mask_eliter.sum() >= 3:
      Val = reg_[reg_pat]
      training[reg_pat] = np.where(mask_eliter,Val, 0) 
      elites[reg_pat] = Val

print (training)
             concat       a
0    svAxu$paxArWAn  0.0005
1   xvAxaSa$varRANi  0.0005
2     AxAna$xurbale  0.0005
3         go$BakwAH  0.0005
4       viXi$Bexena  0.0005
5      nIwi$kuSalaM  0.0005
6      lafkA$upamam  0.0005
7      yaSas$lipsoH  0.0005
8       kaSa$AGAwam  0.0005
9   hewumaw$uwwaram  0.0005
10      varRa$pUgAn  0.0005

Another solution with concat:

reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}

elites = dict()
#first add column 'concat'
df_val = [training['concat']]
for reg_pat in reg_:
  mask_eliter = training['concat'].apply(lambda x: reg_pat in x)
  if  mask_eliter.sum() >= 3:
      Val = reg_[reg_pat]
      #append Series to list
      df_val.append(pd.Series(np.where(mask_eliter,Val, 0), name=reg_pat))
      elites[reg_pat] = Val
print (pd.concat(df_val, axis=1))
             concat       a
0    svAxu$paxArWAn  0.0005
1   xvAxaSa$varRANi  0.0005
2     AxAna$xurbale  0.0005
3         go$BakwAH  0.0005
4       viXi$Bexena  0.0005
5      nIwi$kuSalaM  0.0005
6      lafkA$upamam  0.0005
7      yaSas$lipooH  0.0005
8       kaSa$AGAwam  0.0005
9   hewumaw$uwwaram  0.0005
10      varRa$pUgAn  0.0005
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252