Finding the count of a set of substrings in pandas dataframe

Question

I am given a set of substrings. I need to count how often each of those substrings occurs in a particular column of a dataframe. The relevant dataframe looks like this:

  training['concat']

  0 svAxu$paxArWAn
  1 xvAxaSa$varRANi
  2 AxAna$xurbale
  3 go$BakwAH
  4 viXi$Bexena
  5 nIwi$kuSalaM
  6 lafkA$upamam
  7 yaSas$lipsoH
  8 kaSa$AGAwam
  9 hewumaw$uwwaram
  10 varRa$pUgAn

My set of substrings is a dictionary, where the keys are the substrings and the values are the probabilities with which they occur:

  reg = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
  # The dictionary has 2000 entries

In particular, I need to find the substrings that occur more than twice.

I have written the following code, which performs the task. Is there a more elegant, Pythonic or pandas-specific way to achieve the same result? The current implementation takes quite some time to execute.

  elites = dict()
  for reg_pat in reg_:
      count = 0
      eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
      if eliter >= 3:
          elites[reg_pat] = reg_[reg_pat]

training has about 9000 rows. – Amrith Krishna, Sep 09 '16 at 04:43

score 2 · Accepted Answer · answered Sep 09 '16 at 05:21


You can use apply with the in operator instead of str.contains; it is faster:

reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}

elites = dict()
for reg_pat in reg_:
  if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
      elites[reg_pat] = reg_[reg_pat]

print (elites)
{'a': 0.0005}
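
Because `in` is a plain Python operation, the same row count can also be taken over a list of the column's values, which avoids one pandas call per pattern. This is only a minimal sketch: the small frame below is illustrative (a few rows from the question), and `min_rows` is a hypothetical name for the "more than twice" threshold.

```python
import pandas as pd

# Illustrative sample only: a few rows from the question's column.
training = pd.DataFrame({'concat': ['svAxu$paxArWAn', 'xvAxaSa$varRANi',
                                    'AxAna$xurbale', 'go$BakwAH']})
reg_ = {'anuBavAn': 0.35, 'a$piwra': 0.2, 'piwra': 0.7, 'pa': 0.03, 'a': 0.0005}

min_rows = 3  # hypothetical name: "more than twice" means at least 3 rows

# Convert once so the inner loop runs over plain Python strings.
strings = training['concat'].tolist()

# Keep a pattern when it is a substring of at least min_rows rows.
elites = {pat: prob for pat, prob in reg_.items()
          if sum(pat in s for s in strings) >= min_rows}
print(elites)
```

Whether this is faster than `apply` on the real 9000-row column would need to be measured.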

jezrael


Does `reg_pat in x` work if reg_pat is a regex pattern? – Amrith Krishna Sep 09 '16 at 06:18

No, it doesn't work then. You need `str.contains` if the pattern is a regex. – jezrael Sep 09 '16 at 06:20
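
One related point worth checking: `str.contains` treats its pattern as a regular expression by default, and keys such as 'a$piwra' contain `$`, which is the end-of-string anchor in a regex. For literal substrings, passing `regex=False` or escaping the pattern with `re.escape` avoids that. A small sketch, using an illustrative series and a hypothetical pattern `'i$B'`:

```python
import re
import pandas as pd

# Illustrative rows taken from the question's sample column.
s = pd.Series(['go$BakwAH', 'viXi$Bexena', 'nIwi$kuSalaM'])
pat = 'i$B'  # hypothetical literal substring that contains '$'

# Default: pattern is a regex, so '$' is an anchor and nothing matches.
as_regex = s.str.contains(pat).sum()

# Literal substring test, equivalent to `pat in x` per row.
as_literal = s.str.contains(pat, regex=False).sum()

# Still a regex search, but with '$' escaped to match the character itself.
escaped = s.str.contains(re.escape(pat)).sum()

print(as_regex, as_literal, escaped)
```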

score 2 · Answer 2 · edited May 23 '17 at 12:01

Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in check, but I could be wrong on that.

for substr in reg:
    # Boolean Series: True where the row's string contains substr
    totalStringAppearances = training['concat'].apply(lambda string: substr in string)
    totalStringAppearances = totalStringAppearances.sum()
    if totalStringAppearances > 2:
        reg[substr] = totalStringAppearances / len(training)
    else:
        pass  # do what you want to with the very rare substrings

Some gotchas:

If you wanted something like a substring 'a' in 'abcdefa' to return 2, then this will not work. It merely checks for existence of the substring in each string.
Inside the apply(), I am using a potentially unreliable exploitation of booleans. See this question for more details.
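
To make the first gotcha concrete, here is a small sketch contrasting "rows that contain the substring" with "total occurrences across rows". It assumes `Series.str.count` is an acceptable way to get the latter; that method takes a regex, so the substring is escaped first, and it counts non-overlapping matches. The sample series is illustrative only.

```python
import re
import pandas as pd

# Illustrative data, not from the question.
s = pd.Series(['abcdefa', 'xyz', 'banana'])
substr = 'a'

# Existence check per row: each row contributes at most 1.
rows_containing = s.apply(lambda string: substr in string).sum()

# Occurrence count per row: 'abcdefa' contributes 2, 'banana' contributes 3.
total_occurrences = s.str.count(re.escape(substr)).sum()

print(rows_containing, total_occurrences)
```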

Post-edit: Jezrael's answer is more complete as it uses the same variable names. But, in a simple case, regarding regex vs. apply and in, I validate his claim, and my presumption:
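
A comparison of this kind can be rerun with `timeit`. The following is only a sketch under stated assumptions: the column is built by repeating some of the question's sample rows to about 9000 entries, and the function names `with_contains` and `with_apply` are hypothetical. Timings depend on the machine and the pandas version, so no figures are given.

```python
import timeit
import pandas as pd

# Illustrative data: sample rows from the question, repeated to 9000 rows.
rows = ['svAxu$paxArWAn', 'xvAxaSa$varRANi', 'AxAna$xurbale', 'go$BakwAH',
        'viXi$Bexena', 'nIwi$kuSalaM', 'lafkA$upamam', 'yaSas$lipsoH',
        'kaSa$AGAwam']
col = pd.Series(rows * 1000)
pat = 'pa'  # no regex metacharacters, so both approaches agree

def with_contains():
    # Regex-based search, as in the question's code.
    return col.str.contains(pat).sum()

def with_apply():
    # Plain substring check per row, as in the answers.
    return col.apply(lambda x: pat in x).sum()

# Both approaches should report the same number of matching rows.
assert with_contains() == with_apply()

t_contains = timeit.timeit(with_contains, number=20)
t_apply = timeit.timeit(with_apply, number=20)
print(f'str.contains: {t_contains:.4f}s  apply + in: {t_apply:.4f}s')
```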
