Count appearances of a multi-word substring in some text

Question

So for a single word substring count in some text, I can use some_text.split().count(single_word_substring). How can I do that for a multi-word substring count in some text?

Examples:

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'going to school'

count should be 3.

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'going to'

count should be 3.

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'go'

count should be 0.

text = 'he is going to school. abc-xyz is going to school. xyz is going to school.'
to_be_found = 'school'

count should be 3.

text = 'he is going to school. abc-xyz is going to school. xyz is going to school.'
to_be_found = 'abc-xyz'

count should be 1.

Assumption 1: Everything is lower-case.
Assumption 2: The text can contain anything.
Assumption 3: The string to be found can contain anything too. For example, car with 4 passengers, xyz & abc, etc.

NOTE: REGEX based solutions are acceptable. I am just curious if it's possible without regex (nice to have and just for others who may be interested in this in future).

Have you tried using [`re.findall`](https://docs.python.org/3/library/re.html#re.findall)? — kingkupps, Jan 27 '21 at 22:20
Perhaps `re.findall(fr'\b{to_be_found}\b', text)` and take the `len` of the result? — Nick, Jan 27 '21 at 22:21
This works but seems to be a bit slower. Maybe you should add it as an answer so I can accept it. It could be useful for other OPs. — utengr, Jan 27 '21 at 22:32
I have an answer with regex, but I took it down because you said you are not looking for a regex option. Should I put it back or not? — Jakub Szlaur, Jan 27 '21 at 22:36
Currently I am working on answer without the regex module ... — Jakub Szlaur, Jan 27 '21 at 22:36
Both are acceptable answers (with/with out regex). I couldn't figure one out without regex so I am interested in that one more but for community, both should be there. — utengr, Jan 27 '21 at 22:40
considering special characters like full-stop as part of word makes it tricky to handle without regex. It is doable, but not worth it when compared to regex based solution already shared above — Moinuddin Quadri, Jan 27 '21 at 22:44
@Anonymous In to be found, full stop is not part of the string in this example. However, the main text can contain anything. To be found can also contain special characters such as & or -. — utengr, Jan 27 '21 at 22:48
I also just tested `re.findall(fr'\b{to_be_found}\b', text)` with `to_be_found = 'school.'` and it returns only 1, whereas it should return 3. — utengr, Jan 27 '21 at 22:53
My comment is about the first sentence, `"he is going to school. abc is going to school."`. You need an exact match of `to_be_found` rather than a plain substring match, but you want to treat `.` as optional. Splitting the string into words keeps `.` as part of the word, giving `school.`, which will not count as an exact match for `school`. One way to handle this is to remove all special characters from the string, but doing so without regex requires iterating over the entire string one character at a time. Then you can call `str.count()` with `to_be_found`. — Moinuddin Quadri, Jan 27 '21 at 22:54
all this is not worth it when you can achieve it with `re.findall()`. Regarding your last comment, you need to replace `.` with `\.` in regex expression, i.e `school\.`, as single `.` has special meaning in regex — Moinuddin Quadri, Jan 27 '21 at 22:57
@Anonymous Oh got it now, thanks for the explanation. You are right, it gets complicated without regex. The original text contains lots of such special symbols connected to words so I guess it may not be possible without regex. — utengr, Jan 27 '21 at 22:58
@utengr the suggestion I made won't work for something like `school.` as there is no word boundary at the end. — Nick, Jan 27 '21 at 23:07
@Nick you should add your answer so I can accept it since the rest of the answers are not what I am looking for. — utengr, Jan 29 '21 at 10:24
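To illustrate the points raised in the comments above, here is a minimal sketch of the `\b` suggestion with `re.escape` added so that characters such as `.` are taken literally. The helper name `count_with_word_boundaries` is made up for this illustration. It shows that the pattern handles the plain examples but finds nothing for `school.`, because there is no word boundary between `.` and a following space or the end of the text.

```python
import re

text = 'he is going to school. abc is going to school. xyz is going to school.'

def count_with_word_boundaries(text, to_be_found):
    # Hypothetical helper: the \b pattern from the comments, with the
    # search string escaped so '.' and similar characters stay literal.
    pattern = rf'\b{re.escape(to_be_found)}\b'
    return len(re.findall(pattern, text))

print(count_with_word_boundaries(text, 'going to school'))  # 3
print(count_with_word_boundaries(text, 'go'))               # 0
print(count_with_word_boundaries(text, 'school.'))          # 0, no \b after '.'
```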

score 1 · Accepted Answer · answered Jan 31 '21 at 19:57

Here's a working solution using regex:

import re

def occurrences(text, to_be_found):
    # re.escape keeps characters such as '.' or '+' in to_be_found literal
    return len(re.findall(rf'\W{re.escape(to_be_found)}\W', text))

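The capital `\W` in regex matches any non-word character, which covers spaces and other punctuation. Note that it requires an actual character on each side of the match, so an occurrence at the very start or very end of the text is not counted.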
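A possible variation on this answer, sketched here rather than taken from it: replacing `\W` with the lookarounds `(?<!\w)` and `(?!\w)` checks the neighbouring characters without consuming them, so matches at the start or end of the text, and matches that share a single separator, are counted too. The function name `count_whole_matches` is an assumption for this example.

```python
import re

def count_whole_matches(text, to_be_found):
    # Hypothetical helper, not the answer's original function.
    # (?<!\w) : no word character directly before the match
    # (?!\w)  : no word character directly after the match
    pattern = rf'(?<!\w){re.escape(to_be_found)}(?!\w)'
    return len(re.findall(pattern, text))

text = 'he is going to school. abc is going to school. xyz is going to school.'
print(count_whole_matches(text, 'going to school'))  # 3
print(count_whole_matches(text, 'go'))               # 0
print(count_whole_matches(text, 'he'))               # 1, match at the start of the text
print(count_whole_matches(text, 'school.'))          # 3
```

Because the lookarounds are zero-width, `count_whole_matches('go go go', 'go')` gives 3, whereas a pattern that consumes a `\W` on each side finds only the middle one.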

answered Jan 31 '21 at 19:57

Byron

Let me test this on a couple of cases and will accept the answer then. thanks – utengr Feb 02 '21 at 09:49

score 0 · Answer 2 · answered Jan 27 '21 at 22:39

The best native way to count a substring is still `str.count`. It also works with multi-word substrings:
```
text = 'he is going to school. abc is going to school. xyz is going to school.'
text.count('going to school') # 3
text.count('going to') # 3
text.count('school') # 3
text.count('go') # 3
```
For the 'go' case, if you need 0, you can search for 'go ', ' go', or ' go ' so that it only matches as a separate word.
You can also write your own method that searches character by character: https://stackoverflow.com/a/30863956/15080484
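One way to get whole-word counting without regex is to compare word lists instead of raw substrings. The following is only a sketch under some assumptions: the helper name `count_phrase` and its `trailing` parameter are invented here, words are separated by whitespace, and only the punctuation characters `,.!?` are stripped from the ends of each word in the text.

```python
def count_phrase(text, phrase, trailing=',.!?'):
    # Hypothetical helper. Split on whitespace, then drop punctuation that
    # is attached to the ends of each word ('school.' -> 'school').
    words = [w.strip(trailing) for w in text.split()]
    target = phrase.split()
    n = len(target)
    if n == 0:
        return 0
    # Slide a window of n words over the text and compare word for word
    return sum(words[i:i + n] == target for i in range(len(words) - n + 1))

text = 'he is going to school. abc-xyz is going to school. xyz is going to school.'
print(count_phrase(text, 'going to school'))  # 3
print(count_phrase(text, 'go'))               # 0
print(count_phrase(text, 'abc-xyz'))          # 1
```

This sketch has limits: punctuation between the words of a phrase is ignored, and a phrase that itself ends in one of the stripped characters (such as `school.`) is never found, because the phrase is not stripped the same way.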

answered Jan 27 '21 at 22:39

jeffry_bo

`text.count('go')` returns 3, but OP wants 0 as result. it needs to be exact word match. Work around for it is to add space before and after the search term, but then you won't be able to match with special character like full stop `.` – Moinuddin Quadri Jan 27 '21 at 22:41
It's not going to work. This works only for single word substring. For go, the answer should be zero, not 3. – utengr Jan 27 '21 at 22:41

score 0 · Answer 3 · answered Jan 27 '21 at 22:44

You can try this:

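text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'going to school'
i = 0  # position to resume searching from
r = 0  # number of occurrences found so far
while True:
    pos = text.find(to_be_found, i)
    if pos < 0:  # no further occurrence
        break
    r += 1
    i = pos + len(to_be_found)  # continue after the current match

# Same result as text.count(to_be_found): matches inside longer words count too
print(r)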

answered Jan 27 '21 at 22:44

Belhadjer Samir

I hope this helped you – Belhadjer Samir Jan 27 '21 at 23:06

Jakub Szlaur · Answer 4 · 2021-01-27T23:05:08.930

I managed to make it work with this code (though it is not Pythonic at all):

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'going to school'

def find_occurences(text, look_for):
    spec = [',', '.', '!', '?']
    where = 0
    how_many = 0

    if look_for not in text:
        return how_many

    while True:
        i = text.find(look_for, where)
        if i == -1:  # No further match
            break

        end = i + len(look_for)
        # Neighbouring characters; '' stands for the start or end of the text,
        # which avoids wrapping around at i == 0 and an IndexError at the end
        before = text[i - 1] if i > 0 else ''
        after = text[end] if end < len(text) else ''

        # Check if the match is really alone (a space, or nothing, on both sides)
        alone = before in ('', ' ') and after in ('', ' ')
        # Check if the match is followed by one of ,.!? and preceded by a
        # space, nothing, or one of ,.!?
        punctuated = (before in spec or before in ('', ' ')) and after in spec

        if alone or punctuated:
            how_many += 1
        where = end

    return how_many

print("'{}' was in '{}' this many times: {}".format(to_be_found, text, find_occurences(text, to_be_found)))

The first condition checks that the match has a space on both sides, i.e. that it stands alone.
The second condition checks that the match is followed by one of the special characters `,.!?` and preceded by either a space or one of those special characters.
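The same find-and-check idea can be generalised so that any character that is not a letter, digit, or underscore counts as a boundary, roughly mirroring what a regex word boundary does. This is a sketch rather than part of the answer; `count_standalone` and `is_word_char` are names introduced for the example.

```python
def count_standalone(text, look_for):
    # Hypothetical helper: counts matches whose neighbours are not
    # letters, digits, or underscores (or that touch the text's edges).
    def is_word_char(ch):
        return ch.isalnum() or ch == '_'

    count = 0
    start = 0
    while look_for:  # an empty search string is never counted
        i = text.find(look_for, start)
        if i == -1:
            break
        end = i + len(look_for)
        left_ok = i == 0 or not is_word_char(text[i - 1])
        right_ok = end == len(text) or not is_word_char(text[end])
        if left_ok and right_ok:
            count += 1
            start = end      # resume after an accepted match
        else:
            start = i + 1    # rejected: retry from the next character
    return count

text = 'he is going to school. abc is going to school. xyz is going to school.'
print(count_standalone(text, 'going to school'))  # 3
print(count_standalone(text, 'go'))               # 0
print(count_standalone(text, 'school.'))          # 3
```

Note that with this boundary rule a search for `abc` would also be counted inside `abc-xyz`, since `-` is not a word character; whether that is wanted depends on the data.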

Example 1:

to_be_found = 'going to school'
Output1: 3

Example 2:

to_be_found = 'going to'
Output2: 3

Example 3:

to_be_found = 'go'
Output3: 0

Example 4:

to_be_found = 'school'
Output4: 3

If you have any suggestions, or inputs that it does not work for, please comment on the answer and I can edit it :) — Jakub Szlaur, Jan 27 '21 at 23:06
Thanks for your effort. You are right about the Pythonic part :) — utengr, Feb 03 '21 at 22:11
