4

To count a single-word substring in some text, I can use `some_text.split().count(single_word_substring)`. How can I do the same for a multi-word substring?

Examples:

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'going to school'

count should be 3.

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'going to'

count should be 3.

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'go'

count should be 0.

text = 'he is going to school. abc-xyz is going to school. xyz is going to school.'
to_be_found = 'school'

count should be 3.

text = 'he is going to school. abc-xyz is going to school. xyz is going to school.'
to_be_found = 'abc-xyz'

count should be 1.

Assumption 1: Everything is lower-case. Assumption 2: The text can contain anything. Assumption 3: `to_be_found` can contain anything too, for example `car with 4 passengers` or `xyz & abc`.

NOTE: REGEX based solutions are acceptable. I am just curious if it's possible without regex (nice to have and just for others who may be interested in this in future).

utengr
  • Have you tried using [`re.findall`](https://docs.python.org/3/library/re.html#re.findall)? – kingkupps Jan 27 '21 at 22:20
  • Perhaps `re.findall(fr'\b{to_be_found}\b', text)` and take the `len` of the result? – Nick Jan 27 '21 at 22:21
  • This works but seems to be a bit slower. Maybe you should add it as an answer so I can accept it. It could be useful for other users. – utengr Jan 27 '21 at 22:32
  • I have an answer with regex, but I took it down because you said you are not looking for a regex option. Should I put it back or not? – Jakub Szlaur Jan 27 '21 at 22:36
  • Currently I am working on an answer without the regex module ... – Jakub Szlaur Jan 27 '21 at 22:36
  • Both are acceptable answers (with/without regex). I couldn't figure one out without regex, so I am more interested in that one, but for the community both should be there. – utengr Jan 27 '21 at 22:40
  • Treating special characters like the full stop as part of a word makes this tricky to handle without regex. It is doable, but not worth it compared to the regex-based solution already shared above – Moinuddin Quadri Jan 27 '21 at 22:44
  • @Anonymous In `to_be_found`, the full stop is not part of the string in this example. However, the main text can contain anything. `to_be_found` can also contain special characters such as & or -. – utengr Jan 27 '21 at 22:48
  • I also just tested `re.findall(fr'\b{to_be_found}\b', text)` with `to_be_found = 'school.'` and it returns only 1 whereas it should return 3. – utengr Jan 27 '21 at 22:53
  • My comment is about the first sentence `"he is going to school. abc is going to school."`. You need an exact match of `to_be_found` rather than just a substring match, but you want to treat `.` as optional. Splitting the string into words keeps `.` as part of the word, giving `school.`, which won't count as an exact match for `school`. One way to handle this is to remove all special characters from the string, but doing so without regex requires iterating over the entire string (one char at a time). Then you can use `str.count()` on `to_be_found`. – Moinuddin Quadri Jan 27 '21 at 22:54
  • All this is not worth it when you can achieve it with `re.findall()`. Regarding your last comment, you need to replace `.` with `\.` in the regex, i.e. `school\.`, as a single `.` has a special meaning in regex – Moinuddin Quadri Jan 27 '21 at 22:57
  • @Anonymous Oh got it now, thanks for the explanation. You are right, it gets complicated without regex. The original text contains lots of such special symbols connected to words so I guess it may not be possible without regex. – utengr Jan 27 '21 at 22:58
  • It is possible, but not optimal – Moinuddin Quadri Jan 27 '21 at 22:59
  • @utengr the suggestion I made won't work for something like `school.` as there is no word boundary at the end. – Nick Jan 27 '21 at 23:07
  • @Nick you should add your answer so I can accept it since the rest of the answers are not what I am looking for. – utengr Jan 29 '21 at 10:24
  • @utengr but my answer will not work for `school.`. – Nick Jan 29 '21 at 12:14

4 Answers

1

Here's a working solution using regex:

import re

def occurrences(text, to_be_found):
    return len(re.findall(rf'\W{to_be_found}\W', text))

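The capital `\W` in regex matches any non-word character, which covers spaces and most punctuation. Note that it needs an actual character on each side, so a match at the very start or end of the text is not counted, and `to_be_found` is inserted unescaped, so characters such as `.` keep their regex meaning.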
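A possible variation on the same idea, as a minimal sketch rather than a definitive fix: escape the phrase with `re.escape` and check the neighbours with negative lookarounds instead of consuming them. The function name `count_whole_phrase` is made up for this example, and "boundary" here is assumed to mean "not next to a `\w` character".

```python
import re

def count_whole_phrase(text, to_be_found):
    # re.escape makes characters such as '.' or '&' match literally.
    # (?<!\w) and (?!\w) assert that no word character sits directly
    # before or after the match, without consuming any character, so the
    # start and end of the text also count as boundaries.
    pattern = rf'(?<!\w){re.escape(to_be_found)}(?!\w)'
    return len(re.findall(pattern, text))

text = 'he is going to school. abc-xyz is going to school. xyz is going to school.'
for phrase in ('going to school', 'going to', 'go', 'school', 'abc-xyz', 'school.'):
    print(repr(phrase), count_whole_phrase(text, phrase))
```

With the text from the question this gives 3, 3, 0, 3, 1 and 3 respectively, including the `'school.'` case raised in the comments.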

Byron
0
  1. The best built-in way to count a substring is still `str.count`. It also works with multi-word substrings:

    text = 'he is going to school. abc is going to school. xyz is going to school.'
    text.count('going to school') # 3
    text.count('going to') # 3
    text.count('school') # 3
    text.count('go') # 3
    

    For `'go'`, if you need 0, you can search for `'go '`, `' go'` or `' go '` instead, so that it only matches as a separate word.

  2. You can also write your own function that searches character by character: https://stackoverflow.com/a/30863956/15080484
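To see how far the space-padding idea from point 1 goes, here is a small sketch. The helper name `padded_count` is made up for this example; it pads both the text and the phrase with one space so that a match at the start or end of the text is still delimited.

```python
text = 'he is going to school. abc is going to school. xyz is going to school.'

def padded_count(text, phrase):
    # Surround both strings with spaces so the phrase only matches when it
    # has a space on each side.
    return f' {text} '.count(f' {phrase} ')

print(padded_count(text, 'go'))        # 0: 'go' inside 'going' is no longer counted
print(padded_count(text, 'going to'))  # 3
print(padded_count(text, 'school'))    # 0: every 'school' is followed by '.'
```

So padding handles the `'go'` case, but it misses words that touch punctuation, which is the limitation pointed out in the comments.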

jeffry_bo
  • `text.count('go')` returns 3, but the OP wants 0 as the result; it needs to be an exact word match. A workaround is to add a space before and after the search term, but then you won't be able to match next to a special character like the full stop `.` – Moinuddin Quadri Jan 27 '21 at 22:41
  • It's not going to work. This works only for single word substring. For go, the answer should be zero, not 3. – utengr Jan 27 '21 at 22:41
0

You can try this:

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'going to school'
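i = 0  # position to resume searching from
r = 0  # number of matches found so far
while True:
    pos = text.find(to_be_found, i)
    if pos < 0:  # no further match
        break
    r += 1
    i = pos + len(to_be_found)  # continue right after this match

# Counts plain substring matches, so 'go' would also match inside 'going'
print(r)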
Belhadjer Samir
0

Managed to make it work with this code (though it is not very Pythonic):

text = 'he is going to school. abc is going to school. xyz is going to school.'
to_be_found = 'going to school'

def find_occurences(text, look_for):
    spec = [',', '.', '!', '?']
    where = 0
    how_many = 0

    # Use the parameter (not the global to_be_found); an empty phrase never matches
    if not look_for or look_for not in text:
        return how_many

    while True:
        i = text.find(look_for, where)

        if i == -1:  # No more matches
            break

        end = i + len(look_for)
        # Characters next to the match; the start/end of the text counts as a space
        before = text[i - 1] if i > 0 else " "
        after = text[end] if end < len(text) else " "

        # Count the match only if it stands alone on both sides
        if (before == " " or before in spec) and (after == " " or after in spec):
            how_many += 1
        where = end

    return how_many

print("'{}' was in '{}' this many times: {}".format(to_be_found, text, find_occurences(text, to_be_found)))
  1. `before` and `after` hold the characters directly to the left and right of the match. The start or end of the text is treated as a space, so a match at the very beginning or end of the text is handled too.
  2. The condition `(before == " " or before in spec) and (after == " " or after in spec)` counts a match only when each neighbour is a space or one of the special characters `,.!?`, so `'go'` inside `'going'` is skipped.

Example 1:

to_be_found = 'going to school'
Output1: 3

Example 2:

to_be_found = 'going to'
Output2: 3

Example 3:

to_be_found = 'go'
Output3: 0

Example 4:

to_be_found = 'school'
Output4: 3
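
If the fixed list of special characters is too narrow (the question says the text can contain anything), one possible generalisation is to treat every non-alphanumeric neighbour as a boundary. This is only a sketch under that assumption; the name `count_phrase` is made up for this example. Unlike regex `\w`, `str.isalnum` does not treat `_` as part of a word.

```python
def count_phrase(text, phrase):
    """Count occurrences of phrase that are not glued to a letter or digit."""
    if not phrase:
        return 0
    count = 0
    start = 0
    while True:
        i = text.find(phrase, start)
        if i == -1:
            return count
        end = i + len(phrase)
        # A side is a boundary if it is the edge of the text or the
        # neighbouring character is not alphanumeric.
        left_ok = i == 0 or not text[i - 1].isalnum()
        right_ok = end == len(text) or not text[end].isalnum()
        if left_ok and right_ok:
            count += 1
        # Advance by one character so every candidate position is examined.
        start = i + 1

text = 'he is going to school. abc-xyz is going to school. xyz is going to school.'
print(count_phrase(text, 'going to school'))  # 3
print(count_phrase(text, 'go'))               # 0
print(count_phrase(text, 'abc-xyz'))          # 1
print(count_phrase(text, 'school.'))          # 3
```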
Jakub Szlaur