2

Let's say I have the following string:

"hello&^uevfehello!`.<hellohow*howdhAreyou"

How would I go about counting the frequency of english words that are substrings of it? In this case I would want a result such as:

{'hello': 3, 'how': 2, 'are': 1, 'you': 1}

I searched previous question which were similar to this one but I couldn't really find anything that works. A close solution seemed to be using regular expressions, but it didn't work either. It might be because I was implementing it wrong since I'm not familiar with how it actually works.

How to find the count of a word in a string? it's the last answer

from collections import *
import re

Counter(re.findall(r"[\w']+", text.lower()))

I also tried creating a very bad function that iterates through every single possible arrangement of consecutive letters in the string (up to a max of 8 letters or so). The problem with doing that is

1) it's way longer than it should be and

2) it adds extra words. ex: if "hello" was in the string, "hell" would also be found.

I'm not very familiar with regex which is probably the right way to do this.

Community
  • 1
  • 1
Howcan
  • 313
  • 3
  • 5
  • 16
  • To count the frequency of english words, this is not sufficient. You'll have to use something like [ntlk](http://www.ntlk.org) and even then it'll be hard because you've got no separators for the words. – msvalkon Feb 20 '14 at 08:58
  • Do you have a function or dictionary for identifying english words? – Jayanth Koushik Feb 20 '14 at 09:00
  • I had a list of english words that I was comparing parts of the string with, but it didn't really help much. – Howcan Feb 20 '14 at 09:06
  • @Howcan Please show us the list of english words you have. – thefourtheye Feb 20 '14 at 09:26

3 Answers3

2
d, w = "hello&^uevfehello!`.<hellohow*howdhAreyou", ["hello","how","are","you"]
import re, collections
pattern = re.compile("|".join(w), flags = re.IGNORECASE)
print collections.Counter(pattern.findall(d))

Output

Counter({'hello': 3, 'how': 2, 'you': 1, 'Are': 1})
thefourtheye
  • 233,700
  • 52
  • 457
  • 497
  • @JayanthKoushik RegExs internally use state machines I believe. So, I am not very sure about the complexity. :( – thefourtheye Feb 21 '14 at 04:02
  • You're using a list of known words to compare (w), so technically I'd have to use a list of english words? – Howcan Feb 21 '14 at 04:05
  • @Howcan That is what I understood from this [comment](http://stackoverflow.com/questions/21902569/word-frequency-in-a-string-without-spaces-and-with-special-characters/21903657?noredirect=1#comment33170204_21902569) of yours – thefourtheye Feb 21 '14 at 04:38
  • you can use a dictionary or a library like enchant to make comparisons O(1). – Jayanth Koushik Feb 21 '14 at 05:47
0
from collections import defaultdict

s = 'hello&^uevfehello!`.<hellohow*howdhAreyou'
word_counts = defaultdict(lambda: 0)

i = 0
while i < len(s):
    j = len(s)
    while j > i:
        if is_english_word(s[i:j]):
            word_counts[s[i:j]] += 1
            break
        j -= 1

    if j == i:
        i += 1
    else:
        i = j

print word_counts
Jayanth Koushik
  • 9,476
  • 1
  • 44
  • 52
0

You need to extract all words from the string, then for each word you need to find substrings and then check if any of the substring is english word. I have used english dictionary from answer in How to check if a word is an English word with Python?

There are some false positives in the result however so you may want to use better dictionary or have a custom method to check for desired words.

import re
import enchant
from collections import defaultdict

# Get all substrings in given string.
def get_substrings(string):
    for i in range(0, len(string)):
        for j in range(i, len(string)):
            yield s[i:j+1]

text = "hello&^uevfehello!`.<hellohow*howdhAreyou"

strings = re.split(r"[^\w']+", text.lower())

# Use english dictionary to check if a word exists.
dictionary = enchant.Dict("en_US")
counts = defaultdict(int)
for s in strings:
  for word in get_substrings(s):
      if (len(word) > 1 and dictionary.check(word)):
          counts[word] += 1

print counts

Output:

defaultdict(, {'are': 1, 'oho': 1, 'eh': 1, 'ell': 3, 'oh': 1, 'lo': 3, 'll': 3, 'yo': 1, 'how': 2, 'hare': 1, 'ho': 2, 'ow': 2, 'hell': 3, 'you': 1, 'ha': 1, 'hello': 3, 're': 1, 'he': 3})

Community
  • 1
  • 1
Ashwinee K Jha
  • 9,187
  • 2
  • 25
  • 19
  • But that is not the desired output. Hell should be ignored if it is followed by an o etc. In general, all words which are substrings of other words in the string should be ignored. – Jayanth Koushik Feb 21 '14 at 03:59