Word frequency in a string without spaces and with special characters?

Question

Let's say I have the following string:

"hello&^uevfehello!`.<hellohow*howdhAreyou"

How would I go about counting the frequency of english words that are substrings of it? In this case I would want a result such as:

{'hello': 3, 'how': 2, 'are': 1, 'you': 1}

I searched previous question which were similar to this one but I couldn't really find anything that works. A close solution seemed to be using regular expressions, but it didn't work either. It might be because I was implementing it wrong since I'm not familiar with how it actually works.

How to find the count of a word in a string? it's the last answer

from collections import *
import re

Counter(re.findall(r"[\w']+", text.lower()))

I also tried creating a very bad function that iterates through every single possible arrangement of consecutive letters in the string (up to a max of 8 letters or so). The problem with doing that is

1) it's way longer than it should be and

2) it adds extra words. ex: if "hello" was in the string, "hell" would also be found.

I'm not very familiar with regex which is probably the right way to do this.

To count the frequency of english words, this is not sufficient. You'll have to use something like [ntlk](http://www.ntlk.org) and even then it'll be hard because you've got no separators for the words. — msvalkon, Feb 20 '14 at 08:58
Do you have a function or dictionary for identifying english words? — Jayanth Koushik, Feb 20 '14 at 09:00
I had a list of english words that I was comparing parts of the string with, but it didn't really help much. — Howcan, Feb 20 '14 at 09:06

score 2 · Accepted Answer · answered Feb 20 '14 at 09:33

2

d, w = "hello&^uevfehello!`.<hellohow*howdhAreyou", ["hello","how","are","you"]
import re, collections
pattern = re.compile("|".join(w), flags = re.IGNORECASE)
print collections.Counter(pattern.findall(d))

Output

Counter({'hello': 3, 'how': 2, 'you': 1, 'Are': 1})

answered Feb 20 '14 at 09:33

thefourtheye

233,700
52
457
497

@JayanthKoushik RegExs internally use state machines I believe. So, I am not very sure about the complexity. :( – thefourtheye Feb 21 '14 at 04:02
You're using a list of known words to compare (w), so technically I'd have to use a list of english words? – Howcan Feb 21 '14 at 04:05
@Howcan That is what I understood from this [comment](http://stackoverflow.com/questions/21902569/word-frequency-in-a-string-without-spaces-and-with-special-characters/21903657?noredirect=1#comment33170204_21902569) of yours – thefourtheye Feb 21 '14 at 04:38
you can use a dictionary or a library like enchant to make comparisons O(1). – Jayanth Koushik Feb 21 '14 at 05:47

score 0 · Answer 2 · answered Feb 20 '14 at 09:13

from collections import defaultdict

s = 'hello&^uevfehello!`.<hellohow*howdhAreyou'
word_counts = defaultdict(lambda: 0)

i = 0
while i < len(s):
    j = len(s)
    while j > i:
        if is_english_word(s[i:j]):
            word_counts[s[i:j]] += 1
            break
        j -= 1

    if j == i:
        i += 1
    else:
        i = j

print word_counts

score 0 · Answer 3 · edited May 23 '17 at 11:57

You need to extract all words from the string, then for each word you need to find substrings and then check if any of the substring is english word. I have used english dictionary from answer in How to check if a word is an English word with Python?

There are some false positives in the result however so you may want to use better dictionary or have a custom method to check for desired words.

import re
import enchant
from collections import defaultdict

# Get all substrings in given string.
def get_substrings(string):
    for i in range(0, len(string)):
        for j in range(i, len(string)):
            yield s[i:j+1]

text = "hello&^uevfehello!`.<hellohow*howdhAreyou"

strings = re.split(r"[^\w']+", text.lower())

# Use english dictionary to check if a word exists.
dictionary = enchant.Dict("en_US")
counts = defaultdict(int)
for s in strings:
  for word in get_substrings(s):
      if (len(word) > 1 and dictionary.check(word)):
          counts[word] += 1

print counts

Output:

defaultdict(, {'are': 1, 'oho': 1, 'eh': 1, 'ell': 3, 'oh': 1, 'lo': 3, 'll': 3, 'yo': 1, 'how': 2, 'hare': 1, 'ho': 2, 'ow': 2, 'hell': 3, 'you': 1, 'ha': 1, 'hello': 3, 're': 1, 'he': 3})

But that is not the desired output. Hell should be ignored if it is followed by an o etc. In general, all words which are substrings of other words in the string should be ignored. — Jayanth Koushik, Feb 21 '14 at 03:59

Word frequency in a string without spaces and with special characters?

3 Answers3