0

For example,

text = 'huwefggthisisastringhef'

I'd like to return True or False depending on whether the string contains an English word. E.g.

if detectEnglish(text):
    print('contains english')
pacholik
J. Lloyd
  • 1
    Try a simpler question first: how would you determine whether a string *is* an English word? – Beta Sep 23 '17 at 11:04
  • 3
    Not sure if this will help: https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python –  Sep 23 '17 at 11:05
  • 1
    To detect the language which is used in a string, you can try this library: https://pypi.python.org/pypi/langdetect – masoud Sep 23 '17 at 11:05
  • Yeah, I've looked at text files with dictionaries and at pyenchant, but I have no idea how to actually split the string in a way that lets me check whether it contains a word – J. Lloyd Sep 23 '17 at 11:06
  • 1
    You could start from testing all possible substrings – Marco Sep 23 '17 at 11:17
  • I think you should look at a `wordlists dictionary`, which has a list of all commonly used words, and then match against it. – Kaushik NP Sep 23 '17 at 11:24

4 Answers

4

This finds all English words of at least three characters in the text:

import enchant
d = enchant.Dict('en_US')

text = 'huwefggthisisastringhef'
l = len(text)

for i in range(l):
    for j in range(i+3, l+1):
        if d.check(text[i:j]):
            print(text[i:j])

It does this by testing every possible substring of at least three characters (only 231 candidates for a 23-character string).
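To get the True/False result the question asks for, the same substring scan can be wrapped in a function. This is a minimal sketch, not a definitive implementation: `contains_english` is a hypothetical name, `is_word` is any callable that says whether a string is a word (with pyenchant, the `check` method of an `enchant.Dict` should fit), and the tiny word set is made up so the example runs without a dictionary installed.

```python
def contains_english(text, is_word, min_len=3):
    """Return True if any substring of at least min_len characters is a word.

    is_word is an assumed callable: str -> bool.
    """
    n = len(text)
    # any() stops at the first hit, so not every substring is necessarily tested.
    return any(
        is_word(text[i:j])
        for i in range(n)
        for j in range(i + min_len, n + 1)
    )


# Illustrative stand-in for a real dictionary (assumption, not real data).
sample_words = {"this", "string"}

result = contains_english("huwefggthisisastringhef", sample_words.__contains__)
if result:
    print("contains english")
```

Keeping the checker as a parameter means the scan itself does not depend on which dictionary backend is used.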

pacholik
  • 1
    Here is a one-liner that put all matches in a list (based on your example): **[text[i:j] for i in range(l) for j in range(l+1) if len(text[i:j]) >=3 and d.check(text[i:j])]** – Anton vBR Sep 23 '17 at 11:55
0

There are probably better ways to do this, but if you don't need any information about the words that are found, you can do the following.

This project on GitHub has over 466K words in a simple text file. You open the text file, read its contents into memory, and look up combinations of letters from your string.

If you want, you can organise this file into nested dictionaries, but to be honest, if the text is very random the search may still be computationally expensive.
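As a rough sketch of the word-list approach described above: load the words into a set, then test each substring against it. The function names and the inline sample lines are assumptions for illustration; in practice the lines would come from the word file (for example by passing an open file object to `load_words`).

```python
def load_words(lines, min_len=3):
    """Build a lowercase set of words from an iterable of lines."""
    return {w.strip().lower() for w in lines if len(w.strip()) >= min_len}


def find_words(text, vocabulary, min_len=3):
    """Return every substring of at least min_len characters found in vocabulary."""
    text = text.lower()
    n = len(text)
    return [
        text[i:j]
        for i in range(n)
        for j in range(i + min_len, n + 1)
        if text[i:j] in vocabulary  # set membership is O(1) on average
    ]


# Stand-in for the contents of a word file (made-up sample, one word per line).
sample_lines = ["this\n", "is\n", "string\n", "ring\n", "a\n"]
vocabulary = load_words(sample_lines)

found = find_words("huwefggthisisastringhef", vocabulary)
print(found)
```

A set keeps each lookup cheap, so the cost is dominated by the number of substrings rather than the size of the word list.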

I hope this answer was a bit helpful.

Strong will
0

A trie regex could help you. You could filter the wordbook by length first, in order to avoid matching ['h', 'u', 'we', 'f', 'g', 'g', 'this', 'is', 'as', 't', 'ring', 'he', 'f']:

# encoding: utf-8
import re
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook if len(word.strip()) >= 3]

trie = Trie()
for word in english_words:
    trie.add(word)
test_word = "huwefggthisisastringhef"
print(re.findall(trie.pattern(), test_word))
# ['this', 'string']

It takes a few seconds to create the regex but the search itself is extremely fast, and should be more efficient than simply looping over every substring.

print(re.findall(trie.pattern(), "sdgfsdfgkjslfkgjsdkfgjsdbbqdsfghiddenwordsadfgsdfgsdfgsdfgsdtqtrwerthg"))
# ['hidden', 'words']
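The `trie` module imported above is not part of the standard library, and its source is not shown in this answer. As an assumption about what such a class could look like, here is a minimal sketch exposing the same `add` and `pattern` methods; it is not necessarily the implementation used above, and the four-word list is made up for illustration.

```python
import re


class Trie:
    """Minimal trie that can emit a regex matching any stored word (sketch)."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks the end of a word

    def _pattern(self, node):
        alternatives = []
        optional = False
        for ch, child in sorted(node.items()):
            if ch == "":
                optional = True  # a word may end here
            else:
                alternatives.append(re.escape(ch) + self._pattern(child))
        if not alternatives:
            return ""
        body = "(?:" + "|".join(alternatives) + ")"
        # If a word ends at this node, the longer continuations are optional.
        return body + "?" if optional else body

    def pattern(self):
        return self._pattern(self.root)


trie = Trie()
for word in ["this", "string", "hidden", "words"]:  # illustrative word list
    trie.add(word)

matches = re.findall(trie.pattern(), "huwefggthisisastringhef")
print(matches)
```

Because shared prefixes are merged, the regex engine can reject a starting position after a character or two instead of trying each word separately.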
Eric Duminil
0

Based on the accepted answer, here is a minor modification I thought could be valuable to share:

import enchant

d = enchant.Dict('en_US')
text = 'huwefggthisisastringhef'
l = len(text)
words = {text[i:j]: range(i, j) for i in range(l) for j in range(i + 3, l + 1) if d.check(text[i:j])}

print(words)

This returns a dictionary with the words and their index ranges. It can, for instance, be used to check which words intersect.

{'this': range(7, 11), 
'his': range(8, 11), 
'sis': range(10, 13), 
'string': range(14, 20), 
'ring': range(16, 20)}
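One way the intersection check mentioned above might look, as a sketch: two ranges overlap when each starts before the other ends. The helper name `overlapping_pairs` is hypothetical, and the dictionary is copied from the output above so the example runs without pyenchant.

```python
from itertools import combinations

# Copied from the output above.
words = {
    "this": range(7, 11),
    "his": range(8, 11),
    "sis": range(10, 13),
    "string": range(14, 20),
    "ring": range(16, 20),
}


def overlapping_pairs(spans):
    """Return pairs of words whose index ranges share at least one position."""
    pairs = []
    for (a, ra), (b, rb) in combinations(spans.items(), 2):
        # Half-open ranges overlap when each one starts before the other stops.
        if ra.start < rb.stop and rb.start < ra.stop:
            pairs.append((a, b))
    return pairs


pairs = overlapping_pairs(words)
print(pairs)
```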
Anton vBR