
I have an input string, parts of which are not actual words (for example, it contains mathematical formulas such as x^2 = y_2 + 4). I would like a way to split the string into alternating substrings of English words and non-English content. For example:

If my string was:

"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"

then I would like it split into a list like:

["Taking the derivative of: ", "f(x) = \int_{0}^{1} z^3, ", "we can see that we always get ", "x^2 = y_2 + 4 ", "which is the same as taking the double integral of ", "g(x)"]

How can I accomplish this? I don't think regex will work for this, or at least I'm not aware of any method in regex that detects the longest substrings of English words (including commas, periods, semicolons, etc).

graphtheory123

1 Answer


You can simply use the pyenchant library, as mentioned in this post:

import enchant
d = enchant.Dict("en_US")
print(d.check("Hello"))

Output:

True

You can install it by typing pip install pyenchant in your command line. In your case, you have to loop through all the whitespace-separated tokens in the string and check whether each one is an English word. Here is the full code to do it:

import enchant
d = enchant.Dict("en_US")

# Raw string, so the backslash in \int is kept literally
text = r"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"

tokens = text.split()
wordlst = []

# Keep only the tokens that the dictionary recognizes
for token in tokens:
    if d.check(token):
        wordlst.append(token)

print(wordlst)

Output:

['Taking', 'the', 'derivative', 'we', 'can', 'see', 'that', 'we', 'always', 'get', '4', 'which', 'is', 'the', 'same', 'as', 'taking', 'the', 'double', 'integral', 'of']
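
Note that the filter above keeps only the dictionary words, whereas the question asks for the string to be split into alternating runs. One possible way to get there is to group consecutive tokens by whether they pass the check. The following is only a minimal sketch under some assumptions: the names split_by_english, looks_english and KNOWN are made up for illustration, KNOWN is a tiny stand-in vocabulary so the example runs without pyenchant, and a token counts as English only if it is purely alphabetic after stripping surrounding punctuation (so "of:" counts as a word, while "4" and "=" do not).

```python
from itertools import groupby
import string


def split_by_english(text, is_word):
    """Group consecutive whitespace-separated tokens into runs that are
    either all English-looking or all non-English-looking."""

    def looks_english(token):
        # Strip surrounding punctuation so "of:" is checked as "of".
        core = token.strip(string.punctuation)
        # isalpha() is tested first, so is_word never sees an empty string.
        return core.isalpha() and is_word(core)

    tokens = text.split()
    # groupby starts a new group each time looks_english flips.
    return [" ".join(run) for _, run in groupby(tokens, key=looks_english)]


# Hypothetical stand-in vocabulary; with pyenchant installed,
# d.check could be passed as is_word instead.
KNOWN = {
    "taking", "the", "derivative", "of", "we", "can", "see", "that",
    "always", "get", "which", "is", "same", "as", "double", "integral",
}

sample = r"Taking the derivative of: f(x) = \int_{0}^{1} z^3, we can see that we always get x^2 = y_2 + 4 which is the same as taking the double integral of g(x)"

chunks = split_by_english(sample, lambda w: w.lower() in KNOWN)
print(chunks)
```

With this stand-in vocabulary the sample splits into six chunks that match the list in the question, apart from the trailing spaces. A real dictionary may accept or reject different tokens (single letters such as "x", for instance), and words containing apostrophes or hyphens would need a looser test than isalpha().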

Hope that this helps!

Sushil