Text Preprocessing Python

Question

I have a text input='The quick brown fox. Jumped over the lazy dog.' and I want the output to be as below:

[['quick', 'brown', 'fox', '.'], ['jumped', 'lazy', 'dog', '.']]

Please let me know how to do this.

So far I have only split the text into words, and I am not sure what to do next:

import nltk 
from nltk.tokenize import word_tokenize 

input="The quick brown fox. Jumped over the lazy dog." 
tokens=word_tokenize(input) 
print(tokens)

Split the sentence first based on the delimiter period ('.'), You'll have to split again with delimiter comma (',') over each element from the output of the first split. You'll have to convert the case as well from what I see in your expected output. — Ronnie, Mar 25 '20 at 08:47
What did you try btw? What went wrong? If you could include those details in the question, it'll help others to help you better. — Ronnie, Mar 25 '20 at 08:47
Your expected output is missing some of the words 'The' (twice) and 'over'. Is there a specific reason why? If you elaborate on what you have attempted (post some code) it's easier for us to help. — wstk, Mar 25 '20 at 08:48
1)yeah, the output should be the way which I have provided in the question(the and over are missing) — rahul Bhattiprolu, Mar 25 '20 at 08:51
>>import nltk >>from nltk.tokenize import word_tokenize >>input='The quick brown fox. Jumped over the lazy dog.' >>tokens=word_tokenize(input) >>print(tokens) — rahul Bhattiprolu, Mar 25 '20 at 08:52
Possible dupe of [How to remove stop words using nltk or python](https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python) — Wiktor Stribiżew, Mar 25 '20 at 08:57

score 0 · Answer 1 · answered Mar 25 '20 at 09:02

There's a multitude of ways you could go about this, but let's build on what you've done so far.

So you split the sentence into words. I'll assume a plain text = text.split(" ") here rather than word_tokenize, so your list looks something like text = ["The", "quick", "brown", "fox.", "Jumped", "over", "the", "lazy", "dog."]

Now let's separate each period into its own token in a new list, new_list.

text = "The quick brown fox. Jumped over the lazy dog."
text = text.split(" ")  # Split on spaces; periods stay attached to their words
new_list = []  # New list we will write the words to

for word in text:
    if '.' in word:
        word = word.split('.')  # Here we assume period always comes after word
        new_list.append(word[0])
        new_list.append('.')
    else:
        new_list.append(word)
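If you'd rather not special-case the period, a regular expression can do the same split in one step. This is only a sketch using the standard re module, not what word_tokenize does internally; the pattern treats every non-space, non-word character as its own token, which is an assumption that happens to suit this input.

```python
import re

text = "The quick brown fox. Jumped over the lazy dog."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

The result is still a flat list with the original casing, the same shape the loop above produces.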

Now it appears you don't want words such as "The" or "over". For this, create another list, such as skip_words = ["The", "the", "over"], and drop those words from new_list.

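skip_words = ["The", "the", "over"]
# list.remove() deletes only the first match and raises ValueError if the
# word is absent, so rebuild the list without the skipped words instead
new_list = [word for word in new_list if word not in skip_words]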

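Now try printing new_list. You should get the flat list ['quick', 'brown', 'fox', '.', 'Jumped', 'lazy', 'dog', '.']. Note that it is not yet lowercased or grouped per sentence the way your expected output is.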
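To get all the way to the nested, lowercased output from the question, the same ideas can be folded into one pass. This is a rough sketch under a few assumptions: STOP_WORDS and preprocess are names I made up for illustration, the stop list contains only the two words missing from your expected output, and a period is the only sentence terminator.

```python
STOP_WORDS = {"the", "over"}  # hypothetical stop list, taken from the expected output


def preprocess(text):
    """Return one list of lowercased tokens per sentence, minus stop words."""
    sentences = []
    current = []
    for raw in text.split():
        word = raw.lower()
        ends_sentence = word.endswith(".")
        word = word.rstrip(".")
        if word and word not in STOP_WORDS:
            current.append(word)
        if ends_sentence:
            # Keep the period as its own token, then start a new sentence
            current.append(".")
            sentences.append(current)
            current = []
    if current:  # trailing words with no closing period
        sentences.append(current)
    return sentences


print(preprocess("The quick brown fox. Jumped over the lazy dog."))
# [['quick', 'brown', 'fox', '.'], ['jumped', 'lazy', 'dog', '.']]
```

For real text you would probably want a proper sentence splitter and a fuller stop word list, such as the one discussed in the linked duplicate, but the structure stays the same.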
