Splitting the text into words in Python

Question

While preparing my data (a text file) for preprocessing, I am unable to split the text cleanly into words.

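import io
import re

# Read the whole file as UTF-8 text; the with-block closes it automatically.
with io.open("pg5200.txt", mode="r", encoding="utf-8") as f:
    text = f.read()

# Split on runs of non-word characters and show the first 100 entries.
words = re.split(r'\W+', text)
print(words[:100])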

After running the code above, I get an extra blank entry (" ") at the beginning of the list.

Why does this extra entry appear, and how can I remove it?

Thank You

See https://stackoverflow.com/questions/16922214/reading-a-text-file-and-splitting-it-into-single-words-in-python/16922328; it may help you. — Sahil Gupta, May 23 '18 at 06:35
Can you [edit] the question to show a small example of `pg5200.txt` that recreates your problem? — Martin Evans, May 23 '18 at 07:57
Have you tried using `re.findall` ? It seems to be more appropriate for your case: you could try `re.findall(r'\w+', text)`. — Laurent H., May 23 '18 at 08:06
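
To illustrate the `re.findall` suggestion: one likely cause, assuming the file begins with a non-word character (whitespace, punctuation, or a byte order mark), is that `re.split` returns an empty string for the text before the first separator. `re.findall(r'\w+', ...)` only collects matches, so it cannot produce empty entries. The `sample` string below is a made-up stand-in for the contents of `pg5200.txt`, not its actual text.

```python
import re

# Hypothetical stand-in for the file contents: it starts and ends
# with non-word characters, which is assumed to be the case here.
sample = "  Hello, world! This is a test."

# re.split keeps the (empty) pieces before the first separator
# and after the last one.
split_words = re.split(r"\W+", sample)

# re.findall returns only the runs of word characters.
found_words = re.findall(r"\w+", sample)

print(split_words)
print(found_words)
```

With this sample, `split_words` starts and ends with an empty string, while `found_words` contains only the six words.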

score 0 · Answer 1 · answered May 23 '18 at 06:37

You can use the string `strip` method to remove leading and trailing whitespace.

See the answer to "How do I trim whitespace?" for details.
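
A minimal sketch of how `strip` might be applied here, assuming the extra entry comes from whitespace at the start of the text. Note that `str.strip()` with no arguments removes only whitespace, so if the text starts with punctuation an empty string can still appear; filtering out empty strings after the split covers that case as well. The `sample` value is hypothetical, not taken from `pg5200.txt`.

```python
import re

# Hypothetical text with leading and trailing whitespace.
sample = "  \nfirst line of the file, second line\n"

# Without stripping, the split yields empty strings at both ends.
words_plain = re.split(r"\W+", sample)

# Stripping whitespace first removes them in this case.
words_stripped = re.split(r"\W+", sample.strip())

# Filtering works even when the text starts with punctuation.
words_filtered = [w for w in re.split(r"\W+", sample) if w]

print(words_plain)
print(words_stripped)
print(words_filtered)
```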

answered May 23 '18 at 06:37 by bkupfer
