While preparing the data (a text file) for preprocessing, I am not able to split the text file into words correctly.

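import io
import re

# Read the whole file as UTF-8 text; the file is closed automatically.
with io.open("pg5200.txt", mode="r", encoding="utf-8") as f:
    text = f.read()

# Split on runs of non-word characters.
words = re.split(r'\W+', text)
print(words[:100])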

After running the code above, I get an extra blank space (" ") at the beginning of the result.

Why does this extra space occur, and how can I remove it?

Thank You

joon
Siddhant
  • This may help: https://stackoverflow.com/questions/16922214/reading-a-text-file-and-splitting-it-into-single-words-in-python/16922328 – Sahil Gupta May 23 '18 at 06:35
  • Can you edit the question to show a small example of `pg5200.txt` that recreates your problem? – Martin Evans May 23 '18 at 07:57
  • Have you tried using `re.findall`? It seems more appropriate for your case: you could try `re.findall(r'\w+', text)`. – Laurent H. May 23 '18 at 08:06

1 Answer

You can use the string `strip()` method, which removes leading and trailing whitespace from the text before you split it.

See this answer: How do I trim whitespace?
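As a minimal sketch of what may be going on: `re.split` returns an empty string whenever the text begins (or ends) with a separator, so the extra first element is probably `''` rather than a real space. The sample text below is hypothetical, and the leading `\ufeff` (byte order mark) is only an assumption about what `pg5200.txt` starts with. Note that `strip()` removes whitespace only, so it would not remove a leading byte order mark or punctuation.

```python
import re

# Hypothetical sample standing in for the start of pg5200.txt.
# The leading "\ufeff" (byte order mark) is an assumption, not taken from the file.
text = "\ufeffThe quick brown fox."

# re.split yields an empty string when a separator sits at the very
# start or end of the input.
split_words = re.split(r"\W+", text)

# Option 1: drop the empty strings after splitting.
filtered_words = [w for w in split_words if w]

# Option 2: match runs of word characters directly, as suggested in the comments.
found_words = re.findall(r"\w+", text)

print(split_words)
print(found_words)
```

If the leading character does turn out to be a byte order mark, opening the file with `encoding="utf-8-sig"` instead of `"utf-8"` should remove it while reading, although text that ends in punctuation would still produce a trailing empty string with `re.split`.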

bkupfer