I'm running into a problem in Python. I'd like to take a .txt file containing many comments and split it into a list of words, splitting on all punctuation, spaces, and \n. When I run the following code, it splits my text in odd places. NOTE: For now I am only trying to split on periods and newlines as a test, but it often strips the last letter from words.

import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split('. | \n, nf)

print(wList)
polm23
John W

3 Answers

You need to close the string literal and change the regular expression slightly. \W+ matches any run of non-word characters, which covers punctuation, spaces, and newlines:

import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split(r'\W+', nf)

print(wList)
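
To see what \W+ does without needing the file, here is a minimal sketch on an in-memory string. The sample text is made up for illustration and stands in for the file contents; it uses the standard-library re module, which handles this pattern the same way.

```python
import re

# Hypothetical sample standing in for the contents of the .txt file.
sample = "Great service. Would come again!\nStaff was rude, sadly."

# \W+ matches one or more non-word characters (punctuation, spaces, newlines).
raw_words = re.split(r"\W+", sample)

# Text that ends with a delimiter yields an empty string at the end,
# so drop empty entries.
words = [w for w in raw_words if w]
print(words)
```

Note that re.split returns an empty string whenever the text starts or ends with a delimiter, which is why the filter step is there.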
Ajax1234
  • This is helpful, but do you know of a site that explains how escape sequences work in the .split() function? I think the problem is that I'm trying to remove punctuation and special characters but am not describing them properly. – John W Jul 21 '17 at 19:16
  • @JohnW Escaping a metacharacter with a backslash (for example \.) makes it match literally; unescaped, it keeps its special meaning. The pattern syntax passed to split is the same for all re functions. See here for more info on escape characters: http://www.regular-expressions.info/characters.html – Ajax1234 Jul 21 '17 at 19:25

You forgot to close the string, and you need a \ before the . so that it matches a literal period instead of any character.

import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split(r'\. |\n |\s', nf)

print(wList)
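
As a rough check of how this pattern behaves, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the file). It also shows one possible variant, a character class, in case periods directly before a newline should be dropped as well.

```python
import re

# Hypothetical sample; the real input would be the file contents.
sample = "Good food. Nice staff.\nWill return."

# The pattern above: period+space, newline+space, or any single whitespace.
# A period directly before a newline or at the end of the text is kept.
escaped = re.split(r"\. |\n |\s", sample)
print(escaped)

# Possible variant: treat any run of periods and whitespace as one delimiter,
# then drop the empty string left by the trailing period.
by_class = [w for w in re.split(r"[.\s]+", sample) if w]
print(by_class)
```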

See "Split Strings with Multiple Delimiters?" for more info.

Also, RichieHindle answers your question perfectly:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
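
One practical difference between the two approaches, sketched on a made-up sentence: splitting on \W+ breaks contractions apart and can leave an empty string at the end, while findall with [\w']+ keeps apostrophes inside words.

```python
import re

# Hypothetical sample containing contractions and trailing punctuation.
sample = "Don't stop, it's fine!"

# Splitting on non-word characters cuts at the apostrophes too.
split_words = re.split(r"\W+", sample)
print(split_words)

# Collecting runs of word characters and apostrophes keeps contractions whole.
found_words = re.findall(r"[\w']+", sample)
print(found_words)
```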
Jake
  • Thank you! I'll try that out. It's really useful to see why the Python interpreter is doing the things it does sometimes – John W Jul 21 '17 at 19:32
  • Yeah, as intuitive as Python is, it can sometimes be tricky. Hope everything ends up working for you! – Jake Jul 21 '17 at 19:33

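In a regex, an unescaped . matches any character (except a newline, by default). That is why your pattern eats the last letter of words: '. ' matches any character followed by a space. You have to escape it as \. to match a literal period.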
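
A minimal sketch of the difference, using a short made-up string rather than the file from the question:

```python
import re

# Hypothetical sample text.
sample = "one two. three"

# Unescaped: ". " means "any character followed by a space", so the "e"
# at the end of "one" is consumed together with the space after it.
unescaped = re.split(r". | \n", sample)
print(unescaped)

# Escaped: "\. " only matches a literal period followed by a space.
escaped = re.split(r"\. |\n", sample)
print(escaped)
```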

Jared Goguen