There are four separate bugs in your attempt.
First, you are only looping over pairs of characters. If a backslash occurs as the second character of a pair, you never notice it.
Second, when the number of characters is not even, you are attempting to inspect past the end of the string. This is what caused the error you are actually asking about. (`newtext[i:i+2]` only produces a single character when `i` is one less than the length of `newtext`; thus the assignment to two separate variables fails, because the expression only produces one value.)
Third, the newline is a single character. The sequence `\n` in a string in Python source code represents this character with a symbolic sequence which is two characters long, but in the string it represents, there is no backslash and no `n`, just a single character (also known as `\x0a` or `\u000a` aka LINE FEED).
Fourth, `isalpha()` is actually true for characters like `ß` and `ä` and `日`.
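The second, third, and fourth points are easy to confirm in an interactive session. The following is only a small sketch; the sample string `"abc"` and the variable names are made up for illustration.

```python
newtext = "abc"  # hypothetical sample with an odd number of characters

# Slicing past the end does not raise; it just yields a shorter string ...
last = newtext[2:4]
print(repr(last))  # 'c'

# ... so unpacking that slice into two names is what fails.
try:
    first, second = last
    unpack_failed = False
except ValueError as err:
    unpack_failed = True
    print("unpacking failed:", err)

# "\n" in source code is one character; a raw string keeps backslash and n.
newline_length = len("\n")
raw_length = len(r"\n")
print(newline_length, raw_length, ord("\n"))  # 1 2 10

# isalpha() is Unicode-aware, so it accepts far more than a-z.
non_ascii_alpha = [ch.isalpha() for ch in "ßä日"]
print(non_ascii_alpha)  # [True, True, True]
```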
A common arrangement for backslash processing and similar tasks is to implement a sliding window instead, so that you inspect the two characters starting at every character position in the string.
    # Still broken; looks for literal \ followed by n
    # Still broken: isalpha() is wrong for the use case
    def cleancorpus(self, corptext):
        # T is whatever collection of permitted characters your own code defines
        newtext = []
        skip = False
        for i in range(len(corptext)):
            if skip:
                skip = False
                continue
            op = corptext[i].lower()
            # Stylistically, use equality for both comparisons
            if op == "\\" and i < len(corptext)-1 and corptext[i+1] != "n":
                # Tell the next iteration to skip the next character, too
                skip = True
                continue
            elif op.isalpha() or op in T or op == ' ':
                newtext.append(op)
        return ''.join(newtext)
As a minor efficiency hack, we collect the new text into a list, and only join the pieces back into a string at the end. Appending to a list is quite a bit faster than appending to a string, so we avoid doing the latter inside the loop.
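For comparison, here is a minimal sketch of the same collect-then-join pattern with a plain membership test in place of `isalpha()` and the backslash lookahead. The name `clean_by_loop` and the contents of `ALLOWED` are assumptions on my part (ASCII letters, space, newline, and a handful of punctuation marks); adjust the set to whatever you actually want to keep.

```python
import string

# Assumption: these are the characters you want to keep.
ALLOWED = set(string.ascii_letters) | set(" ,.:\n#()!?'\"")


def clean_by_loop(corptext):
    """Keep only the characters in ALLOWED; drop everything else."""
    kept = []
    for ch in corptext:
        # Backslashes, digits, and non-ASCII letters are simply not in the set.
        if ch in ALLOWED:
            kept.append(ch)
    # Join once at the end instead of growing a string inside the loop.
    return "".join(kept)


print(repr(clean_by_loop("Héllo\\ wörld\n42!")))  # 'Hllo wrld\n!'
```

Because the newline is a single character, it needs no special handling: it is either in the set or it is not.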
But for your actual task, a much simpler solution is available:
    import re

    def cleancorpus(self, corptext):
        return re.sub(r'''[^ a-zA-Z,.:\n#()!?'"]''', '', corptext)
The `self` does not make sense outside of a class; this seems trivial enough that there should be no particular reason to want to encapsulate this into a class. But if you do, I suppose you could compile the regex in the `__init__` method and save it. Adapting from the answer you posted yourself,
    import re
    import urllib.request

    class CorpusReader:
        def __init__(self, URL):
            with urllib.request.urlopen(URL) as response:
                # .decode() produces a string from bytes
                # If you don't know the encoding, probably try UTF-8
                # then if that fails, figure out the _actual_ encoding
                text = response.read().decode()
                # You don't need to close when you use "with open(...)"
                # response.close
            # Compile once, before cleancorpus() needs it
            self.regex = re.compile(r'''[^ a-zA-Z,.:\n#()!?'"]''')
            self.text = self.cleancorpus(text)

        def cleancorpus(self, corptext):
            return self.regex.sub('', corptext).lower()
Your method was not doing anything useful with `text`; this saves it as `self.text` so that you can access it later. I kept the `.lower()` which was not in your requirements but was being used in your code; obviously, take it out if you don't want that.
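If you want to check what the pattern keeps and removes without fetching anything over the network, a standalone sketch like the following may help. The names `KEEP_ONLY` and `clean_text` and the sample string are invented for this illustration.

```python
import re

# Same character class as in the answer, compiled once at import time.
KEEP_ONLY = re.compile(r'''[^ a-zA-Z,.:\n#()!?'"]''')


def clean_text(text):
    """Remove every character outside the permitted set, then lowercase."""
    return KEEP_ONLY.sub('', text).lower()


# Contains a non-ASCII letter, digits, a literal backslash, and a tab.
sample = 'Héllo, World! 42\\\tOK\n'
print(repr(clean_text(sample)))  # 'hllo, world! ok\n'
```

Note that the newline survives because it is listed in the character class, while the literal backslash and the tab are removed like any other unlisted character.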
The argument to `decode` could be extracted from `response.headers['content-type']` but for a beginner, I suppose just hard-coding the expected encoding (if necessary) will be acceptable and sufficient.
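If you do want to try reading the encoding from the headers, one possible sketch follows. It relies on the headers object returned by `urllib.request.urlopen` being a subclass of `email.message.Message`, which provides `get_content_charset()`; the helper name `charset_from_headers` and the UTF-8 fallback are my own assumptions.

```python
from email.message import Message


def charset_from_headers(headers, default="utf-8"):
    """Return the charset named in Content-Type, or the default if absent."""
    # get_content_charset() returns the charset lowercased, or the
    # fallback value when no charset parameter is present.
    return headers.get_content_charset(default)


# Simulate response.headers without touching the network.
fake_headers = Message()
fake_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
print(charset_from_headers(fake_headers))  # iso-8859-1
print(charset_from_headers(Message()))     # utf-8
```

Inside the `with` block, this would be used roughly as `text = response.read().decode(charset_from_headers(response.headers))`. Servers sometimes declare the wrong charset, so treat the result as a hint.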