
I'm trying to filter a corpus in Python. I converted it to a string, and I only need to keep English letters and any character x in T = [',', '.', ':', '\n', '#', '(', ')', '!', '?', "'", '"']

I tried several methods and couldn't succeed in keeping the special character \n along with the others.

One thing I've tried:

def cleancorpus(self, corptext):
   newtext=corptext
   newtext=newtext.lower()
   
   for i in range(0, len(newtext), 2):
       op, code = newtext[i:i+2]
       if(op=="\\" and code not in {"n"}):
           newtext=newtext.replace(op,"")
   newtext=''.join(x for x in newtext if x.isalpha() or x in T or x==' ')
   return newtext

However, it raises ValueError: not enough values to unpack (expected 2, got 1). I've also tried iterating through the string character by character, but my main issue is with [\n, ", '].

launax
  • Try to step through your program step by step: http://pythontutor.com – deceze Dec 16 '21 at 12:50
  • Note that an [escaped character](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals) like `"\n"` is still just _one_ character, not two! – Jens Dec 16 '21 at 12:52
  • @Jens I treated it as one character at first but that didn't work out – launax Dec 16 '21 at 12:53
  • Also note that if the string has an odd number of characters, the last piece will have just the one character and then the `op, code = ...` will have nothing to put in `code`; that's what the "not enough values to unpack" error means – Jiří Baum Dec 16 '21 at 12:53
  • `import re; newtext = re.sub(r'''[^ a-zA-Z,.:\n#()!?'"]''', '', corpustext)` – tripleee Dec 16 '21 at 12:59
  • The last two strings in `T` (`"\'"` and `'\"'`) should be `"'"` and `'"'`, methinks. – Jens Dec 16 '21 at 13:01
  • What are "English letters"? It might be naïve to think that the english language only contains the ASCII characters. – Matthias Dec 16 '21 at 13:23
  • @Matthias Quite risqué, yes. – deceze Dec 16 '21 at 13:25

2 Answers


There are four separate bugs in your attempt.

First, you are only looping over non-overlapping pairs of characters. If a backslash occurs as the second character of a pair, you will not notice it.

Second, when the number of characters is not even, you are attempting to inspect past the end of the string. This is what caused the error you are actually asking about. (newtext[i:i+2] only produces a single character when i is one less than the length of newtext; thus the assignment to two separate variables fails, because the expression only produces one value.)
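
A minimal sketch of that failure in isolation, using a made-up three-character string (the name `text` is purely illustrative):

```python
text = "abc"  # odd length; illustrative value only

# Stepping by 2 yields the slices "ab" and then just "c"
chunks = [text[i:i+2] for i in range(0, len(text), 2)]
print(chunks)

try:
    op, code = chunks[-1]  # a one-character string cannot fill two names
except ValueError as err:
    message = str(err)
    print(message)
```

The last slice is shorter than the others because slicing past the end of a string silently truncates instead of raising an error.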

Third, the newline is a single character. The sequence \n in a string in Python source code represents this character with a symbolic sequence which is two characters long, but in the string it represents, there is no backslash and no n, just a single character (also known as \x0a or \u000a aka LINE FEED).
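
A quick way to convince yourself of this is to compare the escape sequence with a raw string, which really does contain a backslash and an `n` (the variable names here are only for illustration):

```python
newline = "\n"   # escape sequence in source code: one character
literal = r"\n"  # raw string: a backslash followed by the letter n

print(len(newline))                   # length of the real newline
print(len(literal))                   # length of backslash + n
print(newline == "\x0a" == "\u000a")  # all spellings of LINE FEED
print("\\" in newline)                # is there a backslash in it?
```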

Fourth, isalpha() is actually true for non-ASCII letters such as ß and ä, so it keeps more than just the English letters.
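
If "English letters" is taken to mean the ASCII letters only (an assumption; the question does not define it), one possible sketch is to test membership in an explicit set instead of calling `isalpha()`. The names `ALLOWED` and `keep_allowed` are hypothetical, and `T` is copied from the question:

```python
import string

# Characters to keep besides letters, copied from the question
T = [',', '.', ':', '\n', '#', '(', ')', '!', '?', "'", '"']

# ASCII letters only, plus the listed punctuation and the space
ALLOWED = set(string.ascii_letters) | set(T) | {' '}

def keep_allowed(text):
    """Drop every character that is not in ALLOWED (hypothetical helper)."""
    return ''.join(ch for ch in text if ch in ALLOWED)

# isalpha() accepts these, which is why it is too permissive here
print('ß'.isalpha(), 'ä'.isalpha())

print(repr(keep_allowed('Straße, naïve?\n')))
```

Note that this drops accented letters entirely; it does not transliterate them.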

A common arrangement for backslash processing and the like is to implement a sliding window instead, so that you inspect the two characters starting at every character position in the string.

def cleancorpus(self, corptext):
   # Still broken; looks for literal \ followed by n
   # Still broken: isalpha() is wrong for the use case
   newtext = []
   skip = False
   for i in range(len(corptext)):
       if skip:
           skip = False
           continue
       op = corptext[i].lower()
       # Stylistically, use equality for both comparisons
       if op == "\\" and i < len(corptext)-1 and corptext[i+1] != "n":
           # Tell the next iteration to skip the next character, too
           skip = True
           continue
       elif op.isalpha() or op in T or op == ' ':
           newtext.append(op)
   return ''.join(newtext)

As a minor efficiency hack, we collect the kept characters in a list, and only join them back into a string at the end. Appending to a list is quite a bit faster than appending to a string, so we avoid doing the latter inside the loop.

But for your actual task, a much simpler solution is available:

import re

def cleancorpus(self, corptext):
    return re.sub(r'''[^ a-zA-Z,.:\n#()!?'"]''', '', corptext)
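
As a rough usage sketch, here is the same character class applied to a made-up sample string. The names `PATTERN`, `clean` and `sample` are hypothetical, and the helper is written as a plain function without `self` so that it runs on its own:

```python
import re

# Same character class as above: anything NOT listed is removed
PATTERN = re.compile(r'''[^ a-zA-Z,.:\n#()!?'"]''')

def clean(text):
    """Remove every character outside the allowed set (hypothetical helper)."""
    return PATTERN.sub('', text)

# Contains a real newline in the middle and a literal backslash + n at the end
sample = 'Hello, "World"!\n100% café #1 [ok]\\n'
print(repr(clean(sample)))
```

The real newline survives, while the digits, the `%`, the `é`, the brackets and the literal backslash are removed. The `n` after the backslash stays because it is an ordinary letter.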

The self does not make sense outside of a class; this seems trivial enough that there should be no particular reason to want to encapsulate this into a class. But if you do, I suppose you could compile the regex in the __init__ method and save it. Adapting from the answer you posted yourself,

import re
import urllib.request

class CorpusReader:
    def __init__(self, URL):
        with urllib.request.urlopen(URL) as response:
            # .decode() produces a string from bytes
            # If you don't know the encoding, probably try UTF-8
            # then if that fails, figure out the _actual_ encoding
            text = response.read().decode()
        # You don't need to close when you use "with open(...)"
        # response.close
        self.regex = re.compile(r'''[^ a-zA-Z,.:\n#()!?'"]''')
        self.text = self.cleancorpus(text)

    def cleancorpus(self, corptext):
        # Use the compiled pattern's own sub() method
        return self.regex.sub('', corptext).lower()

Your method was not doing anything useful with text; this saves it as self.text so that you can access it later. I kept the .lower() which was not in your requirements but was being used in your code; obviously, take it out if you don't want that.

The argument to decode could be extracted from response.headers['content-type'] but for a beginner, I suppose just hard-coding the expected encoding (if necessary) will be acceptable and sufficient.
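
For the curious, a minimal sketch of that extraction might look like the following. It relies on `response.headers` from `urllib` being an `email.message.Message` subclass, which provides `get_content_charset()`. The helper name `decode_body` and the UTF-8 fallback are assumptions, and a hand-built `Message` stands in for a real response so the example runs without network access:

```python
from email.message import Message

def decode_body(raw, headers, fallback='utf-8'):
    """Decode raw bytes using the charset from the Content-Type header.

    Hypothetical helper; falls back to `fallback` when the header
    is missing or names no charset.
    """
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset)

# Stand-in for response.headers (assumption: no real request is made here)
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'

print(decode_body(b'caf\xe9', headers))
```

Inside `__init__` this would be called as `decode_body(response.read(), response.headers)`. A server can still declare the wrong charset, so this is a convenience and not a guarantee.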

tripleee

I solved this by treating the corpus as a bytes-like object:

class CorpusReader:
    
    def __init__(self, URL):
        with urllib.request.urlopen(URL) as response:
            text = response.read()
        response.close
        text = self.cleancorpus(text)
        
    def cleancorpus(self, corptext):
       newtext=corptext
       newtext=newtext.lower()
       pattern = bytes(b'''[^ a-z,.:\n#()!?'"]''')
       newtext= re.sub(pattern, b'', newtext)
       return newtext
launax
  • It is unclear why you think you need `bytes` for this. The `newtext` you produce here is now not a string, which seems wrong, or at the very least quite inconvenient. It also doesn't really contribute to the solution; you can easily perform regex substitutions on text just as well (more easily, in fact). – tripleee Dec 20 '21 at 07:19