1

I seem to be having a bit of an issue stripping punctuation from a string in Python. Here, I'm given a text file (specifically a book from Project Gutenberg) and a list of stopwords. I want to return a dictionary of the 10 most commonly used words. Unfortunately, I keep getting one hiccup in my returned dictionary.

import sys
import collections
from string import punctuation
import operator

#should return a string without punctuation
def strip_punc(s):
    return ''.join(c for c in s if c not in punctuation)

def word_cloud(infile, stopwordsfile):

    wordcount = {}

    #Reads the stopwords into a list
    stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]


    #reads data from the text file into a list
    lines = []
    with open(infile) as f:
        lines = f.readlines()
        lines = [line.split() for line in lines]

    #does the wordcount
    for line in lines:
        for word in line:
            word = strip_punc(word).lower()
            if word not in stopwords:
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1

    #sorts the dictionary, grabs 10 most common words
    output = dict(sorted(wordcount.items(),
                  key=operator.itemgetter(1), reverse=True)[:10])

    print(output)


if __name__=='__main__':

    try:

        word_cloud(sys.argv[1], sys.argv[2])

    except Exception as e:

        print('An exception has occurred:')
        print(e)
        print('Try running as python3 word_cloud.py <input-text> <stopwords>')

This will print out

{'said': 659, 'mr': 606, 'one': 418, '“i': 416, 'lorry': 322, 'upon': 288, 'will': 276, 'defarge': 268, 'man': 264, 'little': 263}

The '“i' entry shouldn't be there. I don't understand why the leading quote isn't eliminated in my helper function.

Thanks in advance.

mrantry

4 Answers

5

The character “ (U+201C, left double quotation mark) is not the ASCII double quote " (U+0022).

string.punctuation only includes the following ASCII characters:

In [1]: import string

In [2]: string.punctuation
Out[2]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

so you will need to augment the list of characters you are stripping.

Something like the following should accomplish what you need:

extended_punc = punctuation + '“' #  and any other characters you need to strip

def strip_punc(s):
    return ''.join(c for c in s if c not in extended_punc)
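
As a quick check of the point above, here is a minimal sketch showing that the curly quote survives string.punctuation but not an extended set. The sample word comes from the question's output; the name CURLY_QUOTES and the optional chars parameter are made up for illustration.

```python
from string import punctuation

# Hypothetical helper set: the four common curly quotes.
CURLY_QUOTES = '“”‘’'
extended_punc = punctuation + CURLY_QUOTES

def strip_punc(s, chars=extended_punc):
    """Drop every character of s that appears in chars."""
    return ''.join(c for c in s if c not in chars)

print('“' in punctuation)             # False
print(strip_punc('“I', punctuation))  # “I  (ASCII-only set misses the quote)
print(strip_punc('“I'))               # I
```

You would still have to extend the set by hand for any other non-ASCII punctuation the text contains.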

Alternatively, you could use the package unidecode to ASCII-fy your text and not worry about creating a list of unicode characters you may need to handle:

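from string import punctuation
from unidecode import unidecode  # third-party: pip install unidecode

def strip_punc(s):
    # Python 3: s is already a str, so no decode/encode step is needed.
    # unidecode maps e.g. the curly quote “ to the ASCII quote ".
    s = unidecode(s)
    return ''.join(c for c in s if c not in punctuation)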
Daniel Corin
1

As stated in other answers, the problem is that string.punctuation only contains ASCII characters, so the typographical ("fancy") quotes like “ and ” are missing, among many others.

You could replace your strip_punc function with the following:

import re

def strip_punc(s):
    '''
    Remove all punctuation characters.
    '''
    return re.sub(r'[^\w\s]', '', s)

This approach uses the re module. The regular expression matches any character that is neither alphanumeric (\w) nor whitespace (\s) and replaces it with the empty string (i.e. deletes it).

This solution takes advantage of the fact that, for str patterns in Python 3, the "special sequences" \w and \s are Unicode-aware, i.e. they work equally well for characters of any script, not only ASCII:

>>> strip_punc("I said “naïve”, didn't I!")
'I said naïve didnt I'

Please note that \w includes the underscore (_), because it is considered "alphanumeric". If you want to strip it as well, change the pattern to:

r'[^\w\s]|_'
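
A minimal sketch comparing the two patterns side by side. The keep_underscore parameter and the sample string are made up for illustration and are not part of the answer above.

```python
import re

def strip_punc(s, keep_underscore=True):
    """Remove punctuation; optionally remove underscores as well."""
    # \w counts the underscore as a word character, so it needs its own branch.
    pattern = r'[^\w\s]' if keep_underscore else r'[^\w\s]|_'
    return re.sub(pattern, '', s)

sample = '“snake_case”, naïve!'
print(strip_punc(sample))                         # snake_case naïve
print(strip_punc(sample, keep_underscore=False))  # snakecase naïve
```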
lenz
0

I'd change the logic in the strip_punc function:

from string import ascii_letters

def strip_punc(word):
    return ''.join(c for c in word if c in ascii_letters)

This logic is an explicit allow rather than an explicit deny: you let in only the characters you want instead of blocking only the characters you know you don't want, so edge cases you didn't think about are dropped rather than let through.
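
To make the allow-list idea concrete, here is a minimal sketch comparing ascii_letters with str.isalpha(). The latter is a Unicode-aware stand-in that is not part of this answer, and both function names are made up for illustration.

```python
from string import ascii_letters

def keep_ascii_letters(word):
    """Allow-list limited to a-z and A-Z."""
    return ''.join(c for c in word if c in ascii_letters)

def keep_letters(word):
    """Allow-list based on str.isalpha(), which also accepts non-ASCII letters."""
    return ''.join(c for c in word if c.isalpha())

print(keep_ascii_letters('“naïve”'))  # nave
print(keep_letters('“naïve”'))        # naïve
```

Both variants also drop digits, which may or may not be what you want for a word count.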

Also see the related question "Best way to strip punctuation from a string in Python".

Adam
  • The OP is using Python 3 (see the `print()` function, and – more clearly – the string `'“i'`, which would be `u'“i'` in Python 2). In Python 3, you must change this to `from string import ascii_letters` – which makes it apparent that this approach will fail for an input string like `"naïve"`, which contains a non-ASCII letter. – lenz Jul 11 '17 at 08:50
  • Well then wouldn't it become a matter of what is a shorter list to not hardcode? i.e. all the ascii-punc + non-ascii-punc vs ascii-letters + non-ascii-letters. – Adam Jul 11 '17 at 08:58
  • There's no need for hard-coded lists here. If a character is punctuation or not is defined by a Unicode property, which you can access eg. through the stdlib `unicodedata` module. – lenz Jul 11 '17 at 09:10
  • I'm not understanding how you are checking if a value is punctuation or not via that module so feel free to post an answer. – Adam Jul 11 '17 at 09:23
  • Yeah, the docs are not overly helpful here. But it's easy: Call `unicodedata.category` with a single character as argument. If the return value starts with "P", then it's a punctuation character. – I posted an answer, but using a different approach. – lenz Jul 11 '17 at 09:56
  • Being based in the US I don't really run into many of those characters. Your solution looks like it might be the best. – Adam Jul 11 '17 at 09:58
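
The unicodedata approach described in the comments above could be sketched roughly as follows. This is an illustration under stated assumptions, not code posted by either commenter, and the helper name is_punctuation is made up.

```python
import unicodedata

def is_punctuation(char):
    """True if the character's Unicode general category starts with 'P'."""
    return unicodedata.category(char).startswith('P')

def strip_punc(s):
    """Remove every character that Unicode classifies as punctuation."""
    return ''.join(c for c in s if not is_punctuation(c))

print(unicodedata.category('“'))                # Pi (initial quote punctuation)
print(strip_punc("I said “naïve”, didn't I!"))  # I said naïve didnt I
```

Note that characters such as $ and + belong to the symbol categories (starting with "S"), so this keeps them even though string.punctuation lists them.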
0

Without knowing what is in the stopwords list, the fastest solution is to add this:

#Reads the stopwords into a list
stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]
stopwords.append('“i')

Then continue with the rest of your code.

Luis Miguel
  • And then you find `'i”'` in the text, and you add it to the stopwords list. And then you find `'“a'` in the text, and you add it to the stopwords list. And then you find `'it”'` in the text, and you add it to the stopwords list. And then... – lenz Jul 11 '17 at 09:14
  • @lenz highly unlikely that they will show up in the top 10; but aside from that, your solution is very good, will take a look at your profile :-). Thanks. – Luis Miguel Jul 11 '17 at 12:41
  • Well, fair enough if you really only care about the top 10. So, revenge down-vote? Or you actually think my answer is bad? – lenz Jul 11 '17 at 12:47
  • No hard feelings, I just wanted to know... I actually forgot about the top-10 thing when I down-voted yours, but now I can't un-down-vote anymore. Cheers! – lenz Jul 11 '17 at 12:56
  • Don't worry @lenz. I won't cry :-) Cheers – Luis Miguel Jul 11 '17 at 12:57