Removing punctuation from my nested and tokenized list

Question

I am trying to remove the punctuation from my nested and tokenized list. I have tried several different approaches to this, but to no avail. My most recent attempt looks like this:

def tokenizeNestedList(listToTokenize):
    flat_list = [item.lower() for sublist in paragraphs_no_guten for item in sublist]
    tokenList = []
    for sentence in flat_list:
        sentence.translate(str.maketrans(",",string.punctuation))
        tokenList.append(nltk.word_tokenize(sentence))
    return tokenList

As you can see, I'm trying to remove the punctuation as I tokenize the list, since the list is being traversed anyway when my function is called. However, with this approach I get the error

ValueError: the first two maketrans arguments must have equal length

I sort of understand why this happens. Running my code without trying to remove punctuation and printing the first 10 elements gives me this (so you have an idea of what I'm working with):

[[], ['title', ':', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of', 'nations'], ['author', ':', 'adam', 'smith'], ['posting', 'date', ':', 'february', '28', ',', '2009', '[', 'ebook', '#', '3300', ']'], ['release', 'date', ':', 'april', ',', '2002'], ['[', 'last', 'updated', ':', 'june', '5', ',', '2011', ']'], ['language', ':', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]

Any and all advice appreciated.

Dani Mesejo · Answer 1 · 2018-10-31T11:07:43.243

Assuming each punctuation mark is a separate token, you could do something like this:

import string

sentences = [[], ['title', ':', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of',
             'nations'], ['author', ':', 'adam', 'smith'],
             ['posting', 'date', ':', 'february', '28', ',', '2009', '[', 'ebook', '#', '3300', ']'],
             ['release', 'date', ':', 'april', ',', '2002'], ['[', 'last', 'updated', ':', 'june', '5', ',', '2011', ']'],
             ['language', ':', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]


result = [list(filter(lambda x: x not in string.punctuation, sentence)) for sentence in sentences]

print(result)

Output

[[], ['title', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of', 'nations'], ['author', 'adam', 'smith'], ['posting', 'date', 'february', '28', '2009', 'ebook', '3300'], ['release', 'date', 'april', '2002'], ['last', 'updated', 'june', '5', '2011'], ['language', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]

The idea is to use filter to remove the tokens that are punctuation. Because filter returns an iterator, use list to convert the result back to a list. You could also use the equivalent list comprehension:

result = [[token for token in sentence if token not in string.punctuation] for sentence in sentences]
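One caveat worth sketching: `token not in string.punctuation` is a substring test against a single string, so it catches single-character tokens such as ':' but not multi-character tokens such as '...' or the '``' and "''" quote tokens that nltk.word_tokenize can emit. The variant below is a minimal sketch under that assumption; the helper names `is_punctuation` and `strip_punctuation_tokens` are hypothetical and not part of the answer above.

```python
import string

# Set of single punctuation characters for fast membership tests.
PUNCT = set(string.punctuation)

def is_punctuation(token):
    """Return True when the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT for ch in token)

def strip_punctuation_tokens(sentences):
    """Drop punctuation-only tokens while keeping the nested list structure."""
    return [[tok for tok in sentence if not is_punctuation(tok)] for sentence in sentences]

sentences = [["he", "said", "``", "wait", "...", "''"],
             ["release", "date", ":", "april", ",", "2002"],
             []]
cleaned = strip_punctuation_tokens(sentences)
print(cleaned)
```

Empty sentences are kept here on purpose, so the result lines up index by index with the input.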

This did what I wanted. Thank you! I added a filter to the list comprehension method to get exactly what I wanted. For anyone else reading this at a later time, I put a filter on it to remove the empty lists: result = list(filter(None, result)) – UndisclosedCurtain, Oct 31 '18 at 15:06

score 1 · Accepted Answer · answered Oct 31 '18 at 14:18


For this to work as written, you need to run Python 3.x. Here, b contains the example nested list that you provided.

import string
# Remove empty lists
b = [x for x in b if x]
# Make flat list
b = [x for sbl in b for x in sbl]
# Define translation
translator = str.maketrans('', '', string.punctuation)
# Apply translation
b = [x.translate(translator) for x in b]
# Remove empty strings
b = list(filter(None, b))

A reference for why it didn't work before: Python 2 maketrans() function doesn't work with Unicode: "the arguments are different lengths" when they actually are
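On Python 3, the error from the question can also be reproduced directly, which may help explain it. The sketch below assumes the asker's call was `str.maketrans(",", string.punctuation)`: the two-argument form maps characters one to one, so a 1-character string paired with the 32-character `string.punctuation` fails. It also shows that `str.translate` returns a new string, so the result has to be assigned. The sample sentence is made up for illustration.

```python
import string

# Two-argument form: characters are mapped one to one, so both strings must have equal length.
try:
    str.maketrans(",", string.punctuation)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Three-argument form: every character in the third argument is deleted.
translator = str.maketrans("", "", string.punctuation)

sentence = "posting date: february 28, 2009 [ebook #3300]"
# translate() does not modify the string in place; keep the returned value.
cleaned = sentence.translate(translator)
print(cleaned)
```

In the question's loop, the return value of `sentence.translate(...)` was discarded, so even with a valid table the tokenized sentence would still have contained punctuation.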

answered Oct 31 '18 at 14:18

Konstantin Grigorov


This does almost what I want it to. Is there a way to reverse the flattening of the list? I need to keep the words in the lists I have previously determined. – UndisclosedCurtain Oct 31 '18 at 14:48
Try:

b = [x for x in b if x]
# Define translation
translator = str.maketrans('', '', string.punctuation)
# Apply translation
b = [[word.translate(translator) for word in sbl] for sbl in b]
# Remove empty strings
b = [list(filter(None, sbl)) for sbl in b]

– Konstantin Grigorov Oct 31 '18 at 15:42
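For comparison, here is a self-contained sketch of the nesting-preserving translate approach. Note that translate deletes punctuation inside tokens too (so 'e-mail' becomes 'email'), whereas filtering only drops tokens that are punctuation on their own. The function name `strip_punctuation_nested` and the sample data are hypothetical, and unlike the snippet in the comment this version removes empty sentences after stripping rather than before.

```python
import string

# Translation table that deletes every punctuation character.
translator = str.maketrans("", "", string.punctuation)

def strip_punctuation_nested(sentences):
    """Delete punctuation characters in every token, then drop empty tokens and empty sentences."""
    stripped = [[tok.translate(translator) for tok in sent] for sent in sentences]
    without_empty_tokens = [[tok for tok in sent if tok] for sent in stripped]
    return [sent for sent in without_empty_tokens if sent]

sentences = [[], ["e-mail", ":", "n't", "u.s."], ["[", "]"]]
result = strip_punctuation_nested(sentences)
print(result)
```

Whether in-token punctuation should be removed depends on the downstream task; for contractions or abbreviations, the filtering approach may be the safer choice.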
