
The NLTK book has a couple of examples of word counts, but in reality they are not word counts but token counts. For instance, Chapter 1, Counting Vocabulary says that the following gives a word count:

text = nltk.Text(tokens)
len(text)

However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is:

word_average_length = len(string_of_text) / len(text)

However, this would be off because:

  1. len(string_of_text) is a character count, including spaces
  2. len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.
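
To make the mismatch concrete, here is a small standard-library sketch (the token list is hypothetical, standing in for `nltk.word_tokenize` output):

```python
# Hypothetical token list, standing in for the output of nltk.word_tokenize.
tokens = ['This', 'is', 'a', 'sentence', ',', 'and', 'another', '.']

token_count = len(tokens)  # 8 -- includes the comma and the period
words = [t for t in tokens if any(c.isalnum() for c in t)]
word_count = len(words)    # 6 -- punctuation tokens filtered out

avg_word_length = sum(len(w) for w in words) / word_count
```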

Am I missing something here? This must be a very common NLP task...

Zach

4 Answers


Tokenization with nltk

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It includes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)

Returns

['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
petra
  • hey, keep in mind that nltk counts "it's" as two words. Try with text = "Hi, it's me". Output is ['Hi', 'it', 's', 'me'] – dallonsi Jun 21 '23 at 13:57
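
If contractions should stay together, one option (a sketch, not the pattern from the answer above) is to make the apostrophe part explicit. Since `RegexpTokenizer` essentially applies its pattern with `re.findall`, the behavior can be previewed with the standard library:

```python
import re

# Optional group keeps contractions such as "it's" as a single token.
# RegexpTokenizer(r"\w+(?:'\w+)?") would tokenize the same way.
pattern = r"\w+(?:'\w+)?"
print(re.findall(pattern, "Hi, it's me"))  # ['Hi', "it's", 'me']
```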

Removing Punctuation

Use a regular expression to filter out the punctuation

import re
from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})

Average Number of Characters

Sum the lengths of each word. Divide by the number of words.

>>> float(sum(map(len, filtered))) / len(filtered)
3.75

Or you could make use of the counts you already did to prevent some re-computation. This multiplies the length of the word by the number of times we saw it, then sums all of that up.

>>> float(sum(len(w) * c for w, c in counts.items())) / len(filtered)
3.75
dhg
  • So the NLTK doesn't have any functions for these operations? – Zach May 20 '12 at 21:14
  • Alternatively, you can use `re.split()` on punctuation and whitespace. – Joel Cornett May 20 '12 at 21:43
  • @Joel: That would cause problems for punctuation that is embedded inside of words (eg, `U.S.`). – dhg May 20 '12 at 21:46
  • @dhg: `U.S.` pretty ambiguous, but I see what you're saying. Out of curiosity, is there any reason you're not using `re.findall()` ? – Joel Cornett May 20 '12 at 22:12
  • @Joel: 1) I meant that if you split the word `U.S.` on punctuation, you would get the two words `U` and `S`, and that is wrong. 2) `findall` would work in this particular case, but the way I've written it, you can use the regex to define exactly what it means to be a "punctuation token" (perhaps in a more complex way than I have). – dhg May 20 '12 at 22:19

Removing Punctuation (with no regex)

Use the same solution as dhg, but test that a given token is alphanumeric instead of using a regex pattern.

from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> filtered = [w for w in text if w.isalnum()]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})

Advantages:

  • Works better with non-English languages, since `"À".isalnum()` is `True` while `bool(nonPunct.match("à"))` is `False` (an "à" is not a punctuation mark, at least in French).
  • Does not need to use the re package.
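
A quick check of that first point, using the `nonPunct` pattern from dhg's answer:

```python
import re

# The ASCII-only character class from the regex-based answer.
nonPunct = re.compile('.*[A-Za-z0-9].*')

print("À".isalnum())              # True  -- str.isalnum is Unicode-aware
print(bool(nonPunct.match("à")))  # False -- "à" is outside [A-Za-z0-9]
```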
Adrien Pacifico

Removing punctuation

from string import punctuation

punctuations = list(punctuation)
punctuations.extend(["''", "--", "``"])  # tokens NLTK emits for quotes/dashes
text = [word for word in text if word not in punctuations]

The average number of characters per word in a text

from collections import Counter
from nltk import word_tokenize

word_count = Counter(word_tokenize(text))  # text is the raw string here
total_words = sum(word_count.values())
sum(len(x) * y for x, y in word_count.items()) / total_words
rad15f