0

Hi, I have been playing with a simple program that reads in text and identifies keywords whose initial letter is capitalised. The issue I am having is that the program does not remove punctuation from words. What I mean is that "Frodo", "Frodo." and "Frodo," come up as different entries rather than the same one. I tried using import string and playing around with punctuation, but it did not work.

Below is my code. The text I used was from http://www.angelfire.com/rings/theroaddownloads/fotr.pdf (copied into a text file called novel.txt). Thanks again.

by_word = {}
with open ('novel.txt') as f:
  for line in f:
    for word in line.strip().split():
      if word[0].isupper():
        if word in by_word:
          by_word[word] += 1
        else:
          by_word[word] = 1

by_count = []
for word in by_word:
  by_count.append((by_word[word], word))

by_count.sort()
by_count.reverse()

for count, word in by_count[:100]:
  print(count, word)
  • Possible duplicate of [Best way to strip punctuation from a string in Python](http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python) – elethan Apr 24 '17 at 02:12
  • I tried the above solution first, but it didn't seem to work with my implementation; I may have been doing it wrong. – Joshua Robertson Apr 24 '17 at 03:20

2 Answers

1

I hope the code below works for you as expected:

import string

# Set of punctuation characters to drop from each word
exclude = set(string.punctuation)

by_word = {}
with open('novel.txt') as f:
  for line in f:
    for word in line.strip().split():
      if word[0].isupper():
        # Remove every punctuation character, so "Frodo," and "Frodo." both count as "Frodo"
        word = ''.join(char for char in word if char not in exclude)
        if word in by_word:
          by_word[word] += 1
        else:
          by_word[word] = 1

by_count = []
for word in by_word:
  by_count.append((by_word[word], word))

by_count.sort()
by_count.reverse()

for count, word in by_count[:100]:
  print(count, word)

It removes every one of the following characters (the contents of string.punctuation) from each capitalised word:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
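A related variant, offered only as a sketch: stripping punctuation from the ends of each word before testing the first letter also catches words that begin with a quotation mark, and it leaves internal apostrophes alone. The function name count_capitalised and the sample sentence are made up for illustration; collections.Counter stands in for the hand-written dictionary and sort.

```python
import string
from collections import Counter


def count_capitalised(lines):
    """Count words whose first letter is uppercase, ignoring punctuation at the edges."""
    counts = Counter()
    for line in lines:
        for raw in line.split():
            # Strip punctuation from both ends only, so "Frodo," and "Frodo." become "Frodo"
            word = raw.strip(string.punctuation)
            # Skip tokens that were pure punctuation and are now empty
            if word and word[0].isupper():
                counts[word] += 1
    return counts


# Hypothetical sample text standing in for novel.txt
sample = ['Frodo Frodo. Frodo, went with "Sam" to the river.']
counts = count_capitalised(sample)

# most_common returns (word, count) pairs, highest count first
for word, count in counts.most_common(100):
    print(count, word)
```

To run it on the file, an open file object can be passed in place of the sample list, since the function only iterates over lines.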

Claudio
0

Your code is fine. To strip punctuation, split using a regex:

for word in line.strip().split():

can be changed to

for word in re.split('[,.;]',line.strip()):

where the character class inside [] in the first argument lists the punctuation marks to split on. This uses the re module, so import re is needed at the top of the script; see https://docs.python.org/2/library/re.html#re.split.
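One caveat, shown here as a hedged sketch rather than a definitive fix: splitting only on punctuation leaves whitespace attached to the pieces and can produce empty strings, and indexing an empty string with word[0] raises IndexError. The sample line and the variable names below are invented for illustration; adding \s to the character class and filtering out empty pieces is one possible workaround.

```python
import re

# Hypothetical sample line
line = "Frodo, Sam. Merry; Pippin."

# Splitting on punctuation alone keeps the leading spaces and
# produces an empty string after the final full stop.
naive = re.split('[,.;]', line.strip())

# Assumed variant: split on runs of whitespace or punctuation,
# then drop any empty pieces before indexing into them.
words = [w for w in re.split(r'[\s,.;]+', line.strip()) if w]

# Safe now, because no element of words is empty
capitalised = [w for w in words if w[0].isupper()]
print(naive)
print(capitalised)
```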

Pbd
  • Thanks, that seems to have removed the punctuation, but I am now getting: Traceback (most recent call last): File "C:\Users\joshr\Desktop\Key-word reader.py", line 7, in if word[0].isupper(): IndexError: string index out of range. I understand what this error is saying, but surely, as each list is made up of only one object, there should be no issue with index 0. – Joshua Robertson Apr 24 '17 at 03:31