My intention is to take a large amount of text and first convert it to all lower case (which it does), then remove the punctuation marks from the text (which it does not), and finally print the frequency of each word. (It counts test. and test as two different words.)

from collections import Counter



text = """
Test. test test. Test Test test. 
""".lower().strip(".")



words = text.split()
counts = Counter(words)
print(counts)

Any help would be appreciated.

  • Because the periods are in the middle of the string. Use `.replace('.', '')`. Note that in your actual example, which might not be completely representative, you also get newline characters tacked onto your string, e.g. `test\n`. – roganjosh Apr 27 '17 at 18:04
  • Deleted my answer, @roganjosh already answered it. I suggest you post it as an answer. – zengr Apr 27 '17 at 18:08
  • @zengr I had started, but just reactivate your answer; it doesn't bother me, and yours is already typed :) – roganjosh Apr 27 '17 at 18:08

4 Answers


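You need .replace('.', '') in place of .strip('.'). strip only removes characters from the two ends of the string, so periods in the middle of the text are left untouched.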
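A quick way to see the difference, using the text from the question (the names stripped and replaced are only for illustration):

```python
from collections import Counter

text = """
Test. test test. Test Test test. 
""".lower()

# strip('.') only trims periods from the two ends of the string. This string
# starts and ends with a newline, so nothing is removed at all.
stripped = text.strip(".")

# replace('.', '') removes every period, wherever it appears.
replaced = text.replace(".", "")

print(Counter(stripped.split()))  # Counter({'test.': 3, 'test': 3})
print(Counter(replaced.split()))  # Counter({'test': 6})
```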

zengr
  • Thank you for the answer! How would you replace multiple things? For instance if I added !, ?, and ' to the list how would I replace all those things with nothing? –  Apr 27 '17 at 18:13
  • I guess I need to read more on .strip(). I thought it was only used for whitespace, based on the few articles I came across. Thanks for letting me know. – Mike - SMT Apr 27 '17 at 18:13
  • @DeathPox check out this link; it answers that question for [Multiple Strings](http://stackoverflow.com/questions/6116978/python-replace-multiple-strings) – Mike - SMT Apr 27 '17 at 18:15
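
For the follow-up about removing several characters at once (!, ?, ' and so on), one possible approach is a translation table built with str.maketrans and applied with str.translate. This is only a sketch: the sample text is made up, and using string.punctuation is an assumption about which characters should go. It also deletes apostrophes and hyphens inside words, so "don't" would become "dont".

```python
import string
from collections import Counter

# Made-up sample text with several kinds of punctuation.
text = "Test. test! test? Test 'Test' test."

# The third argument to str.maketrans lists the characters to delete;
# string.punctuation covers the ASCII punctuation characters.
remove_punct = str.maketrans("", "", string.punctuation)

cleaned = text.lower().translate(remove_punct)
counts = Counter(cleaned.split())
print(counts)  # Counter({'test': 6})
```

To remove only a few specific characters, pass them instead of string.punctuation, e.g. str.maketrans("", "", ".!?'").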

You can split the text into a list of words and then strip the punctuation from each word, or use roganjosh's suggestion, which is to use .replace('.', ''):

Way 1:

from collections import Counter

text = "Test. test test. Test Test test."
words = text.lower().split()
# Strip periods from both ends of each word
the_list = [w.strip('.') for w in words]
counts = Counter(the_list)

Note that .strip() only removes punctuation from the beginning and end of a string, not from the middle.

Way 2:

new_text = text.replace('.', '')
# Split into words first; Counter(new_text) would count single characters
counts = Counter(new_text.split())
Ajax1234

If all you want is to extract words (for counting or any other reason), use the regular-expression function re.findall (or re.finditer if the texts are big and you don't want to collect all the matches in memory):

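import re
from collections import Counter

text = """
Test. test test. Test Test test. 
"""

# Lower-case first so "Test" and "test" are counted together
# Counter({'test': 6})
counts = Counter(re.findall(r"\w+", text.lower()))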

Note that this may be trickier with non-ASCII text, and \w+ does not account for, e.g., words-with-dashes or apostrophes, which it splits into separate matches.
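
If dashes and apostrophes inside words matter, one possible pattern is sketched below, combined with re.finditer so the matches are consumed lazily. The pattern, the name WORD, and the sample text are illustrative assumptions, not the only reasonable definition of a "word".

```python
import re
from collections import Counter

# Made-up sample text with a hyphenated word and a contraction.
text = "A well-known test. Don't test the well-known test!"

# [^\W\d_] is "a word character that is not a digit or underscore", i.e. a letter.
# A word is a run of letters, optionally continued by ' or - plus more letters.
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

# finditer yields matches one at a time, so the full list is never built.
counts = Counter(m.group(0) for m in WORD.finditer(text.lower()))
print(counts)
```

Because Python 3 str patterns match Unicode by default, the same pattern also picks up accented letters.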

drdaeman

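To remove characters throughout the text, you need to work on it word by word.

strip is a handy function, and you can use it to remove multiple characters all at once, but it only trims from the two ends of the string: on each side it stops at the first character that is not in the set you passed. Called on the whole text, it therefore never reaches the punctuation in the middle.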

word = text.split()
text_list = [i.strip('.') for i in word]
count = len(text_list)
text = " ".join(text_list)

This way you work with each word.
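
As a sketch of the same idea with more than one character, strip can be given a whole set of characters to trim. The sample text and the choice of string.punctuation are assumptions for illustration:

```python
import string
from collections import Counter

# Made-up sample text; "mid.word" shows what strip leaves alone.
text = "Test. test! (test) Test? mid.word test..."

# strip() treats its argument as a set of characters and trims any of them
# from both ends of each word; characters inside a word are not touched.
words = [w.strip(string.punctuation) for w in text.lower().split()]

# Drop tokens that consisted of nothing but punctuation.
counts = Counter(w for w in words if w)
print(counts)  # Counter({'test': 5, 'mid.word': 1})
```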

Hope this helps

yatabani