Writing a program to print hapaxes from a string

Question

A hapax is a word that occurs only once in a string. My code sort of works. At first it printed only the first hapax. Then I changed the input string, and it printed the first and the last hapax but not the second one. Here's my current code:

def hapax(stringz):
    w = ''
    l = stringz.split()
    for x in l:
        w = ''
        l.remove(x)
        for y in l:
            w += y
        if w.find(x) == -1:
            print(x)


hapax('yo i went jogging then yo i went joggin tuesday wednesday')

All I got was

then
wednesday

kdopen · Answer 1 · 2015-03-23T13:47:06.543

You can do this quickly with the Counter class.

>>> s='yo i went jogging then yo i went joggin tuesday wednesday'
>>> from collections import Counter
>>> Counter(s.split())
Counter({'yo': 2, 'i': 2, 'went': 2, 'joggin': 1, 'then': 1, 'tuesday': 1, 'wednesday': 1, 'jogging': 1})

Then simply iterate through the returned dictionary, looking for words with a count of 1:

>>> c=Counter(s.split())
>>> for w in c:
...     if c[w] == 1:
...         print(w)
... 
joggin
then
tuesday
wednesday
jogging
>>>

You'll note that you actually have five hapaxes in that string: joggin, then, tuesday, wednesday, and jogging.

You may need additional logic to decide whether "Jogging" and "jogging" are different words. You also need to decide whether punctuation counts (and remove it if it shouldn't). That all depends on the exact requirements of your problem statement.
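As one possible way to settle both questions, here is a minimal sketch. It assumes that case should be ignored and that only leading and trailing punctuation should be stripped, so a word like "don't" keeps its apostrophe. The helper name normalized_hapaxes is made up for illustration, and the rules it applies are assumptions rather than requirements from the question.

```python
import string
from collections import Counter

def normalized_hapaxes(text):
    """Return the words that occur exactly once, ignoring case and edge punctuation.

    Hypothetical helper: the normalization rules are assumptions,
    not requirements taken from the question.
    """
    # Strip punctuation from both ends of each word, then lowercase it.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    # Drop tokens that consisted of nothing but punctuation.
    counts = Counter(w for w in words if w)
    return [w for w, n in counts.items() if n == 1]

print(sorted(normalized_hapaxes("Yo! I went Jogging, then yo, I went jogging Tuesday.")))
# ['then', 'tuesday']
```

If interior punctuation should go as well, the stripping step would need to remove those characters everywhere in the word rather than only at its ends.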

Regarding your original code, I'm not sure what you were trying to accomplish with this loop:

for y in l:
    w += y

It simply concatenates all the words into a single string with no spaces. Thus, if l is ['the','cat','sat','on','the','mat'], w will be 'thecatsatonthemat', which may cause problems in your match. If the original string contained "I may be that maybe you are right", the words "may be" would concatenate to "maybe", and find would match it even though "maybe" is not among the remaining words.
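A small sketch of that false match, using the example sentence above. It also shows a related issue: find does a substring search, so 'joggin' is found inside 'jogging'. Comparing against the list of remaining words avoids both problems. The variable names are illustrative only.

```python
words = "I may be that maybe you are right".split()

# Every word except "maybe", joined without spaces as in the original loop.
others = [w for w in words if w != "maybe"]
joined = "".join(others)

print(joined)                          # Imaybethatyouareright
print(joined.find("maybe") != -1)      # True: "may" + "be" reads as "maybe"
print("maybe" in others)               # False: whole-word comparison is not fooled

# find() also matches inside longer words.
print("jogging".find("joggin") != -1)  # True
print("joggin" in ["jogging"])         # False
```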

Great minds :) Fixed that while you were typing your comment :) — kdopen, Mar 23 '15 at 13:42
`Counter(w.rstrip(string.punctuation) for w in s.split())` should do it — Padraic Cunningham, Mar 23 '15 at 13:44

miradulo · Answer 2 · 2015-03-23T13:47:06.673

You can use a list comprehension with collections.Counter to do so succinctly. Also note the call to .lower(), which puts all words in lowercase so that Jogging and jogging, for instance, are not counted as two different words.

from collections import Counter
my_str = 'yo i went Jogging then yo i went jogging tuesday wednesday'
my_list = Counter(my_str.lower().split())
print([element for element in my_list if my_list[element] == 1])

Outputs:

['wednesday', 'then', 'tuesday']

Furthermore, if you need to strip all punctuation in addition to ignoring capitalization, you could remove the punctuation characters before counting words, using a set built from string.punctuation, like so:

from collections import Counter
import string

my_str = 'yo! i went Jogging then yo i went jogging tuesday, wednesday.'
# Build the set once instead of once per character.
punct = set(string.punctuation)
removed_punct_str = ''.join(chara for chara in my_str if chara not in punct)
my_list = Counter(removed_punct_str.lower().split())
print([element for element in my_list if my_list[element] == 1])

score 0 · Accepted Answer · edited May 23 '17 at 11:51

String Module:

Use the string module to get the list of punctuation characters, and use a normal for loop to replace them. Demo:

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>

More Pythonic: see the question "how to replace punctuation in a string python?"

Algo:

1. Remove punctuation from the input text using the string module.
2. Convert the text to lower case.
3. Split the input text and update the dictionary of word counts.
4. Iterate over the items in the dictionary and collect the hapax words.

code:

import string
import collections

def hapax(text):
    # Remove Punctuation from the Input text.
    text = text.translate(string.maketrans("",""), string.punctuation)
    print "Debug 1- After remove Punctuation:", text

    # ignore:- Lower/upper/mix cases
    text = text.lower()
    print "Debug 2- After converted to Lower case:", text

    #- Create default dictionary. Key is the word and value is its count.
    word_count = collections.defaultdict(int)
    print "Debug 3- Collection Default Dictionary:", word_count

    #- Split text and update result dictionary.
    for word in text.split():
        if word:#- Ignore whitespace.
            word_count[word] += 1

    print "Debug 4- Word and its count:", word_count

    #- List which save word which value is 1.
    hapax_words = list()
    for word, value in word_count.items():
        if value==1:
            hapax_words.append(word)

    print "Debug 5- Final Hapax words:", hapax_words


hapax('yo i went jogging then yo i went jogging tuesday wednesday some punctuation ? I and & ')

Output:

$ python 2.py 
Debug 1- After remove Punctuation: yo i went jogging then yo i went jogging tuesday wednesday some punctuation  I and  
Debug 2- After converted to Lower case: yo i went jogging then yo i went jogging tuesday wednesday some punctuation  i and  
Debug 3- Collection Default Dictionary: defaultdict(<type 'int'>, {})
Debug 4- Word and its count: defaultdict(<type 'int'>, {'and': 1, 'then': 1, 'yo': 2, 'i': 3, 'tuesday': 1, 'punctuation': 1, 'some': 1, 'wednesday': 1, 'jogging': 2, 'went': 2})
Debug 5- Final Hapax words: ['and', 'then', 'tuesday', 'punctuation', 'some', 'wednesday']
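The code above is written for Python 2 (print statements, string.maketrans, and the two-argument form of translate). For Python 3, a rough equivalent might look like the sketch below. It assumes the three-argument form of str.maketrans to delete punctuation, and it returns the list instead of printing debug output. The name hapax_py3 is made up for this example.

```python
import string
import collections

def hapax_py3(text):
    """Python 3 sketch of the same four steps; returns the hapax words."""
    # Step 1: remove punctuation. The third argument lists characters to delete.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Step 2: ignore case.
    text = text.lower()
    # Step 3: count each word.
    word_count = collections.defaultdict(int)
    for word in text.split():
        word_count[word] += 1
    # Step 4: keep the words that were seen exactly once.
    return [word for word, value in word_count.items() if value == 1]

result = hapax_py3('yo i went jogging then yo i went jogging tuesday wednesday some punctuation ? I and & ')
print(sorted(result))
# ['and', 'punctuation', 'some', 'then', 'tuesday', 'wednesday']
```

The result is sorted before printing only so that the output does not depend on dictionary ordering.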

rajkrish06 · Answer 4 · 2016-08-29T01:51:39.397

Python 3.X code:

import string

def edit_word(new_str):
    """Lowercase the word and remove punctuation"""
    new_str = new_str.lower()
    # The third argument to maketrans lists the characters to delete.
    st_table = str.maketrans('', '', string.punctuation)
    return new_str.translate(st_table)

st = "String to check for hapax!, try with any string"
w_dict = {}
for w in st.split():
    ew = edit_word(w)
    w_dict[ew] = w_dict.get(ew, 0) + 1

for w, c in w_dict.items():
    if c == 1: print(w)
