Preceding Word Length

Question

I have to make a function that takes a single argument word and returns the average length(in characters) of the word that precedes word in the text. If word happens to be the first word occurring in the text, the length of the preceding word for that occurrence should be zero. For example

>>> average_length("the")
4.4
>>> average_length('whale')
False
average_length('ship.')
3.0

This is what I have written so far,

def average_length(word):
    text = "Call me Ishmael. Some years ago - never mind how long..........."
    words = text.split()
    wordCount = len(words)

    Sum = 0
    for word in words:
        ch = len(word)
        Sum = Sum + ch
    avg = Sum/wordCount
    return avg

I know this isn't right at all but I'm having trouble of how to approach this correctly. This question is asking me to find every instance of word in the text, and when you do, calculate the length of the word immediately before it in the text. Not every word from beginning to that word, just one.

I should have also mentioned that all the tests will only test my code using the first paragraph from 'Moby Dick':

"Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

This seems very similar to this: http://stackoverflow.com/questions/4664850/find-all-occurrences-of-a-substring-in-python — EdChum, Mar 21 '16 at 09:43

L3viathan · Answer 1 · 2016-04-01T13:58:19.867

It seems like you can save a lot of computation time by going over your data only once:

from collections import defaultdict
prec = defaultdict(list)
text = "Call me Ishmael. Some years ago..".split()

Create two iterators over your list. We call next on the second one, so that from now on, whenever we get an element from the iterator, we get a word and its successor.

first, second = iter(text), iter(text)
next(second)

Zipping over the two iterators ("abc","def" → "ad", "be", "cf"), we append the length of the first word to the list of predecessor lengths of the second. That works, because we're using a defaultdict(list), which returns an empty list for any not yet existing key.

for one, two in zip(first, second):  # pairwise
    prec[two].append(len(one))

Finally, we can create a new dictionary from words to the mean of their predecessor's lengths: Sum divided by length. Instead of this dictionary comprehension, you could also use a normal for loop.

# avg_prec_len = {key: sum(prec[key]) / len(prec[key]) for key in prec}
avg_prec_len = {}
for key in prec:
    # prec[key] is a list of lengths
    avg[key] = sum(prec[key]) / len(prec[key])

Then you can just look it up in that dictionary.

(If you're using Python 2, use izip instead of zip, and do from __future__ import division).

score 1 · Accepted Answer · edited May 23 '17 at 10:28

Based on your requirements for no imports and a simple approach, the following function does it without any, the comments and variable names should make the function logic pretty clear:

def match_previous(lst, word):
    # keep matches_count of how many times we find a match and total lengths
    matches_count = total_length_sum = 0.0
    # pull first element from list to use as preceding word
    previous_word = lst[0]
    # slice rest of words from the list 
    # so we always compare two consecutive words
    rest_of_words = lst[1:]
    # catch where first word is "word" and add 1 to matches_count
    if previous_word == word:
        matches_count += 1
    for current_word in rest_of_words:
        # if the current word matches our "word"
        # add length of previous word to total_length_sum
        # and increase matches_count.
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
        # always update to keep track of word just seen
        previous_word = current_word
    # if  matches_count is 0 we found no word in the text that matched "word"
    return total_length_sum / matches_count if matches_count else False

It takes two args, the split list of words and the word to search for:

In [41]: text = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to previous_wordent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I acmatches_count it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

In [42]: match_previous(text.split(),"the")
Out[42]: 4.4

In [43]: match_previous(text.split(),"ship.")
Out[43]: 3.0

In [44]: match_previous(text.split(),"whale")
Out[44]: False

In [45]: match_previous(text.split(),"Call")
Out[45]: 0.0

You can obviously do the same as your own function, take a single arg do the split text in the function. The only way False will be returned will be if we found no match for the word, you can see call returns 0.0 as it is the first word in the text.

If we add some prints to the code and use enumerate:

def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    previous_word = lst[0]
    rest_of_words = lst[1:]
    if previous_word == word:
        print("First word matches.")
        matches_count += 1
    for ind, current_word in enumerate(rest_of_words, 1):
        print("On iteration {}.\nprevious_word = {} and current_word = {}.".format(ind, previous_word, current_word))
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
            print("We found a match at index {} in our list of words.".format(ind-1))
        print("Updating previous_word from {} to {}.".format(previous_word, current_word))
        previous_word = current_word
    return total_length_sum / matches_count if matches_count else False

And run it with a small sample list we can see what happens:

In [59]: match_previous(["bar","foo","foobar","hello", "world","bar"],"bar")
First word matches.
On iteration 1.
previous_word = bar and current_word = foo.
Updating previous_word from bar to foo.
On iteration 2.
previous_word = foo and current_word = foobar.
Updating previous_word from foo to foobar.
On iteration 3.
previous_word = foobar and current_word = hello.
Updating previous_word from foobar to hello.
On iteration 4.
previous_word = hello and current_word = world.
Updating previous_word from hello to world.
On iteration 5.
previous_word = world and current_word = bar.
We found a match at index 4 in our list of words.
Updating previous_word from world to bar.
Out[59]: 2.5

The advantage to use iter is we don't need to create a new list by slicing the remainder, to use it in the code you would just need to change the start of the function to:

def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    # create an iterator
    _iterator = iter(lst)
    # pull first word from iterator
    previous_word = next(_iterator)
    if previous_word == word:
        matches_count += 1
    # _iterator will give us all bar the first word we consumed with  next(_iterator)
    for current_word in _iterator:

Each time you consume an element from an iterator, we move to the next element:

In [61]: l = [1,2,3,4]

In [62]: it = iter(l)

In [63]: next(it)
Out[63]: 1

In [64]: next(it)
Out[64]: 2
# consumed two of four so we are left with two
In [65]: list(it)
Out[65]: [3, 4]

The only way a dict really makes sense is if you take multiple words to your function which you can do with *args:

def sum_previous(text):
    _iterator = iter(text.split())
    previous_word = next(_iterator)
    # set first k/v pairing with the first word
    # if  "total_lengths" is 0 at the end we know there
    # was only one match at the very start
    avg_dict = {previous_word: {"count": 1.0, "total_lengths": 0.0}}
    for current_word in _iterator:
        # if key does not exist, it creates a new key/value pairing
        avg_dict.setdefault(current_word, {"count": 0.0, "total_lengths": 0.0})
        # update value adding word length and increasing the count
        avg_dict[current_word]["total_lengths"] += len(previous_word)
        avg_dict[current_word]["count"] += 1
        previous_word = current_word
    # return the dict so we can use it outside the function.
    return avg_dict


def match_previous_generator(*args):
    # create our dict mapping words to sum of all lengths of their preceding words.
    d = sum_previous(text)
    # for every word we pass to the function.
    for word in args:
        # use dict.get with a default of an empty dict.
        #  to catch when a word is not in out text.
        count = d.get(word, {}).get("count")
        # yield each word and it's avg or False for non existing words.
        yield (word, d[word]["total_lengths"] / count if count else False)

Then just pass in the text and all the words you want to search for, you can call list on the generator function:

In [69]: list(match_previous_generator("the","Call", "whale", "ship."))
Out[69]: [('the', 4.4), ('Call', 0.0), ('whale', False), ('ship.', 3.0)]

Or iterate over it:

In [70]: for tup in match_previous_generator("the","Call", "whale", "ship."):
   ....:     print(tup)
   ....:     
('the', 4.4)
('Call', 0.0)
('whale', False)
('ship.', 3.0)

Łukasz Rogalski · Answer 3 · 2016-04-01T13:52:02.953

I'd suggest splitting this task to some atomic parts:

from __future__ import division  # int / int should result in float

# Input data:
text = "Lorem ipsum dolor sit amet dolor ..."
word = "dolor"

# First of all, let's extract words from string
words = text.split()

# Find indices of picked word in words
indices = [i for i, some_word in enumerate(words) if some_word == word]

# Find indices of preceding words
preceding_indices = [i-1 for i in indices]

# Find preceding words, handle first word case
preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]

# Calculate mean of words length
mean = sum(len(w) for w in preceding_words) / len(preceding_words)

# Check if result is correct
# (len('ipsum') + len('amet')) / 2 = 9 / 2 = 4.5
assert mean == 4.5

Obviously we may wrap it to function. I dropped a comments here:

def mean_length_of_preceding_words(word, text):
    words = text.split()
    indices = [i for i, some_word in enumerate(words) if some_word == word]
    preceding_indices = [i-1 for i in indices]
    preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]
    mean = sum(len(w) for w in preceding_words) / len(preceding_words)
    return mean

Obviously performance is not a key here - I attempted to use only built-ins (from __future__... is a built-in in my opinion), and keep intermediate steps clean and self-explanatory.

Some test cases:

assert mean_length_of_preceding_words("Lorem", "Lorem ipsum dolor sit amet dolor ...") == 0.0
assert mean_length_of_preceding_words("dolor", "Lorem ipsum dolor sit amet dolor ...") == 4.5
mean_length_of_preceding_words("E", "A B C D")  # ZeroDivisionError - average length of zero words does not exist

Splitting process (words = ...) should be tweaked, if you want to somehow handle punctuation. Specification does not mention it, so I kept that plain and simple.

I don't like changing return type for special case, but if you insist, you may do early exit.

def mean_length_of_preceding_words(word, text):
    words = text.split()
    if word not in words:
        return False
    indices = [i for i, some_word in enumerate(words) if some_word == word]
    preceding_indices = [i-1 for i in indices]
    preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]
    mean = sum(len(w) for w in preceding_words) / len(preceding_words)
    return mean

Last test case changes to:

assert mean_length_of_preceding_words("E", "A B C D") is False

gboffi · Answer 4 · 2016-04-01T13:25:53.657

This answer is based on the assumption that you want to strip away all the punctuation to have just words...

I play dirty prepending a null string to the list of words, so that your requirement about the predecessor of the first word of text is satisfied.

The result is computed using some smart indexing that numpy makes possible.

class Preceding_Word_Length():
    def __init__(self, text):
        import numpy as np
        self.words = np.array(
            ['']+[w.strip(''',.?!'":''') for w in text.split() if w != '-'])
        self.indices = np.arange(len(self.words))
        self.lengths = np.fromiter((len(w) for w in self.words), float)
    def mean(self, word):
        import numpy as np
        if word not in self.words:
            return 0.0
        return np.average(self.lengths[self.indices[word==self.words]-1])

text = '''Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.'''

ishmael = Preceding_Word_Length(text)

print(ishmael.mean('and'))   # -> 6.28571428571
print(ishmael.mean('Call'))  # -> 0.0
print(ishmael.mean('xyz'))   # -> 0.0

I would like to stress that implementing this behavior within a class leads to an easy way to cache some computations that are repeated for successive analysis of the same text.

score 1 · Answer 5 · answered Apr 01 '16 at 13:42

Very similar to my previous answer, w/o importing numpy

def average_length(text, word):
    words = ['']+[w.strip(''',.?!'":''') for w in text.split() if w != '-']
    if word not in words: return False
    match = [len(prev) for prev, curr in zip(words[:-1],words[1:]) if curr==word]
    return 1.0*sum(match)/len(match)

Preceding Word Length

5 Answers5

Linked