my_file = """The Itsy Bitsy Spider went up the water spout.
Down came the rain & washed the spider out.
Out came the sun & dried up all the rain,
And the Itsy Bitsy Spider went up the spout again. """

Expected output:

{'the': ['itsy', 'water', 'rain', 'spider', 'sun', 'rain', 'itsy', 'spout'], 'itsy': ['bitsy', 'bitsy'], 'bitsy': ['spider', 'spider'], 'spider': ['went', 'out', 'went'], 'went': ['up', 'up'], 'up': ['the', 'all', 'the'], 'water': ['spout'], 'spout': ['down', 'again'], 'down': ['came'], 'came': ['the', 'the'], 'rain': ['washed', 'and'], 'washed': ['the'], 'out': ['out', 'came'], 'sun': ['dried'], 'dried': ['up'], 'all': ['the'], 'and': ['the'], 'again': []}

My code:

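import string

# Count how often each cleaned, lowercased word occurs.
words_set = {}
# my_file is a string here, so iterate over its lines rather than its characters.
for line in my_file.splitlines():
    lower_text = line.lower()
    for word in lower_text.split():
        # Trim punctuation and digits from both ends; a lone "&" becomes empty.
        word = word.strip(string.punctuation + string.digits)
        if word:
            if word in words_set:
                words_set[word] = words_set[word] + 1
            else:
                words_set[word] = 1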
Robert

1 Answer

You can reproduce your expected results with a few concepts:

Given

import string
import itertools as it
import collections as ct


data = """\
The Itsy Bitsy Spider went up the water spout.
Down came the rain & washed the spider out.
Out came the sun & dried up all the rain,
And the Itsy Bitsy Spider went up the spout again.
"""

Code

def clean_string(s: str) -> str:
    """Return a lowercased string without punctuation or newlines."""
    # The third argument to str.maketrans lists the characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return s.lower().translate(table).replace("  ", " ").replace("\n", " ")


def get_neighbors(words: list) -> dict:
    """Return a dict of right-hand, neighboring words."""
    dd = ct.defaultdict(list)
    # Pair each word with its successor; the last word is paired with "".
    for word, nxt in it.zip_longest(words, words[1:], fillvalue=""):
        dd[word].append(nxt)
    return dict(dd)

Demo

words = clean_string(data).split()
get_neighbors(words)

Results

{'the': ['itsy', 'water', 'rain', 'spider', 'sun', 'rain', 'itsy', 'spout'],
 'itsy': ['bitsy', 'bitsy'],
 'bitsy': ['spider', 'spider'],
 'spider': ['went', 'out', 'went'],
 'went': ['up', 'up'],
 'up': ['the', 'all', 'the'],
 'water': ['spout'],
 'spout': ['down', 'again'],
 'down': ['came'],
 'came': ['the', 'the'],
 'rain': ['washed', 'and'],
 'washed': ['the'],
 'out': ['out', 'came'],
 'sun': ['dried'],
 'dried': ['up'],
 'all': ['the'],
 'and': ['the'],
 'again': ['']}

Details

clean_string

  • There are many ways to remove punctuation. Here a translation table deletes every character in string.punctuation (including "&"). The str.replace() calls then collapse the double spaces left behind and turn newlines into spaces.
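To see what the translation step does on its own, here is a minimal sketch on one line of the rhyme. The variable names `line` and `stripped` are just for illustration.

```python
import string

# One sample line from the rhyme above.
line = "Down came the rain & washed the spider out."

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans("", "", string.punctuation)
stripped = line.lower().translate(table)

print(repr(stripped))    # 'down came the rain  washed the spider out'
print(stripped.split())  # split() with no arguments collapses runs of whitespace
```

Note the double space where "&" used to be. Because str.split() with no arguments treats any run of whitespace as one separator, the str.replace() calls act as a safety net rather than a requirement when the result is split right away.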

get_neighbors

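  • A defaultdict(list) builds a dict of lists. A new, empty list is created whenever a missing key is accessed.
  • We build the dict by iterating over two offset copies of the word list, one a single word ahead of the other.
  • The lists are zipped to the length of the longer one, and the shorter list is padded with an empty string. That is why 'again' maps to [''] here rather than to the empty list shown in the expected output.
  • dict(dd) ensures a plain dict is returned.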
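If the final word should map to an empty list, exactly as in the question's expected output, one possible variation is to use plain zip() and register the last word separately. This is only a sketch; `get_neighbors_exact` and `sample` are hypothetical names, not part of the code above.

```python
import collections as ct


def get_neighbors_exact(words: list) -> dict:
    """Map each word to the list of words that directly follow it."""
    dd = ct.defaultdict(list)
    # zip() stops at the shorter sequence, so the last word gets no filler value.
    for word, nxt in zip(words, words[1:]):
        dd[word].append(nxt)
    if words:
        # Make sure the final word has a key, even if nothing follows it.
        dd.setdefault(words[-1], [])
    return dict(dd)


sample = "the spider went up the spout again".split()
print(get_neighbors_exact(sample))
# {'the': ['spider', 'spout'], 'spider': ['went'], 'went': ['up'],
#  'up': ['the'], 'spout': ['again'], 'again': []}
```

Called on the `words` list from the demo, this should give the same dict as before except that 'again' maps to [] instead of [''].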

If you solely wish to count words:

Demo

ct.Counter(words)

Results

Counter({'the': 8,
         'itsy': 2,
         'bitsy': 2,
         'spider': 3,
         'went': 2,
         'up': 3,
         'water': 1,
         'spout': 2,
         'down': 1,
         'came': 2,
         'rain': 2,
         'washed': 1,
         'out': 2,
         'sun': 1,
         'dried': 1,
         'all': 1,
         'and': 1,
         'again': 1})
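For comparison with the manual tally in the question, here is a small sketch on a shortened, hypothetical word list (`sample`). It suggests that Counter produces the same counts as the hand-written loop and adds helpers such as most_common().

```python
import collections as ct

# A shortened word list, used only for illustration.
sample = "the itsy bitsy spider went up the water spout".split()
counts = ct.Counter(sample)

# Manual tally in the style of the question's loop.
manual = {}
for word in sample:
    manual[word] = manual.get(word, 0) + 1

print(dict(counts) == manual)  # True
print(counts.most_common(1))   # [('the', 2)]
```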
pylang