I want to read a file and build a dictionary that maps each word to a list of the words that follow it.

For example, if I have a file that contains:

'Cake is cake okay.'

The dictionary created should contain:

{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}

So far my code does the opposite: it appends the previous word in the file to each word's list. I'm not sure how to change it so that it works as intended.

import string

def create_dict(file):
    word_dict = {}
    prev_word = ''

    for line in file:
        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)

            if clean_word not in word_dict:
                word_dict[clean_word] = []

            # Appends the *previous* word, which is the opposite of what I want.
            word_dict[clean_word].append(prev_word)
            prev_word = clean_word

    return word_dict

Thank you guys for your help in advance!

Edit

Updated with progress:

def create_dict(file):
    word_dict = {}
    next_word = ''

    for line in file:
        formatted_line = line.lower().split()

        for word in formatted_line:
            clean_word = word.strip(string.punctuation)

            if next_word != '':
                if next_word not in word_dict:
                    word_dict[next_word] = []

            if clean_word == '':
                clean_word.

            next_word = clean_word
    return word_dict
Mike Müller

2 Answers

You can use itertools.zip_longest() and dict.setdefault() for a shorter solution:

import io
from itertools import zip_longest  # izip_longest in Python 2
import string

def create_dict(fobj):
    word_dict = {}
    punc = string.punctuation
    for line in fobj:
        clean_words = [word.strip(punc) for word in line.lower().split()]
        for word, next_word in zip_longest(clean_words, clean_words[1:]):
            words = word_dict.setdefault(word, [])
            if next_word is not None:
                words.append(next_word)
    return word_dict

Test it:

>>> fobj = io.StringIO("""Cake is cake okay.""")
>>> create_dict(fobj)
{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}
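
Note that create_dict() above pairs words within each line only, so the last word of one line is not linked to the first word of the next. If pairs should also span line breaks, one possible variant is sketched below. The name create_dict_across_lines is hypothetical, and skipping tokens that consist only of punctuation is an assumption about the desired behaviour:

```python
import io
import string

def create_dict_across_lines(fobj):
    """Map each word to the words that follow it, across line breaks."""
    word_dict = {}
    prev_word = None  # no previous word before the first one
    for line in fobj:
        for raw in line.lower().split():
            word = raw.strip(string.punctuation)
            if not word:
                continue  # assumption: ignore tokens that were only punctuation
            word_dict.setdefault(word, [])
            if prev_word is not None:
                word_dict[prev_word].append(word)
            prev_word = word
    return word_dict

result = create_dict_across_lines(io.StringIO("Cake is\ncake okay."))
```

Here 'is' maps to ['cake'] even though the two words sit on different lines; the per-line version would leave 'is' with an empty list for this input.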
Mike Müller

Separate the code that generates words from a given file (splitting on whitespace, case folding, stripping punctuation, etc.) from the code that creates the bigram dictionary (the topic of this question):

#!/usr/bin/env python3
from collections import defaultdict
from itertools import tee

def create_bigram_dict(words):
    a, b = tee(words) # itertools' pairwise recipe
    next(b)
    bigrams = defaultdict(list)
    for word, next_word in zip(a, b):  
        bigrams[word].append(next_word)
    bigrams[next_word] # last word may have no following words
    return bigrams

See itertools' pairwise() recipe. To support fewer than two words in a file, the code requires minor tweaks. You could return dict(bigrams) instead if you need a plain dict. Example:

>>> create_bigram_dict('cake is cake okay'.split())
defaultdict(<class 'list'>, {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []})
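
The tweaks needed for inputs with fewer than two words might look like the sketch below. It keeps the tee() idea but swaps zip() for zip_longest() with a sentinel fill value, so the last word is still visited. The names create_bigram_dict_safe and _END are hypothetical, and returning a plain dict is a choice made here for illustration:

```python
from itertools import tee, zip_longest

_END = object()  # sentinel marking "no following word"

def create_bigram_dict_safe(words):
    """Like create_bigram_dict(), but tolerates zero or one word."""
    a, b = tee(words)
    next(b, None)  # advance b by one; the default avoids StopIteration on empty input
    bigrams = {}
    for word, next_word in zip_longest(a, b, fillvalue=_END):
        followers = bigrams.setdefault(word, [])
        if next_word is not _END:
            followers.append(next_word)
    return bigrams

empty = create_bigram_dict_safe([])
single = create_bigram_dict_safe(['cake'])
several = create_bigram_dict_safe('cake is cake okay'.split())
```

With this version an empty input gives an empty dict, and a single word gives that word mapped to an empty list.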

To create the dict from a file, you could define get_words(file):

#!/usr/bin/env python3
import regex as re  # $ pip install regex

def get_words(file):
    with file:
        for line in file:
            words = line.casefold().split()
            for w in words:
                yield re.fullmatch(r'\p{P}*(.*?)\p{P}*', w).group(1)

Usage: create_bigram_dict(get_words(open('filename'))).


The \p{P} pattern strips Unicode punctuation from both ends of each word. Punctuation inside a word is preserved, e.g.:

>>> import regex as re
>>> re.fullmatch(r'\p{P}*(.*?)\p{P}*', "doesn't.").group(1)
"doesn't"

Note: the dot at the end is gone but ' inside is preserved. To remove all punctuation, s = re.sub(r'\p{P}+', '', s) could be used:

>>> re.sub(r'\p{P}+', '', "doesn't.")
'doesnt'

Note: the single quote is gone too.
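
If installing the third-party regex module is not an option, a similar effect can probably be had with the standard library alone. The sketch below assumes that "punctuation" means any character whose Unicode general category starts with "P", which is what \p{P} matches. The helper names is_punct and strip_punct are hypothetical:

```python
import unicodedata

def is_punct(ch):
    # General categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P".
    return unicodedata.category(ch).startswith('P')

def strip_punct(word):
    """Remove leading and trailing Unicode punctuation, keep inner characters."""
    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    return word[start:end]

stripped = strip_punct("doesn't.")
```

As with the regex version, the trailing dot is removed and the inner apostrophe is kept.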

jfs