
I'm trying to write a function that returns a dictionary whose keys are pairs of words that appear consecutively in the input file and whose values are lists of every word that has followed that pair in the file. For example, suppose the input file contained only the sentence "This clause is first, and this clause came second." The resulting dictionary should be: {(this, clause):[is, came], (clause, is):[first], (is, first):[and], (first, and):[this], (and, this):[clause], (clause, came):[second]}.

import string

def predictive(text_file):
    file = open(text_file, encoding='utf8')
    text = file.read()
    file.close()

    punc = string.punctuation + '’”—⎬⎪“⎫1234567890'
    new_text = text
    for char in punc:
        new_text = new_text.replace(char, '')
        new_text = new_text.lower()
    text_split = new_text.split()
    print(text_split)

predictive('gatsby.txt')

I used The Great Gatsby as the text file, stripped away unnecessary punctuation, and lowercased the words. I'm not sure what to do next to return what I am looking for, but I would really appreciate any suggestions or help pointing me in the right direction. Thanks!

Boris Verkhovskiy
Stiff

4 Answers

0

One way to do this, after sanitising the string as you described, would be:

  1. Enumerate your list of split words
  2. Create a dictionary to hold your results (this makes repeated pairs easy to handle)
  3. Iterate over the enumerated list, building each dictionary key from the pair list[idx] and list[idx+1] (as a tuple, to match your expected output) and appending list[idx+2] to the list stored under that key.

Don't forget to check for index out of bounds while in the for loop, otherwise you'll get errors.

Done! You have what you need.
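
The steps above can be sketched roughly as follows. This is only one possible reading of them: the function name `pairs_to_followers` is made up for illustration, and it assumes the words have already been lowercased and stripped of punctuation.

```python
def pairs_to_followers(words):
    """Map each pair of consecutive words to the words that follow it."""
    result = {}
    for idx, word in enumerate(words):
        # Bounds check: stop once there is no word two positions ahead.
        if idx + 2 >= len(words):
            break
        key = (word, words[idx + 1])
        # Create the list the first time this pair is seen.
        if key not in result:
            result[key] = []
        result[key].append(words[idx + 2])
    return result


words = "this clause is first and this clause came second".split()
print(pairs_to_followers(words))
```

Looping over `range(len(words) - 2)` instead would avoid the explicit bounds check, at the cost of not using `enumerate`.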

Yuri Eastwood
0

Since you've already split the text into words, all that's left is to iterate over every run of three consecutive words: use dict.setdefault to initialize the value for the first two words to an empty list, then append the third word to it:

text = "This clause is first, and this clause came second."
words = text.lower().replace(',', '').replace('.', '').split()

result = {}
for first, second, third in zip(words, words[1:], words[2:]):
    result.setdefault((first, second), []).append(third)
print(result)
# {('this', 'clause'): ['is', 'came'], ('clause', 'is'): ['first'], ('is', 'first'): ['and'], ('first', 'and'): ['this'], ('and', 'this'): ['clause'], ('clause', 'came'): ['second']}

I think you'll find that the hardest part is going to be splitting real-world text into words. Just removing the punctuation like you've done will get you most of the way there, but it's hard to get it completely right. You could try using the nltk library for this but it splits "didn't" as ["did", "n't"] so you'd have to re-join those back into a single word.
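
As a rough illustration of a middle ground, here is a sketch that tokenizes with a regular expression instead of deleting punctuation, so contractions such as "didn't" stay in one piece. The function name `tokenize` and the pattern are assumptions for this example, not something from the question; it only handles unaccented a-z letters and would need adjusting for other text.

```python
import re


def tokenize(text):
    """Lowercase the text and return its words, keeping inner apostrophes."""
    # Treat the curly apostrophe (U+2019) like a straight one.
    text = text.lower().replace("\u2019", "'")
    # One or more letters, optionally followed by apostrophe + letters.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text)


print(tokenize("I didn't say 'hello', he did."))
```

Quotation marks around a word are dropped because an apostrophe is only kept when it has letters on both sides.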

Boris Verkhovskiy
0

With zip and defaultdict:

from collections import defaultdict
text = "This clause is first, and this clause came second".lower()
words = text.split()
two_words = zip(words[:-2],words[1:-1])
two_words_values = zip(two_words,words[2:])

twoD = defaultdict(list)

for i,j in two_words_values:
    twoD[i].append(j)

print(twoD)

Output:

defaultdict(<class 'list'>, {('this', 'clause'): ['is', 'came'], ('clause', 'is'): ['first,'], ('is', 'first,'): ['and'], ('first,', 'and'): ['this'], ('and', 'this'): ['clause'], ('clause', 'came'): ['second']})

A defaultdict behaves like a regular dictionary, except that looking up a missing key creates it with a default value (here an empty list), so you don't need to check whether the key is present first.
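
To make the difference concrete, here is a small sketch comparing the two. The variable names are only for illustration.

```python
from collections import defaultdict

followers = defaultdict(list)
# The key does not exist yet; defaultdict creates an empty list for it.
followers[("this", "clause")].append("is")
followers[("this", "clause")].append("came")

plain = {}
try:
    # A plain dict raises KeyError for a key that was never set.
    plain[("this", "clause")].append("is")
except KeyError:
    missing_key_raised = True
else:
    missing_key_raised = False

# Caveat: merely reading a missing key from a defaultdict also inserts it.
_ = followers[("never", "seen")]

print(dict(followers))
print(missing_key_raised)
```

If that last behaviour is unwanted, use `in` or `.get()` for lookups, or convert to a plain dict with `dict(followers)` once it is built.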


Without defaultdict, using a plain dict:

text = "This clause is first, and this clause came second".lower()
words = text.split()
two_words = zip(words[:-2],words[1:-1])
two_words_values = zip(two_words,words[2:])

twoD = {}

for i,j in two_words_values:
    if i not in twoD:
        twoD[i] = []
    twoD[i].append(j)

print(twoD)

Output:

{('this', 'clause'): ['is', 'came'], ('clause', 'is'): ['first,'], ('is', 'first,'): ['and'], ('first,', 'and'): ['this'], ('and', 'this'): ['clause'], ('clause', 'came'): ['second']}
Rishabh Kumar
0
import re
from itertools import islice
from collections import defaultdict

sentence = '''This clause is first, and this clause came second'''
words = re.findall(r'\w+', sentence.lower())

n_adjacent_words = 2
words_dict = defaultdict(list)

for *first_words, next_word in zip(
    *(islice(words, i, None) 
      for i in range(n_adjacent_words+1)
     )
):
    words_dict[tuple(first_words)].append(next_word)

>>> words_dict
defaultdict(list,
            {('this', 'clause'): ['is', 'came'],
             ('clause', 'is'): ['first'],
             ('is', 'first'): ['and'],
             ('first', 'and'): ['this'],
             ('and', 'this'): ['clause'],
             ('clause', 'came'): ['second']})
BallpointBen