-1

This code is meant to read a text file and add every word to a dictionary where the key is the first letter and the values are all the words in the file that start with that letter. It mostly works, but I run into two problems:

  1. The dictionary keys include apostrophes and periods (how do I exclude them?)
  2. The values aren't sorted alphabetically and are all jumbled up. The code ends up outputting something like this:
' - {"don't", "i'm", "let's"}
. - {'below.', 'farm.', 'them.'}
a - {'take', 'masters', 'can', 'fallow'}
b - {'barnacle', 'labyrinth', 'pebble'}
...
...
y - {'they', 'very', 'yellow', 'pastry'}

when it should be more like:

a - {'ape', 'army', 'arrow', 'arson'}
b - {'bank', 'blast', 'blaze', 'breathe'}
etc
# make empty dictionary
dic = {}

# read file
infile = open('file.txt', "r")

# read first line
lines = infile.readline()
while lines != "":
    # split the words up and remove "\n" from the end of the line
    lines = lines.rstrip()
    lines = lines.split()

    for word in lines:
        for char in word:
            # add if not in dictionary
            if char not in dic:
                dic[char.lower()] = set([word.lower()])
            # Else, add word to set
            else:
                dic[char.lower()].add(word.lower())
    # Continue reading
    lines = infile.readline()

# Close file
infile.close()

# Print
for letter in sorted(dic): 
    print(letter + " - " + str(dic[letter]))

I'm guessing I need to remove the punctuation and apostrophes from the whole file when I'm first iterating through it but before adding anything to the dictionary? Totally lost on getting the values in the right order though.

Perma
  • The problem is that you are looping over each of the characters in the word, then adding the word to that key. Just take the first character, i.e. `word[0]` and maybe check to see if it is `.isalpha()` – juanpa.arrivillaga Feb 12 '20 at 10:36
  • Note, never loop over a file like that; file objects are iterators over the lines in the file, so you can just do `for line in infile: ...` – juanpa.arrivillaga Feb 12 '20 at 10:37
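
The two comments above could be combined into something like the following sketch. It keeps the plain dict from the question, takes only the first character of each word as the key, and iterates over the lines directly. The helper name `group_by_first_letter` and the in-memory sample text are assumptions made for illustration; an open file object should work in place of the sample.

```python
import io


def group_by_first_letter(lines):
    """Group lowercased words under their first character (hypothetical helper)."""
    dic = {}
    # A file object yields its lines one by one, so no readline() loop is needed
    for line in lines:
        for word in line.split():
            word = word.lower()
            first = word[0]
            # Only the first character decides the key; skip "'", "." and so on
            if not first.isalpha():
                continue
            if first not in dic:
                dic[first] = {word}
            else:
                dic[first].add(word)
    return dic


# Sample text standing in for open('file.txt'); an assumption for this sketch
sample = io.StringIO("The yellow pastry\nA barnacle below\n'tis an ape\n")
result = group_by_first_letter(sample)
for letter in sorted(result):
    # sorted() turns each set into an alphabetically ordered list
    print(letter + " - " + ", ".join(sorted(result[letter])))
```

Sets have no defined order, which is why the values looked jumbled; calling `sorted()` on each set at print time gives the alphabetical listing without changing the data structure.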

3 Answers

1

Use `defaultdict(set)` and `dic[word[0]].add(word)`, after removing any leading punctuation. There's no need for the inner loop.
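
A minimal sketch of that suggestion might look like this. The function name `index_words` and the sample sentence are made up for illustration, and stripping punctuation from both ends of each word (rather than only the start) is an extra assumption so that tokens like `farm.` are cleaned up too.

```python
import string
from collections import defaultdict


def index_words(text):
    """Map each first character to the set of words starting with it (hypothetical helper)."""
    dic = defaultdict(set)
    for word in text.split():
        # Remove punctuation from both ends: 'farm.' -> 'farm', '"Fallow"' -> 'fallow'
        word = word.strip(string.punctuation).lower()
        # A token made only of punctuation is now empty, so skip it
        if word:
            dic[word[0]].add(word)
    return dic


# Made-up sample text; in practice this would be the contents of the file
index = index_words("Down on the farm. Don't feed them... \"Fallow\" fields below.")
for letter in sorted(index):
    print(letter, "-", ", ".join(sorted(index[letter])))
```

Apostrophes inside a word, as in `don't`, are left alone because `str.strip` only touches the ends. Words that start with a digit would still get their own key here; an `isalpha()` check on `word[0]` would drop them.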

Elazar
1
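from collections import defaultdict


def process_file(fn):
    my_dict = defaultdict(set)
    # Use a context manager so the file is closed once it has been read
    with open(fn, 'r') as infile:
        for word in infile.read().split():
            # Lowercase so that "Ape" and "ape" end up as the same entry
            word = word.lower()
            # Skip words that start with punctuation or digits
            if word[0].isalpha():
                my_dict[word[0]].add(word)
    return my_dict


word_dict = process_file('file.txt')
for letter in sorted(word_dict):
    # sorted() puts each set of words in alphabetical order
    print(letter + " - " + ', '.join(sorted(word_dict[letter])))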
Jordan Dimov
0

You have a number of problems:

  1. splitting words on spaces AND punctuation
  2. adding words to a set that might not exist yet at the time of the first addition
  3. sorting the output

Here is a short program that tries to solve the above issues:

import re, string

# instead of using "text = open(filename).read()" we exploit a piece
# of text contained in one of the imported modules
text = re.__doc__

# 1. how to split at once the text contained in the file
#
# credit to https://stackoverflow.com/a/13184791/2749397
p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)

# 2. how to instantiate a set when we do the first addition to a key,
#    that is, using the .setdefault method of every dictionary
d = {}
# Note: words regularized by lowercasing, we skip the empty tokens    
for word in (w.lower() for w in words if w):
    d.setdefault(word[0], set()).add(word)

# 3. how to print the sorted entries corresponding to each letter
for letter in sorted(d.keys()):
    print(letter, *sorted(d[letter]))

My text contains numbers, so digits show up in the output of the above program (see below). If you don't want them, filter them out when printing, e.g. `if letter not in '0123456789': print(...)`.

And here is the output:

0 0
1 1
8 8
9 9
a a above accessible after ailmsux all alphanumeric alphanumerics also an and any are as ascii at available
b b backslash be before beginning behaviour being below bit both but by bytes
c cache can case categories character characters clear comment comments compatibility compile complement complementing concatenate consist consume contain contents corresponding creates current
d d decimal default defined defines dependent digit digits doesn dotall
e each earlier either empty end equivalent error escape escapes except exception exports expression expressions
f f find findall finditer first fixed flag flags following for forbidden found from fullmatch functions
g greedy group grouping
i i id if ignore ignorecase ignored in including indicates insensitive inside into is it iterator
j just
l l last later length letters like lines list literal locale looking
m m made make many match matched matches matching means module more most multiline must
n n name named needn newline next nicer no non not null number
o object occurrences of on only operations optional or ordinary otherwise outside
p p parameters parentheses pattern patterns perform perl plus possible preceded preceding presence previous processed provides purge
r r range rather re regular repetitions resulting retrieved return
s s same search second see sequence sequences set signals similar simplest simply so some special specified split start string strings sub subn substitute substitutions substring support supports
t t takes text than that the themselves then they this those three to
u u underscore unicode us
v v verbose version versions
w w well which whitespace whole will with without word
x x
y yes yielding you
z z z0 za
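
The digit filter mentioned before the output could be sketched as follows. This variant uses `str.isalpha()` on the key instead of the `not in '0123456789'` test, which is a swapped-in choice rather than what the answer literally proposes, and the sample sentence is made up so the sketch stays short.

```python
import re
import string

# Made-up sample text; the program above uses re.__doc__ instead
text = "Route 66 has 2 lanes, 4 exits and zero tolls."

p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)

d = {}
for word in (w.lower() for w in words if w):
    d.setdefault(word[0], set()).add(word)

# Keep only the keys that are letters; isalpha() is False for digits
letters_only = {k: v for k, v in d.items() if k.isalpha()}
for letter in sorted(letters_only):
    print(letter, *sorted(letters_only[letter]))
```

Filtering the finished dictionary leaves the splitting and grouping code untouched, so the same filter can be dropped if digits are wanted after all.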

Without comments, and with a little obfuscation, the program boils down to just 3 lines of code (marked #1 to #3) after the setup:

import re, string
text = re.__doc__
p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)

d, add2d = {}, lambda w: d.setdefault(w[0],set()).add(w) #1
for word in (w.lower() for w in words if w): add2d(word) #2
for abc in sorted(d.keys()): print(abc, *sorted(d[abc])) #3
gboffi