Python create book index from text file

Question

I have a text file that may look like this...

3:degree
54:connected
93:adjacent
54:vertex
19:edge
64:neighbor
72:path
55:shortest path
127:tree
3:degree
55:graph
64:adjacent   and so on....

I want my function to read each line of text and split it at the colon to build a dictionary, with the word in the 'key' position and the page numbers in the 'value' position. I'll then have to create a new dictionary and scan through each word: if it's already in the dictionary, just add the page number to its entry, and if it's not in the dictionary, add it.

This is my idea so far...

def index(fileName):

    inFile=open(fileName,'r')
    index={}
    for line in inFile:
        line=line.strip()      #This will get rid of my new line character
        word=line[1]
        if word not in index:
            index[word]=[]
            index[word].append(line)
    return index

fileName='terms.txt'

print(index(fileName))

I think I'm on the right track but just need a little help to get going.

Have you looked at this,http://stackoverflow.com/questions/3199171/append-multiple-values-for-one-key-in-python-dictionary? It is similar in method. — AppliedNumbers, Jul 24 '13 at 14:58

rnbguy · Answer 1 · 2013-07-24T13:39:51.657

0

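Change the lines marked with # edit. Each line has the form page:word, so unpack it in that order and keep the page under a name other than index, otherwise the dictionary gets overwritten:

def index(fileName):
    inFile=open(fileName,'r')
    index={}
    for line in inFile:
        line=line.strip().split(':',1) # edit
        page,word=line # edit
        if word not in index:
            index[word]=[]
        index[word].append(page) # edit
    return index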

edited Jul 24 '13 at 13:39

answered Jul 24 '13 at 13:32


Use `index.setdefault` to shorten that if-else condition to just one line. – Ashwini Chaudhary Jul 24 '13 at 13:48

score 0 · Answer 2 · answered Jul 24 '13 at 13:32

You are not splitting the line, you are only taking the character at position 1.
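A quick way to see the difference, using one sample line from the question (the variable names here are only for illustration):

```python
line = "3:degree\n"

# Indexing a string returns a single character, not a field.
second_char = line[1]

# Splitting once on the colon yields the page and the term.
page, word = line.strip().split(":", 1)

print(repr(second_char))  # ':'
print(page, word)         # 3 degree
```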

Use .split(':', 1) to split the line once on the colon:

def index(filename):
    with open(filename) as infile:
        index = {}
        for line in infile:
            page, word = map(str.strip, line.split(':', 1))
            index.setdefault(word, []).append(int(page))
        return index

You may want to use a set instead to avoid the same page number being added twice. You can also use collections.defaultdict to simplify this a little further still:

from collections import defaultdict

def index(filename):
    with open(filename) as infile:
        index = defaultdict(set)
        for line in infile:
            page, word = map(str.strip, line.split(':', 1))
            index[word].add(int(page))
        return index

This gives:

defaultdict(<type 'set'>, {'neighbor': set([64]), 'degree': set([3]), 'tree': set([127]), 'vertex': set([54]), 'shortest path': set([55]), 'edge': set([19]), 'connected': set([54]), 'adjacent': set([64, 93]), 'graph': set([55]), 'path': set([72])})

for your input text. A defaultdict is a subclass of dict and behaves just like a normal dictionary, except that it creates a new set for any key you access that is not yet present.
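As a minimal sketch of that behaviour (the keys and page numbers are taken from the sample input, the variable has_tree is just for illustration):

```python
from collections import defaultdict

index = defaultdict(set)

# Accessing a missing key creates an empty set and stores it under that key.
index["adjacent"].add(93)
index["adjacent"].add(64)
index["adjacent"].add(64)   # the set ignores the duplicate page

# A membership test does not create a key; only item access does.
has_tree = "tree" in index

print(sorted(index["adjacent"]))  # [64, 93]
print(has_tree)                   # False
print(isinstance(index, dict))    # True
```

Because only item access triggers the default, checking `"tree" in index` is safe, whereas reading `index["tree"]` would silently add an empty entry.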

Thanks so much- I want to add in a line that converts all uppercase letters to lowercase letters- Do I have to turn it into a string to convert it to lowercase and then into a list to sort it alphabetically? — user2553807, Jul 24 '13 at 20:57
d=str(index) for element in d: element.lower() # would something like this work? — user2553807, Jul 24 '13 at 20:59
Don't turn `index` into a string, it's a dictionary. I'm not sure what you are trying to achieve here; `index[word.lower()].add(int(page))` would store the words lowercased to start with. — Martijn Pieters, Jul 24 '13 at 21:06
To loop over `index` in sorted order (by key), use `for word in sorted(index):`. — Martijn Pieters, Jul 24 '13 at 21:06
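Putting the two comments above together, one possible sketch looks like this. The sample lines (with mixed case) and the output format are assumptions for illustration; in practice the lines would come from the file.

```python
from collections import defaultdict

# Hypothetical sample lines standing in for the contents of the text file.
lines = ["3:Degree", "93:adjacent", "64:Adjacent", "3:degree", "19:Edge"]

index = defaultdict(set)
for line in lines:
    page, word = map(str.strip, line.split(":", 1))
    index[word.lower()].add(int(page))   # store the key lowercased

# Walk the keys alphabetically and list each word's pages in numeric order.
entries = [
    "{}: {}".format(word, ", ".join(str(p) for p in sorted(index[word])))
    for word in sorted(index)
]
for entry in entries:
    print(entry)
# adjacent: 64, 93
# degree: 3
# edge: 19
```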

score 0 · Answer 3 · answered Jul 24 '13 at 13:32

You can use str.split to separate a string into tokens. In your case, the delimiter is :.

records = """3:degree
     54:connected
     93:adjacent
     54:vertex"""
index = {}
for line in records.split('\n'):
     page, word = line.split(':')
     index[word] = int(page.strip())

index
# {'vertex': 54, 'connected': 54, 'adjacent': 93, 'degree': 3}

At some point you will need to handle words with multiple page references. For this, I recommend creating a collections.defaultdict with list as the default:

from collections import defaultdict
index = defaultdict(list)
for line in records.split('\n'):
    page, word = line.split(':')
    index[word].append(int(page.strip()))  # add reference to this page
