Parsing a huge dictionary file with Python. Simple task I can't get my head around

Question

I just got a giant 1.4-million-line dictionary for use in other programming projects, and I'm sad to see that Notepad++ is not powerful enough to parse it. The dictionary contains three types of lines:

<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>

and I want to extract every word from it into a list of words without duplicates. Let's start with my code.

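# 'w' creates the output file if it does not exist; both files are closed automatically
with open('dic.txt') as f, open('parsed_dic.txt', 'w') as p:
    for line in f:
        #<ar><k> lines
        #<kref> lines
        #ending to ";" - lines
        listofwordsfromaline = []  # placeholder: to be filled in from the three cases above
        for word in listofwordsfromaline:
            p.write(word + "\n")  # write() takes a single string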

I'm not particularly asking you how to do this whole thing, but anything would be helpful. A link to a tutorial or a parsing method for one type of line would be highly appreciated.

In the example you posted, what are the "words"? Given your example input, what would you like the output of _that example_ to look like (even if you have to do it by hand). Give us that, and we can tailor a response to your problem. — Hooked, Dec 14 '14 at 18:31

score 0 · Answer 1 · edited May 23 '17 at 11:57

First, decide what defines a word for you, then write a regular expression that captures those matches. For example, '\b' matches a word boundary: the zero-width position between a word character and a non-word character. https://docs.python.org/2/howto/regex.html

If the definition of a word differs between line types, use if statements to identify the type of line first, then apply the corresponding regular expression to extract the word.

Match groups in Python
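The per-line-type approach described above could be sketched roughly as follows. This is only an illustration: the pattern names and the `words_from_line` helper are hypothetical, and it assumes a "word" is either the content of a `<k>` or `<kref>` tag or a hyphen followed by word characters. The leading hyphen is kept here.

```python
import re

# Hypothetical patterns, one per line type shown in the question.
K_TAG = re.compile(r"<k>(.*?)</k>")
KREF_TAG = re.compile(r"<kref>(.*?)</kref>")
# A hyphen followed by word characters, e.g. "-aaltoisen".
INFLECTED = re.compile(r"-\w+")


def words_from_line(line):
    """Return the words found in one dictionary line (assumed formats only)."""
    match = K_TAG.search(line)
    if match:
        # Headword line: take the captured group between the tags.
        return [match.group(1)]
    match = KREF_TAG.search(line)
    if match:
        # Cross-reference line.
        return [match.group(1)]
    if ";" in line:
        # Inflection line: collect every hyphen-prefixed form.
        return INFLECTED.findall(line)
    return []


print(words_from_line("<ar><k>-aaltoiseen</k>"))
```

Free text at the end of an inflection line (such as the example sentence in the third sample line) is not filtered by this sketch and would probably need an extra rule.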

answered Dec 14 '14 at 18:33

tzharg

Thanks for the link! :D This was just what I was looking for. Sorry for being a newb. – punkkapoika Dec 14 '14 at 18:35

ZdaR · Accepted Answer · 2014-12-14T18:42:34.743

For the first two cases, every word starts and ends at a specific tag. Looking closely, every word has a ">-" string preceding it and a "</" string following it:

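# First and second cases
start = line.find(">-") + 2    # skip past ">-" to the first letter of the word
end = line.find("</", start)   # index of the closing tag's "<"; the slice excludes it
required_word = line[start:end]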

In the last case you can use the split method:

    word_list = line.split(";")
    ans = []
    for word in word_list:
        start = word.find("-")
        if start != -1:  # skip entries that contain no hyphenated form
            ans.append(word[start:].strip())
    ans = set(ans)  # a set drops duplicates
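Putting both cases together, a rough end-to-end sketch might look like the one below. The helper names (`extract_words`, `unique_words`, `parse_file`) are made up for illustration, the file names are taken from the question, and the UTF-8 encoding is an assumption. Unlike the split snippet above, this version drops the leading hyphen in every case and separates entries that list several forms.

```python
def extract_words(line):
    """Extract hyphen-prefixed words from one line with str.find / str.split."""
    start = line.find(">-")
    if start != -1:
        # Tagged line: the word sits between ">-" and the next "</".
        end = line.find("</", start)
        return [line[start + 2:end]] if end != -1 else []
    words = []
    for entry in line.split(";"):
        # An entry may list several forms, e.g. "mon.gen. -aaltoisten -aaltoisien".
        for token in entry.split():
            if token.startswith("-") and len(token) > 1:
                words.append(token[1:])
    return words


def unique_words(lines):
    """Collect the words from all lines, dropping duplicates with a set."""
    seen = set()
    for line in lines:
        seen.update(extract_words(line))
    return sorted(seen)


def parse_file(src="dic.txt", dst="parsed_dic.txt", encoding="utf-8"):
    # File names come from the question; the encoding is an assumption.
    with open(src, encoding=encoding) as f, open(dst, "w", encoding=encoding) as p:
        for word in unique_words(f):
            p.write(word + "\n")


sample = [
    "<ar><k>-aaltoiseen</k>",
    "yks.ill..ks. <kref>-aaltoinen</kref></ar>",
    "yks.nom. -aaltoinen; mon.gen. -aaltoisten -aaltoisien",
]
print(unique_words(sample))
```

Iterating over the file object reads one line at a time, so only the set of unique words has to fit in memory. Trailing free text on a line (like the example sentence in the third sample line) is not cleaned up here.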

edited Dec 14 '14 at 18:42

answered Dec 14 '14 at 18:35

ZdaR

How nice! I ended up doing just how you showed me and got the list parsed. Thanks m8! – punkkapoika Dec 14 '14 at 19:26
