
I just got a giant 1.4 million line dictionary for other programming uses, and I'm sad to see that Notepad++ is not powerful enough to parse it. The dictionary contains three types of lines:

<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>

and I want to extract every word from it into a list of words without duplicates. Let's start with my code.

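f = open('dic.txt')
p = open('parsed_dic.txt', 'w')  # 'w' creates the output file if it is missing
for line in f:  # iterate lazily instead of loading all 1.4m lines at once
    listofwordsfromaline = []  # placeholder: to be filled by the parsing below
    #<ar><k> lines
    #<kref> lines
    #ending to ";" - lines
    for word in listofwordsfromaline:
        p.write(word + "\n")  # write() takes a single string
f.close()
p.close()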

I'm not particularly asking you how to do the whole thing, but anything would be helpful. A link to a tutorial or a parsing method for one type of line would be highly appreciated.

  • In the example you posted, what are the "words"? Given your example input, what would you like the output of _that example_ to look like (even if you have to do it by hand). Give us that, and we can tailor a response to your problem. – Hooked Dec 14 '14 at 18:31
  • This looks like XML, is it? – Doncho Gunchev Dec 14 '14 at 18:38

2 Answers


First, decide what defines a word for you, then write a regular expression that captures those matches. For example, '\b' matches a word boundary: the zero-width position between a word character and a non-word character. https://docs.python.org/2/howto/regex.html

If the word definition differs for each type of line, use if statements to identify the line type first, then apply the corresponding regular expression to match the words.
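
A minimal sketch of this idea in Python 3, assuming the three line types from the question can be told apart by the presence of `<k>`, `<kref>`, or `;`. The pattern names, the `words_from_line` helper, and the choice to treat "a hyphen followed by word characters" as a word form are illustrative assumptions, not something the dictionary format guarantees.

```python
import re

# Hypothetical patterns, one per line type seen in the question's sample.
HEADWORD = re.compile(r"<k>(.*?)</k>")           # <ar><k>-aaltoiseen</k>
REFERENCE = re.compile(r"<kref>(.*?)</kref>")    # ... <kref>-aaltoinen</kref></ar>
# Assumption: in an inflection line, a word form is "-" plus word characters.
FORM = re.compile(r"-\w+")


def words_from_line(line):
    """Return the words found in one dictionary line (empty list if none)."""
    if "<k>" in line:
        return HEADWORD.findall(line)
    if "<kref>" in line:
        return REFERENCE.findall(line)
    if ";" in line:
        return FORM.findall(line)
    return []
```

For instance, `words_from_line("<ar><k>-aaltoiseen</k>")` returns `['-aaltoiseen']`. The captured group in the first two patterns is what `findall` returns, so the surrounding tags are dropped automatically.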

Match groups in Python

tzharg

For the first two cases, every word is enclosed in a specific tag. Looking closely, each word is preceded by the string ">-" and followed by the string "</", so you can slice the line between those two markers:

# First and second cases
start = line.find(">-") + 2   # skip ">" and the leading "-"
end = line.find("</", start)  # first closing tag after the word
required_word = line[start:end]

In the last case you can use the split method:

    word_list = line.split(";")
    ans = []
    for word in word_list:
        start = word.find("-")
        if start != -1:  # skip pieces that contain no hyphen
            ans.append(word[start:])
    ans = set(ans)  # a set drops duplicates
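
Putting both cases together, here is a rough Python 3 sketch of the `find`/`split` approach. The function names are hypothetical, and it assumes that the leading hyphen should be dropped from every word and that the final list should be sorted; neither is stated in the question. It also guards against `str.find` returning -1 when a marker is missing.

```python
def word_between_tags(line):
    """First and second cases: text between '>-' and the next '</'."""
    start = line.find(">-")
    if start == -1:              # str.find returns -1 when nothing is found
        return None
    start += 2                   # skip '>' and the leading hyphen
    end = line.find("</", start)
    if end == -1:
        return None
    return line[start:end]


def words_from_forms(line):
    """Last case: ';'-separated fields whose tokens start with a hyphen."""
    words = []
    for field in line.split(";"):
        for token in field.split():
            if token.startswith("-") and len(token) > 1:
                words.append(token[1:])  # assumption: drop the hyphen
    return words


def unique_words(lines):
    """Collect words from all lines, without duplicates, in sorted order."""
    seen = set()
    for line in lines:
        word = word_between_tags(line)
        if word is not None:
            seen.add(word)
        else:
            seen.update(words_from_forms(line))
    return sorted(seen)
```

Since a file object yields its lines one at a time, `unique_words(open('dic.txt'))` would process the dictionary without reading it all into memory first. Tokens in the last case keep any trailing punctuation, so real data may need extra cleanup.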
ZdaR