Parsing a Moses config file

Question

Given a config file like this one from the Moses Machine Translation Toolkit:

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
KENLM lazyken=0 name=LM0 factor=0 path=/home/gillin/jojomert/ru.kenlm order=5

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5

I need to read the parameters from the [weight] section:

UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5

I have been doing it like this:

def read_params_from_moses_ini(mosesinifile):
    parameters_string = ""
    for line in reversed(open(mosesinifile, 'r').readlines()):
        if line.startswith('[weight]'):
            return parameters_string
        else:
            parameters_string+=line.strip() + ' '

to get this output:

LM0= 0.5 Distortion0= 0.3 LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3 TranslationModel0= 0.2 0.2 0.2 0.2 PhrasePenalty0= 0.2 WordPenalty0= -1 UnknownWordPenalty0= 1

Then I parse the output with

import re

# key, '=', then zero or more whitespace-separated values
moses_param_pattern = re.compile(r'''([^\s=]+)=\s*((?:[^\s=]+(?:\s|$))*)''')

def parse_parameters(parameters_string):
    return dict((k, list(map(float, v.split())))
                   for k, v in moses_param_pattern.findall(parameters_string))


mosesinifile = 'mertfiles/moses.ini'

print(parse_parameters(read_params_from_moses_ini(mosesinifile)))

to get:

{'UnknownWordPenalty0': [1.0], 'PhrasePenalty0': [0.2], 'WordPenalty0': [-1.0], 'Distortion0': [0.3], 'LexicalReordering0': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3], 'TranslationModel0': [0.2, 0.2, 0.2, 0.2], 'LM0': [0.5]}

The current solution involves some crazy reversed line reading of the config file and then a pretty complicated regex to get the parameters.

Is there a simpler or less hacky/verbose way to read the file and achieve the desired parameter dictionary output?

Is it possible to change the configparser so that it reads the Moses config file? It's pretty hard because the file has some erroneous sections that are actually parameters, e.g. [distortion-limit], where there is no key for the value 6. In a valid, configparser-readable file, it would have been distortion-limit = 6.

Note: The native python configparser is unable to handle a moses.ini config file. Answers from How to read and write INI file with Python3? will not work.

If [this post](http://stackoverflow.com/questions/8884188/how-to-read-and-write-ini-file-with-python) does not work for you, please let me know. — Wiktor Stribiżew, Dec 07 '15 at 12:35
@stribizhev, the answer doesn't work, as stated in the question, the standard configparser won't work with erroneous parameter without a key. — alvas, Dec 07 '15 at 12:36
Something like `[input-factors]\n0\n` would cause the `ConfigParser` to fail. — alvas, Dec 07 '15 at 12:37
This is fairly simple. It could be done using regex. What makes this different is that it can have different key/value forms, both single and multiple, depending on the section. That would mean these forms are pre-designed based on constant sections. Am I right about that? So, are you looking to parse the whole thing, or just the `weight` section? If you are just looking for that section, you could just use `import regex` and then use the `\G` anchor to find the key/values. No need to break it up and join it into a special form. — , Dec 10 '15 at 03:21

vks · Answer 1 · 2015-12-13T03:56:18.820

You can simply do this.

x="""#########################
### MOSES CONFIG FILE ###
#########################

# input factors 
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
KENLM lazyken=0 name=LM0 factor=0 path=/home/gillin/jojomert/ru.kenlm order=5

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5"""

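import re

# The inner findall extracts the [weight] block; the outer one splits it into (key, values) pairs.
# The character class includes '-' so that negative weights such as -1 are matched.
print([(i, j.split()) for i, j in re.findall(r"([^\s=]+)=\s*([-\d.\s]+(?<!\s))", re.findall(r"\[weight\]([\s\S]*?)(?:\n\[[^\]]*\]|$)", x)[0])])

Output: [('UnknownWordPenalty0', ['1']), ('WordPenalty0', ['-1']), ('PhrasePenalty0', ['0.2']), ('TranslationModel0', ['0.2', '0.2', '0.2', '0.2']), ('LexicalReordering0', ['0.3', '0.3', '0.3', '0.3', '0.3', '0.3']), ('Distortion0', ['0.3']), ('LM0', ['0.5'])]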

@alvas this regex just takes out the block `[weight]` and then parses the contents of it. — vks, Dec 10 '15 at 09:25
no it didn't really work, it was supposed to get `('TranslationModel0', ['0.2', '0.2', '0.2' ,'0.2' ])` instead of `('TranslationModel0', '0.2')` — alvas, Dec 13 '15 at 02:53

Wiktor Stribiżew · Accepted Answer · 2015-12-10T12:19:40.777

Here is another short regex-based solution that returns a dictionary of the values similar to your output:

import re

dct = {}

text = "MOSES_INI_FILE_CONTENTS"  # placeholder for the contents of moses.ini

# get the [weight] section
# (the regex is equivalent to "(?s)\[weight].*?(?:$|\n\n)")
match_weight = re.search(r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*", text)
if match_weight:
    weight = match_weight.group()  # get the [weight] text
    # map each key to its list of float values
    dct = dict((k, [float(n) for n in v.split()])
               for k, v in re.findall(r"(\w+)\s*=\s*(.*)\s*", weight))

print(dct)

See IDEONE demo

The resulting dictionary contents:

{'UnknownWordPenalty0': [1.0], 'LexicalReordering0': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3], 'LM0': [0.5], 'PhrasePenalty0': [0.2], 'TranslationModel0': [0.2, 0.2, 0.2, 0.2], 'Distortion0': [0.3], 'WordPenalty0': [-1.0]}

The logic:

Get the [weight] block out of the file. This can be done with the regex r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*", which matches [weight] literally and then consumes line after line until it reaches a double \n or the end of the string (the regex uses the unroll-the-loop technique and copes well with longer texts spanning many lines). The equivalent lazy-dot regex, r"(?s)\[weight].*?(?:$|\n\n)", is more readable but less efficient: it takes 528 steps to find the match in the current moses.ini file, against 62 steps for the first regex.
Once the search has run, check for a match. If one is found, run re.findall(r"(\w+)\s*=\s*(.*)\s*", weight) to collect all key-value pairs. The regex captures into Group 1 one or more word characters ((\w+)), then matches optional whitespace, =, and optional whitespace again (\s*=\s*), and then captures into Group 2 any characters other than a newline up to the end of the line. The final \s* consumes the trailing newline and any whitespace after it.
When collecting the keys and values, the values can be returned as lists of floats by using a comprehension.
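The two steps described above can be wrapped in a small function. This is only a sketch: the function name `parse_weights`, the constant names and the sample text are made up for illustration, while the two regexes are the ones from the answer.

```python
import re

# Regexes taken from the answer; the names below are hypothetical.
WEIGHT_BLOCK = re.compile(r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*")
KEY_VALUES = re.compile(r"(\w+)\s*=\s*(.*)\s*")

def parse_weights(text):
    """Return {name: [float, ...]} for the [weight] block, or {} if it is absent."""
    block = WEIGHT_BLOCK.search(text)
    if block is None:
        return {}
    return {key: [float(n) for n in values.split()]
            for key, values in KEY_VALUES.findall(block.group())}

# Shortened, made-up sample; the [other] section checks that parsing
# stops at the blank line that ends the [weight] block.
sample = """[distortion-limit]
6

[weight]
WordPenalty0= -1
TranslationModel0= 0.2 0.2 0.2 0.2
LM0= 0.5

[other]
x= 1
"""

weights = parse_weights(sample)
print(weights)
```

Returning an empty dictionary when there is no [weight] block mirrors the `dct = {}` default in the answer's code; raising an exception instead would be an equally reasonable choice.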

Technically, the `r"\[weight][^\n]*"` would still work if the `[weight]` section is not the last one, right? Since it reads until the newline. — alvas, Dec 10 '15 at 11:58
@alvas: Sorry, I have overcomplicated the solution trying to cram the whole code into the `successive_match` method. Actually, a two regex solution is really the most convenient, readable and more efficient. I have replaced the original answer with a new one. Note that `r"\[weight][^\n]*"` won't do the work at all as it will not match the whole `[weight]` block. It continues up to the double newline character or the end of string. Unrolled regex is the most efficient regex to do that task ever (for a regex of course). Please see my explanations and ask me if there is anything left unclear. — Wiktor Stribiżew, Dec 10 '15 at 12:23

Casimir et Hippolyte · Answer 3 · 2015-12-12T10:40:43.153

Without regex, you can do something like this:

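flag = False  # becomes True once the [weight] header has been seen
result = dict()

# open in text mode so that each line is a str
with open('moses.ini', 'r') as fh:
    for line in fh:
        if flag:
            parts = line.rstrip().split('= ')
            if len(parts) == 2:
                result[parts[0]] = [float(x) for x in parts[1].split()]
            else:
                # blank line: the [weight] block is over
                break
        elif line.startswith('[weight]'):
            flag = True

print(result)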

The file is read line by line in a loop. When [weight] is reached, the flag is set to True, and the key and value(s) are extracted from each of the following lines until a blank line or the end of the file.

This way, only the current line is held in memory, and once the end of the [weight] block is reached, the program stops reading the file.
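The same line-by-line idea can be written as a function that accepts any iterable of text lines, which makes it easy to try without a file on disk. This is a sketch under a few assumptions: the name `read_weights` and the sample input are invented, and it uses `str.partition('=')` rather than `split('= ')`, so it also accepts lines with no space after the equals sign.

```python
import io

def read_weights(lines):
    """Return {name: [float, ...]} from the [weight] block of an iterable of text lines."""
    result = {}
    in_weight = False
    for line in lines:
        line = line.strip()
        if not in_weight:
            # Skip everything up to and including the [weight] header.
            in_weight = line.startswith('[weight]')
            continue
        key, sep, values = line.partition('=')
        if not sep:
            # A blank line or the next section header ends the block.
            break
        result[key.strip()] = [float(v) for v in values.split()]
    return result

# Illustrative input only; a real call would pass an open file object instead.
sample = io.StringIO(
    "[distortion-limit]\n"
    "6\n"
    "\n"
    "[weight]\n"
    "WordPenalty0= -1\n"
    "TranslationModel0= 0.2 0.2 0.2 0.2\n"
    "\n"
    "[ignored]\n"
    "Other0= 9\n"
)
streamed = read_weights(sample)
print(streamed)
```

With a real file, the call would look like `with open('moses.ini') as fh: result = read_weights(fh)`.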

Another way, using itertools:

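from itertools import dropwhile, takewhile

result = dict()

# open in text mode so that each line is a str
with open('moses.ini', 'r') as fh:
    # skip lines until the [weight] header, then skip the header itself
    a = dropwhile(lambda x: not x.startswith('[weight]'), fh)
    next(a, None)
    # a generator expression keeps the reading lazy; stop at the first line that is not "key= values"
    for k, v in takewhile(lambda x: len(x) == 2, (y.rstrip().split('= ') for y in a)):
        result[k] = [float(x) for x in v.split()]

print(result)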
