Parsing Lines in Python: Use RE or Not?

Question

I'm a Perl programmer who's attempting to learn Python by taking some work I've done before and converting it over to Python. This is NOT a line-by-line translation. I want to learn the Python Technique to do this type of task.

I'm parsing a Windows INI file. Section names are in the format:

[<type> <description>]

The <type> is a single-word field and is not case sensitive. The <description> can be multiple words.

After a section, there are a bunch of parameters and values. These are in the form of:

 <parameter> = <value>

Parameters have no blank spaces and can only contain underscores, letters, and numbers (case insensitive). Thus, the first = is the divider between a parameter and the value. There might be white space separating the parameter and value around the equals sign. There might be extra white space at the beginning or end of the line.

In Perl, I used regular expressions for parsing:

while (my $line = <CONTROL_FILE>) {
    chomp($line);
    next if ($line =~ /^\s*[#;']/);     #Comments start with "#", ";", or "'"
    next if ($line =~ /^\s*$/);         #Ignore blank lines

    if ($line =~ /^\s*\[\s*(\w+)\s+(.*)/) {    #Section
        say "This is a '$1' section called '$2'";
    }
    elsif ($line =~ /^\s*(\w+)\s*=\s*(.*)/) {   #Parameter
       say "Parameter is '$1' with a value of '$2'";
    }
    else {      #Not Comment, Section, or Parameter
        say "Invalid line";
    }

}

The problem is that I've been corrupted by Perl, so I think the easiest way to do something is to use a regular expression. Here's the code I have so far...

for line in file_handle:
    line = line.strip

    # Comment lines and blank lines
    if line.find("#") == 1 \
            or line.find(";") == 1 \
            or line.whitespace:
        continue

    # Found a Section Heading
    if line.find("[") == 1:
        print "I want to use a regular expression here"
        print "to split the section up into two pieces"
    elif line.find("=") != -1:
        print "I want to use a regular expression here"
        print "to split the parameter into key and value"
    else
        print "Invalid Line"

There are several things that irritate me here:

There are two places where a regular expression just seems to be calling out to be used. What is the Python way of doing this splitting?
I make sure to strip white space on either side of the string, and rewrite the string. That way, I don't have to do the stripping multiple times. However, I'm rewriting the string, which I understand is a very inefficient operation in Python. What is the Python way to handle this issue?
In the end, my algorithm looks pretty much like my Perl algorithm, and that seems to say that I am letting my Perl thinking get in the way. How should my code be structured in Python?

I've been going through the various online tutorials, and they've helped me with understanding the syntax, but not much in the way of handling the language itself -- especially for someone who tends to think in another language.

My question:

Should I use regular expressions? Or, is there another and better way to handle this?
Is my coding logic correct? How should I be thinking about parsing this file?

Be sure to have a look at the [`ConfigParser`](http://docs.python.org/library/configparser.html) module. — Sven Marnach, Feb 08 '12 at 21:33
@SvenMarnach - Thanks for your suggestion, but I've already seen that. The problem is that the ConfigParser puts the output into a dictionary, and I cannot guarantee the order of the sections in a dictionary, which is really important in this particular application. I had the same issue with Perl with the [Config::Ini](http://search.cpan.org/~rjbs/Config-INI-0.019/lib/Config/INI.pm) module. Besides, this gives me a chance to really learn the ins and outs of Python. — David W., Feb 08 '12 at 21:55
Starting in Python 2.6, you can pass in a different type than `dict` and use one of the libraries offering ordered dictionaries. Starting in Python 2.7, `OrderedDict` is included in the standard library and is the default dictionary type of `ConfigParser`. — Sven Marnach, Feb 08 '12 at 22:07

score 5 · Answer 1 · edited May 23 '17 at 12:31

Python includes an INI parsing library. If you want to build your own library to parse INI files, then you are looking at an actual parser. Regex won't cut it; use PLY or hook in a flex/bison C parser. Additional Python parsing resources are available as well.

Lexer and parser generators handle all of the text consumption and tree construction for you, since it's a mechanical task prone to programmer error. That is, this section:

while (my $line = <CONTROL_FILE>) {
    chomp($line);
    next if ($line =~ /^\s*[#;']/);     #Comments start with "#", ";", or "'"
    next if ($line =~ /^\s*$/);         #Ignore blank lines

    if ($line =~ /^\s*\[\s*(\w+)\s+(.*)/) {    #Section
        say "This is a '$1' section called '$2'";
    }
    elsif ($line =~ /^\s*(\w+)\s*=\s*(.*)/) {   #Parameter
       say "Parameter is '$1' with a value of '$2'";
    }
    else {      #Not Comment, Section, or Parameter
        say "Invalid line";
    }

}

is created by the lexer; you just need to define the correct regular expressions. The parser pulls the tokens from the lexer and determines whether they fit the allowable token patterns. That is:

[<type> <description>]
<parameter> = <value>

Define those tokens, and then how they are allowed to fit together. Everything else just puts itself together. For those of you who think you can do a better job with a quick for loop and some regex, I suggest you read Lex & Yacc, 2nd Ed.
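To make the token/grammar split concrete, here is a rough sketch using only the standard `re` module. It is not PLY and not what a generated lexer would look like; the names `TOKEN_SPEC`, `tokenize`, and `check_grammar` are made up for illustration, and it assumes Python 3.

```python
import re

# One pattern per token kind; each must match a whole (stripped) line.
TOKEN_SPEC = [
    ("COMMENT", r"[#;'].*"),
    ("SECTION", r"\[\s*(?P<type>\w+)\s+(?P<description>.*?)\s*\]"),
    ("PARAMETER", r"(?P<name>\w+)\s*=\s*(?P<value>.*)"),
]
TOKEN_PATTERNS = [(kind, re.compile(pattern)) for kind, pattern in TOKEN_SPEC]


def tokenize(lines):
    """Lexer stage: turn raw lines into (kind, fields) tokens."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines produce no token
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.fullmatch(line)
            if match:
                if kind != "COMMENT":  # comments are recognized, then dropped
                    yield kind, match.groupdict()
                break
        else:
            # No pattern matched this line.
            yield "INVALID", {"text": line}


def check_grammar(tokens):
    """Parser stage: accept only  file := (SECTION PARAMETER*)*"""
    seen_section = False
    for kind, _fields in tokens:
        if kind == "INVALID":
            return False
        if kind == "SECTION":
            seen_section = True
        elif kind == "PARAMETER" and not seen_section:
            return False  # a parameter before any section heading
    return True


tokens = list(tokenize(["[db main store]", "host = x", "; note", "???"]))
print(tokens)
print(check_grammar(tokens[:2]), check_grammar(tokens))
```

The point of the split is that the patterns only say what a token looks like, while the ordering rule lives in one separate place.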

For an example parser I wrote with PLY, go here. It parses a "jetLetter" file, which is just a dialect of groff/troff.

+1 for showing how the 'Python' way of doing a lot of things is often knowing the powerful built-in libraries. — Nick Garvey, Feb 08 '12 at 21:41
Just wanted to throw in a link to [lepl](http://www.acooke.org/lepl/), a nice, lightweight parsing library I recently learned about on this site. — Niklas B., Feb 08 '12 at 21:44

score 5 · Accepted Answer · answered Feb 08 '12 at 21:37

While I don't think this is your intention, the file format appears quite similar to the one handled by Python's built-in ConfigParser module. Sometimes the most "Pythonic" way is already provided for you. (:
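For what it's worth, a minimal sketch of that route, assuming Python 3.7 or later (where the module is spelled `configparser` and sections come back in file order). The helper name `read_sections` and the sample text are invented for illustration; `comment_prefixes` is set explicitly because the apostrophe is not a comment character by default.

```python
import configparser


def read_sections(text):
    """Return [(type, description, params)] in the order found in the text."""
    parser = configparser.ConfigParser(comment_prefixes=("#", ";", "'"))
    parser.read_string(text)
    result = []
    for name in parser.sections():  # sections() keeps file order
        kind, _, description = name.partition(" ")
        result.append((kind.lower(), description.strip(), dict(parser[name])))
    return result


SAMPLE = """\
[Server primary web host]
port = 8080
; a comment line
[client test machine]
name = alpha = beta
"""

for kind, description, params in read_sections(SAMPLE):
    print(kind, "|", description, "|", params)
```

Note that the module lowercases parameter names by default and splits each line at the first delimiter only, which happens to match the case-insensitive, first-`=` rules described in the question.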

In more direct answer to your question: regular expressions may be a good way to do this. Otherwise, you could try the more basic (and less robust):

(parameter, value) = line.split('=')

This raises a ValueError if the line contains no '=' character or more than one. You may want to test first with '=' in line.
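One way to sidestep both failure cases is `str.partition`, which always returns three pieces and splits at the first separator only. A small sketch (the function name `split_parameter` is just for illustration):

```python
def split_parameter(line):
    """Return (parameter, value), or None if the line has no '='."""
    name, separator, value = line.partition("=")
    if not separator:
        return None  # no '=' anywhere in the line
    return name.strip(), value.strip()


print(split_parameter("  path = C:\\temp=old  "))
print(split_parameter("no equals sign here"))
```

Only the first `=` acts as the divider, so any later `=` characters stay inside the value.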

Also:

line.find("[") == 1

is probably better replaced by

line.startswith("[")
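As a quick illustration: `find` returns 0, not 1, when the character is in the first position, and `startswith` also accepts a tuple of prefixes, which covers the comment test in one call. The helper names below are made up for the example.

```python
COMMENT_PREFIXES = ("#", ";", "'")


def is_ignorable(raw_line):
    """True for blank lines and for lines that start with a comment character."""
    line = raw_line.strip()
    return not line or line.startswith(COMMENT_PREFIXES)


def is_section_heading(raw_line):
    """True if the first non-blank character is '['."""
    return raw_line.strip().startswith("[")


print("[x]".find("["))  # index of the first character is 0
print(is_ignorable("   ; a note\n"), is_section_heading("  [type description]"))
```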

Hope that helps a little (:

answered Feb 08 '12 at 21:37

tjvr

Thanks, I actually saw that module, but unfortunately, it stores the results in a dictionary, and you can lose the order the sections were read in. To me, the order of the sections is very important. I had the same issue in Perl with the [Config::Ini](http://search.cpan.org/~rjbs/Config-INI-0.019/lib/Config/INI.pm) module. Besides, the idea is to learn the language. Thanks for the pointer to the `startswith` method. – David W. Feb 08 '12 at 21:58
@David You're welcome. I thought the built-in way wouldn't be quite the same, somehow... :) – tjvr Feb 08 '12 at 22:12
To handle more than one '=' sign, use `line.split('=',1)`. To also address the problem of no '=' signs, use `parameter,value = (line.split('=',1)+[''])[:2]`. Don't put ()'s around the LHS tuple; they are unnecessary clutter. Also be sure to call `line.strip` using `line.strip()` - the code you have will replace line with the bound method strip, something I'm sure is not desired. – PaulMcG Feb 08 '12 at 23:55
And there is no `whitespace` method for str. Easiest way to test for and ignore blank lines is `line = line.strip()` followed by `if not line: continue`. – PaulMcG Feb 08 '12 at 23:59
@PaulMcGuire - The code I posted was pretty much as I first wrote it. I just wanted to make sure I was heading in the right direction on this. I found that `whitespace` isn't a method, and discovered that `split` doesn't take parameters. It'll take me a while to get the hang of Python documentation. I had pretty much already figured out what you stated, except I had parens around my LHS tuples. I'll remove them. It's going to take me a while to learn Python. For example, how do you figure out if a variable is defined? I looked for a `defined` command, but none exists. I realize you do a `try/except`. Thx. – David W. Feb 09 '12 at 17:57

score 0 · Answer 3 · answered Feb 09 '12 at 05:57

Yes, by all means use regular expressions in this case. The syntax of .INI file lines that you’re trying to parse fits mathematically within the characteristics of a Chomsky Type 3 (regular) grammar, which is exactly the sort of thing regular expressions are designed to parse.

The regular expressions you need are (off the top of my head, untested) something like:

r"^\[\s*(\w+)\s+(.*)\]$"

and

r"^(\w+)\s*=\s*(.*)$"

Use re.search; from the returned match object, you can extract the groups corresponding to the parenthesized groupings in the expressions with group(1) and group(2).
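Putting it together, a sketch along these lines might look as follows. It assumes Python 3, uses `\w+` so that whole words are captured, and keeps sections in a list so their order is preserved; the names `SECTION_RE`, `PARAMETER_RE`, and `parse_lines` are illustrative only.

```python
import re

SECTION_RE = re.compile(r"^\[\s*(\w+)\s+(.*?)\s*\]$")
PARAMETER_RE = re.compile(r"^(\w+)\s*=\s*(.*)$")


def parse_lines(lines):
    """Return [(type, description, [(parameter, value), ...])] in file order."""
    sections = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()  # strip once, reuse the result below
        if not line or line[0] in "#;'":
            continue  # blank line or comment
        match = SECTION_RE.search(line)
        if match:
            # The type is case-insensitive, so normalize it to lowercase.
            sections.append((match.group(1).lower(), match.group(2), []))
            continue
        match = PARAMETER_RE.search(line)
        if match and sections:
            sections[-1][2].append((match.group(1), match.group(2)))
            continue
        raise ValueError("Invalid line %d: %r" % (number, raw))
    return sections


sample = ["[Server Main web host]", "  port = 80  ", "# comment", "", "path = a=b"]
print(parse_lines(sample))
```

Because `for line in file_handle` already yields one line at a time, the same function can be handed an open file object directly. A parameter that appears before any section heading is treated as invalid here, which is an assumption about the format rather than something stated in the question.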
