I'm trying to parse big text files with Python.

These files have a syntax like this:

<option1> {
<variable1>=<value1>; //<comment> 
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment> 
}

<option2> {
<variable1>=<value1>; //<comment> 
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment> 
}

...
...

<optionN> {
<variable1>=<value1>; //<comment> 
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment> 
}

I want to look up a single value, for instance the value of <optionK>[<variableT>].

Is there an efficient way to do this with a file parser?

Martijn Pieters
ccamacho
  • @sshashank124: The OP stated the file is huge; regex would require you read the whole file into memory, perhaps not the most practical advice? – Martijn Pieters Mar 20 '14 at 09:57
  • @MartijnPieters: `mmap` allows you to apply regex to a huge file. See [How to read tokens without reading whole line or file](http://stackoverflow.com/q/20019503/4279) – jfs Mar 20 '14 at 10:07
  • you could try something like `lepl` (discontinued) to parse the file, here's a [code example](http://stackoverflow.com/a/7357689/4279) – jfs Mar 20 '14 at 10:14
  • @JFSebastian: Can't look it up right now, but Jon Clements found the other day that you couldn't if the file was larger than available memory. But I have no first-hand experience there and I'll happily defer to you. I'd read the file line by line, detecting sections, myself. – Martijn Pieters Mar 20 '14 at 10:34
  • @MartijnPieters: My answer explicitly says *"It works even if the file doesn't fit in memory."* I wouldn't have said that if I hadn't tried it. I also would not use a single regex to parse the file. I just mentioned it to say that it is possible – jfs Mar 26 '14 at 21:30
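
The line-by-line approach suggested in the comments above could look roughly like the sketch below. It is only an illustration under some assumptions that the question does not confirm: each assignment sits on its own line, an option header ends with `{`, a block is closed by a line containing only `}`, and `//` never appears inside a value. The names `iter_settings` and `lookup` are made up for this example.

```python
import io


def iter_settings(lines):
    """Yield (option, variable, value) tuples from an iterable of lines.

    Assumes one assignment per line and that "//" only starts a comment.
    """
    option = None
    for raw in lines:
        line = raw.split("//", 1)[0].strip()  # drop a trailing // comment
        if not line:
            continue
        if line.endswith("{"):
            option = line[:-1].strip()        # start of a new option block
        elif line == "}":
            option = None                     # end of the current block
        elif option is not None and "=" in line:
            name, _, value = line.rstrip(";").partition("=")
            yield option, name.strip(), value.strip()


def lookup(lines, option, variable):
    """Return the first matching value, or None if it is not found."""
    for opt, name, value in iter_settings(lines):
        if opt == option and name == variable:
            return value
    return None


# Hypothetical sample data standing in for a real file.
sample = io.StringIO(
    "option1 {\n"
    "alpha=1; // first\n"
    "beta=two;\n"
    "}\n"
    "\n"
    "option2 {\n"
    "alpha=3;\n"
    "}\n"
)
print(lookup(sample, "option2", "alpha"))  # prints 3
```

For a real file, an open file object can be passed instead of the `StringIO`, for example `with open(path) as f: lookup(f, "option2", "alpha")`. File objects are iterated one line at a time, so the whole file never has to sit in memory, and the search stops at the first match.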

1 Answer

Taking your example above literally, one (admittedly ugly) solution is to use HTMLParser (http://docs.python.org/2/library/htmlparser.html), as follows:

test = """
<option1> {
<variable1>=<value1>; //<comment>
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment>
}

<option2> {
<variable1>=<value1>; //<comment>
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment>
}

...
...

<optionN> {
<variable1>=<value1>; //<comment>
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment>
}

"""

# Python 3 import; on Python 2 use: from HTMLParser import HTMLParser
from html.parser import HTMLParser

# Create a subclass and override the handler methods.
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.option = ""
        self.key = ""
        self.value = ""
        self.currentData = ""  # text seen since the previous tag
        self.r = {}            # per-instance result: {option: {variable: value}}

    def handle_starttag(self, tag, attrs):
        # Note: HTMLParser lowercases tag names, so <variableN> arrives as "variablen".
        print("Encountered a start tag:", tag)
        if "option" in tag:
            self.option = tag
            self.r[self.option] = {}
        elif "{" in self.currentData or ("=" not in self.currentData and "//" not in self.currentData):
            # First tag after "{" or at the start of a new line: a variable name.
            self.key = tag
            self.r[self.option][self.key] = ""
        elif "=" in self.currentData:
            # Tag directly after "=": the value of the current variable.
            self.value = tag
            self.r[self.option][self.key] = self.value

    def handle_endtag(self, tag):
        self.currentData = ""
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        self.currentData = data
        print("Encountered some data  :", data)
        # find a condition to yield result here "}" ?

# Instantiate the parser and feed it the sample text.
parser = MyHTMLParser()
parser.feed(test)
print(parser.r)
Ali SAID OMAR