Python parse text file into nested dictionaries

Question

Consider the following data structure:

[HEADER1]
{
   key value
   key value
   ...
   [HEADER2]
   {
      key value
      ...
   }
   key value
   [HEADER3]
   {
      key value
      [HEADER4]
      {
         key value
         ...
      }
   }
   key value
}

There are no indents in the raw data; I added them here for clarity. The number of key-value pairs is unknown: '...' indicates there could be many more within each [HEADER] block. The number of [HEADER] blocks is also unknown.

Note that the structure is nested, so in this example header 2 and 3 are inside header 1 and header 4 is inside header 3.

There can be many more (nested) headers, but I kept the example short.

How do I go about parsing this into a nested dictionary structure? Each [HEADER] should be the key to whatever follows inside the curly brackets.

The final result should be something like:

dict = {'HEADER1': 'contents of 1'}
contents of 1 = {'key': 'value', 'key': 'value', 'HEADER2': 'contents of 2', etc}

I'm guessing I need some sort of recursive function, but I am pretty new to Python and have no idea where to start.

For starters, I can pull out all the [HEADER] keys as follows:

path = 'mydatafile.txt'
keys = []

with open(path, 'rt') as file:
   for line in file:
      if line.startswith('['):
         keys.append(line.rstrip('\n'))

for key in keys:
   print(key)

But then what? Maybe this is not even needed.

Any suggestions?

So are there really headers without closing `}`s and also double `}`s and values outside of `{}`s? — Jon Clements, Oct 21 '17 at 18:06
No, each header is followed by {...}, but since they can be nested, there could be two closing brackets on adjacent lines. — koen, Oct 21 '17 at 18:07
Oh wait, is header 2 within header1 ? Might be an idea to show how you'd expect the output `dict` to actually look — Jon Clements, Oct 21 '17 at 18:08
@Koen. You **really** need to make it clearer in your question that the headers/sections can be nested. — ekhumoro, Oct 21 '17 at 18:10
@Koen. PS: could this be some kind of standard format? Where are you getting these files from? — ekhumoro, Oct 21 '17 at 18:13
Gotcha, I updated the question. The data comes from analytical software, and I am stuck with that format. No idea if this is some sort of standard format. — koen, Oct 21 '17 at 18:15
Also there are no indents in the input file, which probably caused the confusion. — koen, Oct 21 '17 at 18:16
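
Given the clarifications in the comments above (every header is followed by a `{...}` block, blocks can nest, and the file has no indentation), the recursive idea from the question could be sketched roughly as follows. This is only a minimal sketch under assumptions not stated in the question: each header, brace, and key-value pair sits on its own line, the key is everything before the first space, and all values are kept as strings. The names `parse_block` and `parse_text` are made up for illustration.

```python
def parse_block(lines):
    """Read lines from a shared iterator until the matching '}' and return a dict."""
    result = {}
    for raw in lines:
        line = raw.strip()
        if not line or line == '{':
            continue                      # skip blank lines and opening braces
        if line == '}':
            return result                 # end of the current block
        if line.startswith('[') and line.endswith(']'):
            # A header: the '{ ... }' block that follows becomes its value.
            result[line[1:-1]] = parse_block(lines)
        else:
            # Assumption: the key is the text before the first space.
            key, _, value = line.partition(' ')
            result[key] = value.strip()
    return result                         # reached at the end of the input


def parse_text(text):
    # iter() gives one iterator that all recursion levels consume in turn.
    return parse_block(iter(text.splitlines()))


sample = """[HEADER1]
{
a 1
[HEADER2]
{
b 2
}
c 3
[HEADER3]
{
d 4
[HEADER4]
{
e 5
}
}
f 6
}
"""
result = parse_text(sample)
print(result['HEADER1']['HEADER3']['HEADER4'])  # {'e': '5'}
```

Because a dict cannot hold the same key twice, a repeated key inside one block would overwrite the earlier value; if the real files repeat keys, the values would need to be collected in lists instead.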

Ashish Ranjan · Accepted Answer · 2017-10-21T19:08:15.170

You can do it by pre-formatting the file content with a few regex substitutions and then passing the result to json.loads.

Apply substitutions like these one by one ($1, $2, ... stand for captured groups; in Python's re.sub they are written \1, \2, ...):

#1 \[(\w*)\]\n -> "$1":

#2 \}\n(\w) -> },$1

#3 (\w*)\s(\w*)\n([^}]) -> $1:$2,$3

#4 (\w*)\s(\w*)\n\} -> $1:$2}

Finally, pass the resulting string to json.loads:

import json
d = json.loads(s)  # s is the string left after the substitutions above

which parses it into a dict.

Explanation :

1. \[(\w*)\]\n : replaces [HEADER]\n with "HEADER":

2. \}\n(\w) : adds a comma after any closing brace } that is followed by another element, giving },

3. (\w*)\s(\w*)\n([^}]) : replaces key value\n with key:value, for lines that are followed by another element

4. (\w*)\s(\w*)\n\} : replaces key value\n with key:value for the last line of a block, which is followed by }

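So, with minor modifications to these regexes (for example, json.loads needs keys and string values in double quotes and an enclosing pair of braces) you will be able to parse the file into a dict. The basic concept is to reformat the file contents into a format that can be parsed easily.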
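
One way the "minor modifications" could look in Python is sketched below. It is an adaptation of the idea, not the exact regexes listed above: it uses `\1`-style group references, quotes every key and value, and wraps the result in braces so that `json.loads` accepts it. It assumes keys are single words, values contain no double quotes or backslashes, and each header's opening brace is on the next line. The function name `to_dict` is made up for illustration.

```python
import json
import re


def to_dict(text):
    s = text.strip()
    # [HEADER] plus its opening brace  ->  "HEADER": {
    s = re.sub(r'\[(\w+)\]\s*\{', r'"\1": {', s)
    # key value (a whole line)  ->  "key": "value"
    s = re.sub(r'^(\w+)[ \t]+(.*)$', r'"\1": "\2"', s, flags=re.MULTILINE)
    # Add a comma after a value or a closing brace when another entry follows.
    s = re.sub(r'(["}])\s*\n\s*(?=")', r'\1,\n', s)
    # JSON needs one enclosing object around the top-level headers.
    return json.loads('{' + s + '}')


sample = """[HEADER1]
{
a 1
[HEADER2]
{
b 2
}
c 3
}
"""
d = to_dict(sample)
print(d)  # {'HEADER1': {'a': '1', 'HEADER2': {'b': '2'}, 'c': '3'}}
```

All substitutions run on the whole file content as one string, since the comma rule has to look across line boundaries.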

edited Oct 21 '17 at 19:08

answered Oct 21 '17 at 18:37

Ashish Ranjan


So how do I iterate over the lines and accomplish this? `for line in file: line = re.sub("\[(\w*)\]\n", "", line)` is not changing anything? – koen Oct 21 '17 at 19:20
Don't iterate over lines. Read the whole file, apply the first regex to the file content, and then pass the resulting string to the next regex – Ashish Ranjan Oct 21 '17 at 19:23
I see, I tried that: `s = open(path, 'rt').read(); s1 = re.sub("\[(\w*)\]\n", "", s)`, but no changes. – koen Oct 21 '17 at 19:29
Also, don't replace with an empty string; check the answer for what to substitute for each regex. See this for how to use captured groups: https://stackoverflow.com/questions/6711567/how-to-use-python-regex-to-replace-using-captured-group – Ashish Ranjan Oct 21 '17 at 19:32
If your file content has a format inconsistent with the one in the question, these regexes obviously won't work, and you'll have to make some minor tweaks according to your needs. I suspect that might be the case. – Ashish Ranjan Oct 21 '17 at 19:35
I'm using a test file exactly as above, I guess I'm having trouble with the syntax; this: `s1 = re.sub('\[(\w*)\]\n', '"$1":', s)` will replace `[HEADER1]` with `"$1":` – koen Oct 21 '17 at 19:40
Check out the above link, that'll help a lot with substitution. Basically, you need to use `\1` instead of `$1` in Python – Ashish Ranjan Oct 21 '17 at 19:41
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/157202/discussion-between-koen-and-ashish-ranjan). – koen Oct 21 '17 at 20:37
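
The `$1` versus `\1` point from the comments above can be checked with a small snippet. It is only an illustration: the sample string is made up, and the pattern is the first one from the answer.

```python
import re

text = "[HEADER1]\n{"

# '$1' has no special meaning in Python's re module, so it is inserted literally.
literal = re.sub(r'\[(\w*)\]\n', '"$1":', text)

# r'\1' (or the equivalent r'\g<1>') refers to the first captured group.
with_group = re.sub(r'\[(\w*)\]\n', r'"\1":', text)

print(literal)     # "$1":{
print(with_group)  # "HEADER1":{
```

Using raw strings (the `r'...'` prefix) for both the pattern and the replacement keeps the backslashes from being interpreted by Python before `re` sees them.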
