I have a huge text file that stores information in the following format:

someOtherMessage{
              class = "someClass";
      sampleMessage{
                  someValue{
                      someText{
                          someParam = "value";
                          someSymbol = "another_symbol";
                      }; //someText
                  }; //someValue
       }; //sampleMessage
    }; //someOtherMessage

someOtherMessage2{
              class = "someClass2";
      sampleMessage2{
                  someValue2{
                      someText2{
                          someParam = "value2";
                          someSymbol = "another_symbol2";
                      }; //someText2
                  }; //someValue2
       }; //sampleMessage2
    }; //someOtherMessage2

I want to iterate over this file with a Python script and build a dict (or any other data structure) in the following format.

For example:

dict = {'someOtherMessage': 'someOtherMessage{
              class = "someClass";
      sampleMessage{
                  someValue{
                      someText{
                          someParam = "value";
                          someSymbol = "another_symbol";
                      }; //someText
                  }; //someValue
       }; //sampleMessage
    }; //someOtherMessage',

'someOtherMessage2': 'someOtherMessage2{
          class = "someClass2";
  sampleMessage2{
              someValue2{
                  someText2{
                      someParam = "value2";
                      someSymbol = "another_symbol2";
                  }; //someText2
              }; //someValue2
   }; //sampleMessage2
}; //someOtherMessage2'
}

I used the following regex, but it matches everything between the first and the last curly brace. How can I make it match each top-level block separately?

r"(?s){(.*)}"

1 Answer


Here is one possible solution to your problem. I am assuming you are familiar with greedy and lazy quantifiers. If not, here is a link: Greedy vs. Reluctant vs. Possessive Qualifiers

(?s)\{(.*?)\};.*?(\n\n|$)

Error

In the original regex, (?s){(.*)}, you used a greedy quantifier. It matches too much: the match runs from the first { all the way to the last } in the file.

If you replace it with a lazy quantifier, (?s){(.*?)}, it matches too little: the match stops at the first }, which closes the innermost block rather than the outer one.
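To see the difference concretely, here is a minimal Python sketch. The one-line sample string is made up for illustration (it is not the data from the question), and the braces are escaped for clarity.

```python
import re

# Made-up sample with one nested block followed by a second block.
text = "a{ b{ x = 1; }; }; c{ y = 2; };"

# Greedy: .* runs from the first { to the LAST } in the input.
greedy = re.findall(r"(?s)\{(.*)\}", text)

# Lazy: .*? stops at the FIRST } after each {, cutting nested blocks short.
lazy = re.findall(r"(?s)\{(.*?)\}", text)

print(greedy)  # a single match that swallows both blocks
print(lazy)    # two matches, the first one truncated inside block "a"
```

Neither result lines up with the top-level blocks, which is why an extra anchor is needed.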


Correction

To identify the correct closing }, we need something to anchor the match on. This is where the blank lines between the blocks come into play.

    }; //someOtherMessage

someOtherMessage2{

Here there is a \n after the //someOtherMessage comment, and then another \n from the blank line that follows. So we add \n\n to the pattern:

(?s)\{(.*?)\};.*?\n\n

which simply means {...something here...};...something here...\n\n.

This regex will still not match the last block, since that block is not followed by a blank line.

}; //someOtherMessage2'
}

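To match it as well, we add $ as an alternative to \n\n. Without the m (multiline) modifier, $ matches at the end of the input rather than at the end of every line, so it anchors the last block. This changes our regex to: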

 (?s)\{(.*?)\};.*?(\n\n|$)
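As a rough sketch, this is how the regex could be used in Python to build the dict from the question. The leading (\w+) group that captures the block name, the non-capturing (?:...) group, and the names SAMPLE, BLOCK_RE and split_blocks are additions for illustration and not part of the regex above. It assumes top-level blocks are separated by a blank line and contain no blank lines themselves.

```python
import re

# Shortened version of the data from the question.
SAMPLE = """someOtherMessage{
    class = "someClass";
    sampleMessage{
        someText{
            someParam = "value";
        }; //someText
    }; //sampleMessage
}; //someOtherMessage

someOtherMessage2{
    class = "someClass2";
    sampleMessage2{
        someText2{
            someParam = "value2";
        }; //someText2
    }; //sampleMessage2
}; //someOtherMessage2
"""

# (\w+) captures the block name; the rest is the regex from the answer.
BLOCK_RE = re.compile(r"(?s)(\w+)\{.*?\};.*?(?:\n\n|$)")


def split_blocks(text):
    """Map each top-level block name to the full text of that block."""
    blocks = {}
    for match in BLOCK_RE.finditer(text):
        # group(0) is the whole block; strip the trailing blank line.
        blocks[match.group(1)] = match.group(0).strip()
    return blocks


blocks = split_blocks(SAMPLE)
for name in blocks:
    print(name)
```

For a real file, the text would come from something like open(path).read(), which loads the whole file into memory.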

I hope this helps with your problem. Please note that this is only one possible solution. Another approach is to match the nested braces themselves, which is also covered on SO, although it can be a bit more complex. Here is one such link: Can regular expressions be used to match nested patterns?
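If the blank-line anchor turns out to be too fragile, one alternative that avoids regex entirely is to count brace depth line by line. This is a different technique from the one above, shown only as a minimal sketch. It assumes braces never appear inside quoted values or comments, and the names SAMPLE and split_top_level_blocks are made up for illustration.

```python
# Shortened version of the data from the question.
SAMPLE = """someOtherMessage{
    class = "someClass";
    sampleMessage{
        someText{
            someParam = "value";
        }; //someText
    }; //sampleMessage
}; //someOtherMessage

someOtherMessage2{
    class = "someClass2";
    sampleMessage2{
        someParam = "value2";
    }; //sampleMessage2
}; //someOtherMessage2
"""


def split_top_level_blocks(lines):
    """Collect each top-level block by tracking how deeply braces are nested."""
    blocks = {}
    depth = 0
    name = None
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if depth == 0:
            if "{" not in line:
                continue  # blank or stray line between blocks
            name = line.split("{", 1)[0].strip()
            current = []
        current.append(line)
        depth += line.count("{") - line.count("}")
        if depth == 0:
            # The brace that opened this block has just been closed.
            blocks[name] = "\n".join(current)
    return blocks


blocks = split_top_level_blocks(SAMPLE.splitlines())
print(list(blocks))
```

Because the function only needs an iterable of lines, an open file object can be passed in directly, so a huge file does not have to be read into memory at once.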

If you have any other questions, please mention them.

AKSingh