
I've been banging my head against this for hours, but I'm obviously lacking fundamental Regex knowledge to do what I want.

I have a WKT (well-known text, see http://en.wikipedia.org/wiki/Well-known_text) string that looks like this:

PROJCS["MGI / Austria GK Central",GEOGCS["MGI",DATUM["Militar_Geographische_Institute",SPHEROID["Bessel 1841",6377397.155,299.1528128000009,AUTHORITY["EPSG","7004"]],AUTHORITY["EPSG","6312"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433],AUTHORITY["EPSG","4312"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",13.33333333333333],PARAMETER["scale_factor",1],PARAMETER["false_easting",0],PARAMETER["false_northing",-5000000],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AUTHORITY["EPSG","31255"]]

I want to parse this string into key / value pairs. So, as an example:

SPHEROID["Bessel 1841",6377397.155,299.1528128000009,AUTHORITY["EPSG","7004"]] would become:

key: SPHEROID

value: "Bessel 1841",6377397.155,299.1528128000009,AUTHORITY["EPSG","7004"]

By matching against \[(.*?)\] I'm getting all the values (see http://rubular.com/r/6SxMbRMufJ), but I'm losing the keys. How can I create a Regex where the first group is the key, and the second group is the value?

Also, is there a way to split nested values (like key[key[value]]) as well, or do I have to use recursion on every match?

Knaģis
lightxx
  • Regex is a very bad match for this type of parsing. The recursive part is close to impossible. You also have to account for values like "[key]": if the brackets are within quotes, you must ignore them. Your best bet is to write a parser that reads one character at a time and builds the result object tree. There will be no recursion, and the parsing will work in linear time. – Knaģis Oct 14 '13 at 16:31
  • The all-time favorite [parse HTML with RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) gives good reasons why this is not easy, as well as explanations of how it can be done with RegEx. Enjoy reading. – Alexei Levenkov Oct 14 '13 at 16:31
  • The last part of the question is just icing on the cake. All I need right now is the key/value example I provided above. This should be possible with Regexes, no? – lightxx Oct 14 '13 at 16:33
  • I'm not sure what you mean by "last part of the question"... The whole question reads to me as "how to match nested brackets with RegEx" with 2 samples: the outer "SPHEROID" case and the inner nested key/value. Also, I could be totally off... – Alexei Levenkov Oct 14 '13 at 16:39
  • I somehow hoped there would be a more elegant solution to this than character-wise parsing :( – lightxx Oct 15 '13 at 04:42
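
The character-by-character approach suggested in the first comment could be sketched roughly as follows. This is only an illustration under several assumptions: Python is used because the question names no language, `parse_wkt` is a hypothetical function name, and the result shape (a list of `(key, values)` tuples, with string values keeping their quotes) is an arbitrary choice. Brackets inside quoted strings are ignored, nesting is tracked with an explicit stack instead of recursion, and escaped quotes get no special treatment.

```python
def parse_wkt(text):
    """Parse a WKT-like string into nested (key, [values]) tuples in one pass."""
    root = []          # values found at the top level
    stack = [root]     # stack[-1] is the list of the innermost open node
    token = []         # characters of the token currently being read
    in_quotes = False

    def flush():
        # Store the pending token (if any) as a value of the innermost node.
        value = "".join(token).strip()
        token.clear()
        if value:
            stack[-1].append(value)

    for ch in text:
        if in_quotes:
            # Inside quotes every character is literal, including [ ] and ,
            token.append(ch)
            if ch == '"':
                in_quotes = False
        elif ch == '"':
            token.append(ch)
            in_quotes = True
        elif ch == '[':
            # The pending token is the key of a new nested node.
            children = []
            stack[-1].append(("".join(token).strip(), children))
            token.clear()
            stack.append(children)
        elif ch == ']':
            flush()
            if len(stack) == 1:
                raise ValueError("unbalanced ']'")
            stack.pop()
        elif ch == ',':
            flush()
        else:
            token.append(ch)

    flush()
    if in_quotes or len(stack) != 1:
        raise ValueError("unterminated string or bracket")
    return root


tree = parse_wkt('SPHEROID["Bessel 1841",6377397.155,299.1528128000009,AUTHORITY["EPSG","7004"]]')
print(tree)
```

Each character is visited exactly once, so the running time is linear in the length of the input, and the nested `AUTHORITY` element ends up as a child tuple inside the `SPHEROID` values.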

1 Answer


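The regular expression that achieves the minimum you are asking for is `([^\[]+?)\[(.*)\]`. Applied to a single element such as the SPHEROID example, the first group captures the key and the second group captures everything between the first `[` and the last `]`.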
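
As a minimal sketch of how the two groups come out, here is that pattern applied to the SPHEROID element from the question with Python's `re` module. Python is an assumption made for illustration (the question does not name a language), and `split_key_value` is a hypothetical helper name. The pattern is only reliable on one isolated element, not on the full WKT string.

```python
import re

# Group 1: everything before the first "[" (the key).
# Group 2: everything between that "[" and the last "]" (the value).
PATTERN = re.compile(r'([^\[]+?)\[(.*)\]')


def split_key_value(element):
    """Split one WKT element into (key, value); return None if it does not match."""
    match = PATTERN.fullmatch(element)
    if match is None:
        return None
    return match.group(1), match.group(2)


spheroid = 'SPHEROID["Bessel 1841",6377397.155,299.1528128000009,AUTHORITY["EPSG","7004"]]'
key, value = split_key_value(spheroid)
print("key:", key)
print("value:", value)
```

To go one level deeper you would have to apply the pattern again to each nested element inside `value`, which is where a real parser becomes the more practical choice.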


However, since you are parsing a well-defined format, you should look for an existing parser that handles it.

For example, you can look at the code from http://www.dupuis.me/node/28

Also, http://gis.stackexchange.com has answers that mention other libraries: https://gis.stackexchange.com/questions/13078/how-to-unproject-wkt-to-wkt-in-net

Knaģis
  • thanks! could you please provide a short explanation of the parts of your regex? I'd really like to understand what's going on! – lightxx Oct 14 '13 at 18:32
  • `()` marks the groups you capture. The main difference from your initial pattern is the first group, which matches all characters up to the first `[`: `[^x]` means any character except `x`. Your mistake was using `.*?`, which matches as few characters as possible; that is wrong here because it matches `[[]` instead of `[[]]`. – Knaģis Oct 14 '13 at 18:38
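
The lazy versus greedy difference described in that comment can be checked with a short experiment. The snippet below is a sketch in Python (an assumption, since the question does not name a language) using the same `[[]]` input the comment mentions.

```python
import re

text = "[[]]"

# Lazy: ".*?" stops at the first "]" it can reach.
lazy = re.search(r"\[(.*?)\]", text).group(0)

# Greedy: ".*" runs as far as possible, then backs off to the last "]".
greedy = re.search(r"\[(.*)\]", text).group(0)

print(lazy)    # [[]
print(greedy)  # [[]]
```

The greedy form only finds the matching bracket here because the outer pair spans the whole input; it does not actually count nesting levels.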