
The general problem:

I've got a lot of data that I'm trying to clean up and then parse. Each line is really long, but they all have the same structure: a line starts with one unique substring, followed by a second unique substring, followed by a substring that repeats about 20 times.

So it's: String A, String B, String C, String C, String C, etc. Every line is in that format.

At the start of String A is an ID, just a unique six-digit number. I'm trying to insert that ID at the beginning of String B and all of the String Cs.

String C is the problem. I can write regexes for each of the ID, B, and C, but trying to insert the captured ID into all the Cs fails: it only works on the last one. That's actually the correct behavior here, since a repeated capture group keeps only its final match, but I'm pretty sure there is a way to treat String C so that each instance of the substring is handled separately, with the regex running over it again and again.
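A minimal sketch of that "only the last one" behavior, shown with Python's `re` module rather than Sublime Text's Boost engine. The toy string and patterns below are made up for illustration and are not the real data.

```python
import re

# Toy stand-in for one line: an ID followed by a repeated "C" chunk (hypothetical format).
line = "id=121084;C1;C2;C3;"

# The inner group is repeated by the outer (?:...)+, but a capture group
# only remembers the text from its final iteration.
m = re.match(r"id=(\d{6});(?:(C\d;))+", line)
last_only = m.group(2)

# Matching the chunk on its own makes every occurrence a separate match.
each_one = re.findall(r"C\d;", line)

print(last_only)
print(each_one)
```

This is why a single pattern with a repeated group cannot hand back all of the Cs at once; each C has to be its own match.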

I tried using '\G' syntax but I can't seem to make it work.

So here's a specific example using some massively abridged sample data:

['sample_id':121084,[122,'southwest',7.23,[[['station_01',[1]],['station_02',[1]], ['station_03',[22]],['station_04',[49]],['station_05',[1]],['station_06',[4]],['station_07',[101]],['station_08',[22]]]],[[['run':133225,'marker':'SAM',[[['substation_01',[1]],['substation_02',[3]],['substation_03',[16]],['substation_04',[15]],['substation_05',[14]],['substation_06',[6]],['substation_07',[41]],['substation_08',[19]],['substation_09',[13]],['substation_10',[1]],['substation_11',[13]],['substation_12',[1]]]],'TK',22,34,127],['run':608049,'marker':'TIM',[[['substation_01',[12]],['substation_02',[6]],['substation_03',[17]],['substation_04',[11]],['substation_05',[1]],['substation_06',[6]],['substation_07',[5]],['substation_08',[19]]]],'TM',21,21,966],['run':445801,'marker':'RON',[[['substation_01',[5]],['substation_02',[5]],['substation_03',[6]],['substation_04',[11]],['substation_05',[1]],['substation_06',[15]],['substation_07',[11]],['substation_08',[16]],['substation_09',[1]],['substation_10',[13]],['substation_11',[3]]]],'TR',12,33,521],['run':142278, etc...

Just a note: The only difference between String B and all the String Cs is the number of brackets, but that's actually useful once I start parsing this out (ultimately it'll all be JSON).

What I'm trying to get is:

['sample_id':121084,[122,'southwest',7.23,[[['station_01',[1]],['station_02',[1]],['station_03',[22]],['station_04',[49]],['station_05',[1]],['station_06',[4]],['station_07',[101]],['station_08',[22]]]],[[['sample_id':121084,'run':133225,'marker':'SAM',[[['substation_01',[1]],['substation_02',[3]],['substation_03',[16]],['substation_04',[15]],['substation_05',[14]],['substation_06',[6]],['substation_07',[41]],['substation_08',[19]],['substation_09',[13]],['substation_10',[1]],['substation_11',[13]],['substation_12',[1]]]],'TK',22,34,127],['sample_id':121084,'run':608049,'marker':'TIM',[[['substation_01',[12]],['substation_02',[6]],['substation_03',[17]],['substation_04',[11]],['substation_05',[1]],['substation_06',[6]],['substation_07',[5]],['substation_08',[19]]]],'TM',21,21,966],['sample_id':121084,'run':445801,'marker':'RON',[[['substation_01',[5]],['substation_02',[5]],['substation_03',[6]],['substation_04',[11]],['substation_05',[1]],['substation_06',[15]],['substation_07',[11]],['substation_08',[16]],['substation_09',[1]],['substation_10',[13]],['substation_11',[3]]]],'TR',12,33,521],['sample_id':121084, etc...

In the latter text block each substring now begins with the ID 'sample_id':121084 (I bolded it to make it slightly easier to see what's going on).

Here's the Regex that gets me up through String C.

\[('sample_id':\d{6},)(?:.+\]\]\],\[\[)\[(.+?\d\],)\[(.+?\d\],)

So I'm trying to insert that first capture group ($1) in front of the second group, then the third group over and over and over (about 20x). If I repeat the last bit, I end up killing all but one of the C Strings, which again, I believe to be the 'proper' behavior. I'm trying to figure out how to get around that.

It's a mess, I know. But each of those is just one line, and I've got doc after doc that'll have 100 or so lines like that. So a regex that doesn't break up the lines seems best.
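One way the per-line insertion could be sketched outside Sublime Text is a short Python function. This is only an illustration under stated assumptions: the function name `insert_sample_id` is hypothetical, the sample line is a further-abridged version of the data above with closing brackets added, and it assumes every String B and String C begins with `['run':` and that `'run':` appears nowhere else on the line.

```python
import re

def insert_sample_id(line):
    """Copy the leading 'sample_id':NNNNNN, pair in front of every 'run': key."""
    m = re.search(r"'sample_id':\d{6},", line)
    if m is None:
        return line  # no ID found: leave the line untouched
    id_pair = m.group(0)
    # Assumption: each String B / String C starts with ['run': so prefix every
    # 'run': that directly follows an opening bracket. The line is never split.
    return re.sub(r"(?<=\[)'run':", lambda _: id_pair + "'run':", line)

# Further-abridged sample line (closing brackets added so it is self-contained).
sample = ("['sample_id':121084,[122,'southwest',7.23,[[['station_01',[1]]]],"
          "[[['run':133225,'marker':'SAM',[[['substation_01',[1]]]],'TK',22,34,127],"
          "['run':608049,'marker':'TIM',[[['substation_01',[12]]]],'TM',21,21,966]]]")

result = insert_sample_id(sample)
print(result)
```

Because the ID is captured once and then reused for every `'run':` match, the repeated-group limitation never comes into play. The same function could be applied to each line of a file in turn.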

I went over this page a few times trying to engineer a solution, but again, I couldn't make the \G syntax work here.

Collapse and Capture a Repeating Pattern in a Single Regex Expression

Should mention I'm trying to do this in Sublime Text 2. Thanks for any help.

noLongerRandom
  • I'd look into pyparsing to do this... – dawg Sep 04 '14 at 21:55
  • Turns out the solution to this was to learn enough Python to make the changes (didn't need pyparsing, but thanks for the tip). Still couldn't get the \G syntax to function anything like it should, and the corresponding documentation for Boost Regex is incredibly inadequate, but those are problems for another day. – noLongerRandom Sep 09 '14 at 20:57

0 Answers