Regex to capture colon-separated key-value pairs, with multi-line values

Question

I'm currently working on a project in Ruby on Rails (in Eclipse) and my task is to split up a block of data into relevant parts using Regular Expressions.

I've decided to break up the data based on 3 parameters:

The line must start with a capital letter (RegEx equivalent - /^[A-Z]/)
It must end with a : (RegEx equivalent - /:$/)

I would appreciate any help. The code I'm using in my controller is:

@f = File.open("report.rtf")  
@fread = @f.read  
@chunk = @fread.split(/\n/)

where @chunk is the array that will be created by the split and @fread is the data that is being split up (by new lines).

Any help will be appreciated, thanks a lot!

I cannot release the exact data but it goes basically by this (this is medically related)

Exam 1: CBW 8080

RESULT:

This report is dictated with specific measurement. Please see the original report.

COMPARISON: 1/30/2012, 3/8/12, 4/9/12

RECIST 1.1: BLAH BLAH BLAH

The ideal output would be an array that says:

["Exam 1:", "CBW 8080", "RESULT:", "This report is dictated with specific measurement. Please see the original report.", "COMPARISON:", "1/30/2012, 3/8/12, 4/9/12", "RECIST 1.1:", "BLAH BLAH BLAH"]

PS I'm just using \n as a placeholder until I get it working

Is the data coming from a file? Why not use readlines instead of manually splitting on a newline? — Mark Thomas, Jun 18 '12 at 18:39
We need more info. Can there be colons in the middle of a line? Does a line end with a colon or will there also be a newline character after the colon? What to do with lines not matching? What about accented/foreign capital letters? What is the relevance of Eclipse here? — Lars Haugseth, Jun 18 '12 at 18:40
I answered all your queries in the original question...Eclipse is irrelevant....Thanks so much for your help! — John Dough, Jun 18 '12 at 19:24
Okay, @WillShah I've posted an updated, updated solution, given the improvements to your question. — Andrew Cheong, Jun 18 '12 at 21:01

Andrew Cheong · Accepted Answer · 2012-06-18T20:58:39.890

Given the clarified question, here's a new solution.

UPDATED

"Slurp" the entire block of data (including the newline characters and all) into a single string, first.

str = IO.read("report.rtf")

Then use this regex:

captures = str.scan(/(?<=^|[\r\n])([A-Z][^:]*):([^\r\n]*(?:[\r\n]+(?![A-Z].*:).*)*)/)

See a live example here: http://rubular.com/r/8w3X6WGq4l.

The answer, explained:

    (?<=                Lookbehind assertion.
        ^                   Start at the beginning of the string,
        |                   or,
        [\r\n]              a new line.
    )
    (                   Capture group 1, the "key".
        [A-Z][^:]*          Capital letter followed by as many non-colon
                            characters as possible.
    )
    :                   The colon character.

    (                   Capture group 2, the "value".
        [^\r\n]*            All characters (i.e. non-newline characters) on the
                            same line belong to the "value," so take them all.

        (?:             Non-capture group.

            [\r\n]+         Having already taken everything up to a newline
                            character, take the newline character(s) now.

            (?!             Negative lookahead assertion.
                [A-Z].*:        If this next line starts with a capital letter,
                                followed by a string of anything then a colon,
                                then it is a new key/value pair, so we do not
                                want to match this case.
            )
            .*              Providing this isn't the case though, take the line!

        )*              And keep taking lines as long as we don't find a
                        key/value pair.
    )
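
As a minimal sketch of how the pattern above behaves on input shaped like the question's example: the sample text and the `parse_pairs` helper are hypothetical names introduced here for illustration. Two assumptions differ from the regex as posted: the lookbehind is replaced by a plain `^` (which in Ruby already matches at the start of every line), and the key is not allowed to span lines. Treat it as one possible arrangement, not the answer's exact code.

```ruby
# Hypothetical helper: returns [key, value] pairs. Each value has its
# surrounding whitespace (including captured newlines) stripped.
def parse_pairs(text)
  # Same idea as the regex above. Assumptions: `^` replaces the lookbehind,
  # and the key may not contain newline characters.
  pair = /^([A-Z][^:\r\n]*):([^\r\n]*(?:[\r\n]+(?![A-Z].*:).*)*)/
  text.scan(pair).map { |key, value| [key, value.strip] }
end

# Hypothetical sample modeled on the data shown in the question.
sample = [
  "Exam 1: CBW 8080",
  "",
  "RESULT:",
  "",
  "This report is dictated with specific measurement. Please see the original report.",
  "",
  "COMPARISON: 1/30/2012, 3/8/12, 4/9/12",
  "",
  "RECIST 1.1: BLAH BLAH BLAH"
].join("\n")

pairs = parse_pairs(sample)
pairs.each { |key, value| puts "#{key} => #{value}" }

# Flatten into the single-array shape asked for in the question,
# putting the colon back on each key.
flat = pairs.flat_map { |key, value| ["#{key}:", value] }
```

Stripping the values is a design choice made here: the regex itself captures the leading space and the newlines between a key and a value that starts on a later line.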

Awesome, this does the job perfectly and you explained it to a T. Thank you so much! Now I just need to get rid of the "invalid byte sequence in UTF-8" error and I'm done! – John Dough, Jun 19 '12 at 15:31
Btw do you know how to get rid of that error? I think it's something to do with encoding :/ – John Dough, Jun 19 '12 at 15:51
No idea, but have you seen this: http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8? — Andrew Cheong, Jun 19 '12 at 16:18
The answer was str = IO.read("report_test").force_encoding("ISO-8859-1").encode("utf-8", replace: nil) if anyone is interested but thanks a lot @acheong87 ! — John Dough, Jun 19 '12 at 19:01
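
The encoding fix from the last comment can be sketched as follows. The byte string is a made-up stand-in for file contents (an assumption; the report's actual encoding is known only from that comment). It shows why relabeling the bytes as ISO-8859-1 and then transcoding gives a string that is valid UTF-8.

```ruby
# Hypothetical bytes standing in for data read from a Latin-1 file:
# "caf" followed by 0xE9, which is "é" in ISO-8859-1 but not valid UTF-8.
raw = "caf\xE9".b

# Labeling those bytes as UTF-8 leaves an invalid string, which is what
# leads to "invalid byte sequence in UTF-8" once a regex is applied to it.
mislabeled = raw.dup.force_encoding("UTF-8")
puts mislabeled.valid_encoding?

# Relabel the bytes as ISO-8859-1 (no bytes change), then transcode to UTF-8.
converted = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts converted.valid_encoding?
puts converted
```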

score 1 · Answer 2 · answered Jun 18 '12 at 18:43

I'm not entirely sure what you're looking for. If you want all occurrences of a capital letter followed by some text and a colon, then you can do:

str.scan(/[A-Z].*?:/)
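
For illustration, here is what that scan returns on a small made-up string (the sample text is an assumption, loosely modeled on the question's data). Only the keys are matched; the values are left out.

```ruby
# Hypothetical sample, loosely modeled on the question's data.
sample = "Exam 1: CBW 8080\nRESULT:\nSome text\nCOMPARISON: 1/30/2012"

# Lazy match from a capital letter to the nearest colon on the same line.
keys = sample.scan(/[A-Z].*?:/)
p keys
```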

brentvatne

This is the best answer so far, I just need to output the other lines as well – John Dough Jun 18 '12 at 19:12

score 0 · Answer 3 · answered Jun 18 '12 at 18:46

This should do it.

/^[A-Z].*:$/
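
A small sketch of how this pattern behaves when applied line by line, which is the use the author describes in the comments. The sample lines are assumptions made for illustration. Only lines that end with a colon are selected, so a key with its value on the same line is not matched.

```ruby
# Hypothetical input that has already been split into lines.
lines = ["Exam 1: CBW 8080", "RESULT:", "Some text"]

# Keep only the lines that start with a capital letter and end with a colon.
headers = lines.grep(/^[A-Z].*:$/)
p headers
```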

Andrew Cheong

This just returns one big array of the whole file without any splits :/ – John Dough Jun 18 '12 at 19:11
Sorry, I thought you meant you already split it by \n and you were going to run each line through another regex--my solution was meant to match against each line. – Andrew Cheong Jun 18 '12 at 19:51
Would that result in the desired output if i split each line and then try this on each line? – John Dough Jun 18 '12 at 19:54
Oh, no, it won't. I just looked at your updated question, with your new example. In your original question it wasn't clear that sometimes the "value" would be on the same line, and other times on separate lines. I will post a new solution so as not to confuse this discussion. – Andrew Cheong Jun 18 '12 at 20:01

u19964 · Answer 4 · 2012-06-18T19:48:58.243

0

The regex can be /(^[A-Z].*\:)/m, and you extract the matches with:

@chunk = @fread.scan(/(^[A-Z].*\:)/m)

provided @fread is a string. You can use http://rubular.com/ for testing regex in ruby.
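
One thing to keep in mind when testing this: in Ruby, the `m` flag makes `.` match newline characters as well, so a greedy `.*` can run across lines up to the last colon in the string. The sketch below compares the pattern with and without the flag on a made-up string (the sample text is an assumption).

```ruby
# Hypothetical sample, loosely modeled on the question's data.
sample = "Exam 1: CBW 8080\nRESULT:\nSome text"

# With /m, `.` also matches "\n", so the match extends to the last colon.
with_m = sample.scan(/(^[A-Z].*\:)/m)

# Without /m, `.*` stops at the end of each line.
without_m = sample.scan(/(^[A-Z].*\:)/)

p with_m
p without_m
```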

edited Jun 18 '12 at 19:48

answered Jun 18 '12 at 18:47

This returns an empty array :/ – John Dough Jun 18 '12 at 19:12
Then you need /(^[A-Z].*\:$)/m , which is scanning across multiple lines. – u19964 Jun 18 '12 at 19:14
Try opening your terminal and running irb, the interactive Ruby shell. You can try that regex on some strings. I am not sure about your input format. – u19964 Jun 18 '12 at 19:24

score 0 · Answer 5 · answered Oct 05 '13 at 16:22

Yet another solution:

input_str.split("\r\n").each do |s|
    # Split on the first ": " only, so a value that contains ": " stays intact.
    var_name, var_value = s.split(": ", 2)
    # do whatever you like
end

Roozbeh Zabihollahi
