I'm currently working on a project in Ruby on Rails (in Eclipse) and my task is to split up a block of data into relevant parts using Regular Expressions.

I've decided to break up the data based on two criteria:

  1. The line must start with a capital letter (RegEx equivalent - /^[A-Z]/)
  2. It must end with a : (RegEx equivalent - /:$/)

I would appreciate any help. The code I'm using in my controller is:

@f = File.open("report.rtf")  
@fread = @f.read  
@chunk = @fread.split(/\n/)

where @chunk is the array that will be created by the split and @fread is the data that is being split up (by new lines).

Any help will be appreciated, thanks a lot!

I cannot release the exact data, but it basically looks like this (it is medically related):

Exam 1: CBW 8080

RESULT:

This report is dictated with specific measurement. Please see the original report.

COMPARISON: 1/30/2012, 3/8/12, 4/9/12

RECIST 1.1: BLAH BLAH BLAH

The ideal output would be an array that says:

["Exam 1:", "CBW 8080", "RESULT:", "This report is dictated with specific measurement. Please see the original report.", "COMPARISON:", "1/30/2012, 3/8/12, 4/9/12", "RECIST 1.1:", "BLAH BLAH BLAH"]

PS I'm just using \n as a placeholder until I get it working

Andrew Cheong
John Dough

5 Answers

Given the clarified question, here's a new solution.

UPDATED

"Slurp" the entire block of data (including the newline characters and all) into a single string, first.

str = IO.read("report.rtf")

Then use this regex:

captures = str.scan(/(?<=^|[\r\n])([A-Z][^:]*):([^\r\n]*(?:[\r\n]+(?![A-Z].*:).*)*)/)

See a live example here: http://rubular.com/r/8w3X6WGq4l.

The answer, explained:

    (?<=                Lookbehind assertion.
        ^                   Start at the beginning of the string,
        |                   or,
        [\r\n]              a new line.
    )
    (                   Capture group 1, the "key".
        [A-Z][^:]*          Capital letter followed by as many non-colon
                            characters as possible.
    )
    :                   The colon character.

    (                   Capture group 2, the "value".
        [^\r\n]*            All characters (i.e. non-newline characters) on the
                            same line belong to the "value," so take them all.

        (?:             Non-capture group.

            [\r\n]+         Having already taken everything up to a newline
                            character, take the newline character(s) now.

            (?!             Negative lookahead assertion.
                [A-Z].*:        If this next line starts with a capital letter,
                                followed by a string of anything then a colon,
                                then it is a new key/value pair, so we do not
                                want to match this case.
            )
            .*              Providing this isn't the case though, take the line!

        )*              And keep taking lines as long as we don't find a
                        key/value pair.
    )
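
To see the approach end to end, here is a minimal sketch that runs a simplified variant of the regex against the sample from the question and flattens the captures into the array the asker described. The method name `split_report` and the inline sample are assumptions for illustration only. The variant uses `^` instead of the lookbehind (in Ruby, `^` already matches at the start of every line) and keeps each key on a single line; treat it as a starting point rather than a drop-in replacement.

```ruby
# Hypothetical helper, not part of the original answer.
def split_report(str)
  # Group 1 is the key (capital letter up to the first colon on that line).
  # Group 2 is the value: the rest of the line plus any following lines
  # that do not themselves look like "Capital ... :".
  pattern = /^([A-Z][^:\r\n]*):([^\r\n]*(?:[\r\n]+(?![A-Z].*:).*)*)/

  # scan returns [key, value] pairs; re-attach the colon and trim whitespace.
  str.scan(pattern).flat_map { |key, value| ["#{key}:", value.strip] }
end

sample = <<~REPORT
  Exam 1: CBW 8080

  RESULT:

  This report is dictated with specific measurement. Please see the original report.

  COMPARISON: 1/30/2012, 3/8/12, 4/9/12

  RECIST 1.1: BLAH BLAH BLAH
REPORT

p split_report(sample)
```

The `strip` call matters because the captured values keep their leading space and any newlines that were swallowed before the next key.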
Andrew Cheong
  • Awesome, this does the job perfectly and you explained it to a T. Thank you so much! Now I just need to get rid of the "invalid byte sequence in UTF-8" error and I'm done! – John Dough Jun 19 '12 at 15:31
  • By the way, do you know how to get rid of that error? I think it's something to do with encoding :/ – John Dough Jun 19 '12 at 15:51
  • No idea, but have you seen this: http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8? – Andrew Cheong Jun 19 '12 at 16:18
  • The answer was str = IO.read("report_test").force_encoding("ISO-8859-1").encode("utf-8", replace: nil) if anyone is interested but thanks a lot @acheong87 ! – John Dough Jun 19 '12 at 19:01
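
For anyone hitting the same error, here is a small sketch of why the fix in the last comment works. The byte string below is a made-up stand-in for file contents, and ISO-8859-1 being the file's real encoding is an assumption carried over from that comment; if the file is in some other encoding, the label has to change accordingly.

```ruby
# Stand-in for bytes read from a Latin-1 file: "cafe" with an accented e
# stored as the single byte 0xE9 (not valid on its own in UTF-8).
raw = "caf\xE9".b

# Interpreted as UTF-8, these bytes are invalid, which is what makes
# regex methods such as scan raise "invalid byte sequence in UTF-8".
puts raw.dup.force_encoding("UTF-8").valid_encoding?   # false

# Relabel the bytes with their actual encoding, then transcode to UTF-8.
text = raw.force_encoding("ISO-8859-1").encode("UTF-8")
puts text.valid_encoding?                              # true
```

`force_encoding` only changes the label on the bytes, while `encode` actually converts them, so the order of the two calls matters.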

I'm not entirely sure what you're looking for. If you want all occurrences of a capital letter followed by some text and a colon, then you can do:

str.scan(/[A-Z].*?:/)
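
As a quick check of what this returns, here is a sketch run against a shortened version of the sample from the question (the `sample` string is an assumption for illustration). Note that it collects only the labels, not the text that follows them.

```ruby
# Shortened, made-up stand-in for the report text in the question.
sample = "Exam 1: CBW 8080\n\nRESULT:\n\nSee the original report.\n\nCOMPARISON: 1/30/2012\n"

# Each match starts at a capital letter and stops at the first colon on the
# same line, because "." does not match a newline without the /m flag.
keys = sample.scan(/[A-Z].*?:/)
p keys
```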
brentvatne

This should do it.

/^[A-Z].*:$/
Andrew Cheong
  • This just returns one big array of the whole file without any splits :/ – John Dough Jun 18 '12 at 19:11
  • Sorry, I thought you meant you already split it by ``\n`` and you were going to run each line through another regex--my solution was meant to match against each line. – Andrew Cheong Jun 18 '12 at 19:51
  • Would that result in the desired output if I split each line and then try this on each line? – John Dough Jun 18 '12 at 19:54
  • Oh, no, it won't. I just looked at your updated question, with your new example. In your original question it wasn't clear that sometimes the "value" would be on the same line, and other times on separate lines. I will post a new solution so as not to confuse this discussion. – Andrew Cheong Jun 18 '12 at 20:01

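The regex can be: /^[A-Z].*?:/ (no /m flag is needed: in Ruby, ^ already matches at the start of every line, and /m would let . run across newlines and swallow most of the file). You extract the matches with:

@chunk = @fread.scan(/^[A-Z].*?:/)

provided @fread is a string. You can use http://rubular.com/ for testing regexes in Ruby.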

u19964

Yet another solution:

input_str.split("\r\n").each do |s|
    # Split on the first ": " only, so a value that contains ": " stays intact.
    var_name, var_value = s.split(": ", 2)
    # do whatever you like
end
Roozbeh Zabihollahi