How to find an expression in a text file and process all lines until the next occurrence of the expression and repeat until end of the file

Question

I have a text file:

Some comment on the 1st line of the file.

processing date:         31.8.2016
amount:                  -1.23
currency:                EUR
balance:                 1234.56
payer reference:         /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 1
additional info:         Amount: 1.23 EUR 29.08.2016 Place: 123456789XY



processing date:         30.8.2016
amount:                  -2.23
currency:                EUR
balance:                 12345.56
payer reference:         /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 2
additional info:         Amount: 2.23 EUR 28.08.2016 Place: 123456789XY



processing date:         29.8.2016
amount:                  -3.23
currency:                EUR
balance:                 123456.56
payer reference:         /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 2
additional info:         Amount: 2.23 EUR 27.08.2016 Place: 123456789XY

I need to process the file so that the values on the right side (31.8.2016, -1.23, EUR, 1234.56, etc.) end up stored in a MySQL database.

So far I have only managed to return either one occurrence of a line that contains a particular string, or all such lines, using find or find_all. That is not sufficient: I need to identify each block starting with "processing date:" and ending with "additional info:", process the values in it, then move on to the next block, and so on until the end of the file.

Any hints on how to achieve this?

That could be done in several ways, but the simplest is to read the whole file as a string and call `.split(/^processing date/)` on it. You will get a list of segments, each starting with the date and ending with the empty lines that precede the next item. It is simple, but may fail if your files are *large*, like gigabytes. — quetzalcoatl, Sep 19 '16 at 14:53
Your question is premature. You need to try, then when you run into a problem write a detailed question about that specific problem. Please read "[ask]" and the linked pages, and "[mcve]". Also "[How much research effort is expected of Stack Overflow users?](http://meta.stackoverflow.com/a/261593/128421)" will help you understand what we expect. — the Tin Man, Sep 19 '16 at 23:33

the Tin Man · Accepted Answer · 2016-09-19T23:54:59.680

I'd start with this:

File.foreach('data.txt', "\n\n") do |li|
  next unless li[/^processing/]
  puts "'#{li.strip}'"
end

If "data.txt" contains your content, foreach will read the file and yield paragraphs, not lines, of text in li. Once you have those you can manipulate them as you need. Because only one paragraph is held in memory at a time, this is fast and efficient and doesn't have the scalability problems that readlines or any read-based I/O could have on large files.

This is the output:

'processing date:         31.8.2016
amount:                  -1.23
currency:                EUR
balance:                 1234.56
payer reference:         /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 1
additional info:         Amount: 1.23 EUR 29.08.2016 Place: 123456789XY'
'processing date:         30.8.2016
amount:                  -2.23
currency:                EUR
balance:                 12345.56
payer reference:         /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 2
additional info:         Amount: 2.23 EUR 28.08.2016 Place: 123456789XY'
'processing date:         29.8.2016
amount:                  -3.23
currency:                EUR
balance:                 123456.56
payer reference:         /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 2
additional info:         Amount: 2.23 EUR 27.08.2016 Place: 123456789XY'

The wrapping ' characters show that the file is being read in chunks, or paragraphs, delimited by "\n\n", and that each chunk is then stripped to remove surrounding whitespace.

See the foreach documentation for more information.
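If the number of blank lines between records varies, paragraph mode may be a better fit: passing an empty string as the separator makes foreach treat any run of blank lines as a single record break. The following is only a sketch. It writes made-up sample content to a temporary file that stands in for data.txt, and the variable names are illustrative.

```ruby
require 'tempfile'

# Hypothetical sample data: records separated by one or by three blank lines.
sample = "A comment line.\n\n" \
         "processing date: 31.8.2016\namount: -1.23\n\n\n\n" \
         "processing date: 30.8.2016\namount: -2.23\n"

records = []
Tempfile.create('data') do |f|
  f.write(sample)
  f.flush

  # An empty separator selects paragraph mode: consecutive blank lines
  # count as a single record break.
  File.foreach(f.path, '') do |paragraph|
    next unless paragraph.start_with?('processing date:')
    records << paragraph.strip
  end
end

puts records.length
```

The leading comment line forms its own paragraph and is skipped by the start_with? check, just as the regex filter skips it in the version above.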

split(':', 2) is your friend. The limit of 2 splits only at the first colon, so colons inside the value are left alone:

'processing date:         31.8.2016'.split(':', 2) # => ["processing date", "         31.8.2016"]
'amount:                  -1.23'.split(':', 2) # => ["amount", "                  -1.23"]
'currency:                EUR'.split(':', 2) # => ["currency", "                EUR"]
'balance:                 1234.56'.split(':', 2) # => ["balance", "                 1234.56"]
'payer reference:         /VS123456/SS0011223344/KS1212'.split(':', 2) # => ["payer reference", "         /VS123456/SS0011223344/KS1212"]
'type of the transaction: Some type of the transaction 1'.split(':', 2) # => ["type of the transaction", " Some type of the transaction 1"]
'additional info:         Amount: 1.23 EUR 29.08.2016 Place: 123456789XY'.split(':', 2) # => ["additional info", "         Amount: 1.23 EUR 29.08.2016 Place: 123456789XY"]

From that you can do:

text = 'processing date:         31.8.2016
amount:                  -1.23
currency:                EUR
balance:                 1234.56
payer reference:         /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 1
additional info:         Amount: 1.23 EUR 29.08.2016 Place: 123456789XY'

text.lines.map{ |li| li.split(':', 2).map(&:strip) }.to_h
# => {"processing date"=>"31.8.2016", "amount"=>"-1.23", "currency"=>"EUR", "balance"=>"1234.56", "payer reference"=>"/VS123456/SS0011223344/KS1212", "type of the transaction"=>"Some type of the transaction 1", "additional info"=>"Amount: 1.23 EUR 29.08.2016 Place: 123456789XY"}

There are a number of ways to continue parsing the information into more usable data but that's for you to figure out.
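As one possible continuation, the string values can be converted to richer types before they are handed to a database layer. This is a minimal sketch, not a definitive implementation: parse_record and typecast are hypothetical helper names, the choice of Date and BigDecimal is an assumption about how the columns would be typed, and the sample paragraph is a shortened copy of the first record.

```ruby
require 'date'
require 'bigdecimal'

# Hypothetical helper: turn one paragraph of "key: value" lines into a hash.
def parse_record(paragraph)
  paragraph.strip.lines.map { |li| li.split(':', 2).map(&:strip) }.to_h
end

# Hypothetical helper: convert selected string values to richer types.
# Assumes dates are day.month.year and amounts use a decimal point.
def typecast(record)
  record.merge(
    'processing date' => Date.strptime(record['processing date'], '%d.%m.%Y'),
    'amount'          => BigDecimal(record['amount']),
    'balance'         => BigDecimal(record['balance'])
  )
end

paragraph = <<~TEXT
  processing date:         31.8.2016
  amount:                  -1.23
  currency:                EUR
  balance:                 1234.56
  additional info:         Amount: 1.23 EUR 29.08.2016 Place: 123456789XY
TEXT

row = typecast(parse_record(paragraph))
puts row['processing date'].iso8601
puts row['amount'].to_s('F')
```

Each paragraph yielded by foreach could be passed through the same two helpers, and the resulting hash given to whichever MySQL client you use. The insert itself is left out here because it depends on that client and on your table layout.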
