
I have a text file that contains multi-line blocks like this:

2011/01/01 13:13:13,<AB>, Some Certain Text,=,
[    
certain text
         [
                  0: 0 0 0 0 0 0 0 0 
                  8: 0 0 0 0 0 0 0 0 
                 16: 0 0 0 9 343 3938 9433 8756 
                 24: 6270 4472 3182 2503 1768 1140 836 496 
                 32: 326 273 349 269 144 121 94 82 
                 40: 64 80 66 59 56 47 50 46 
                 48: 64 35 42 53 42 40 41 34 
                 56: 35 41 39 39 47 30 30 39 
                 Total count: 12345
        ]
    certain text
]
some text
2011/01/01 14:14:14,<AB>, Some Certain Text,=,
[
 certain text
   [
              0: 0 0 0 0 0 0 0 0 
              8: 0 0 0 0 0 0 0 0 
             16: 0 0 0 4 212 3079 8890 8941 
             24: 6177 4359 3625 2420 1639 974 594 438 
             32: 323 286 318 296 206 132 96 85 
             40: 65 73 62 53 47 55 49 52 
             48: 29 44 44 41 43 36 50 36 
             56: 40 30 29 40 35 30 25 31 
             64: 47 31 25 29 24 30 35 31 
             72: 28 31 17 37 35 30 20 33 
             80: 28 20 37 25 21 23 25 36 
             88: 27 35 22 23 15 24 34 28
             Total count: 123456 
    ]
    certain text
some text
]

These variable-length blocks sit between other text. I want to read all the numbers after the `:` on each row and keep them in one array per block. In this case, there will be two arrays:

array1 = { 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 343 3938 9433 8756 6270 4472 3182 2503 1768 1140 836 496 326 273 349 269 144 121 94 82 64 80 66 59 56 47 50 46 64 35 42 53 42 40 41 34 35 41 39 39 47 30 30 39 12345 }

array2 = { 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 212 3079 8890 8941 6177 4359 3625 2420 1639 974 594 438 323 286 318 296 206 132 96 85 65 73 62 53 47 55 49 52 29 44 44 41 43 36 50 36 40 30 29 40 35 30 25 31 47 31 25 29 24 30 35 31 28 31 17 37 35 30 20 33 28 20 37 25 21 23 25 36 27 35 22 23 15 24 34 28 123456 }

I found that LPeg may be a lightweight way to achieve this, but I'm totally new to PEGs and LPeg. Please help!

Decula

5 Answers


LPEG version:

local lpeg            = require "lpeg"
local lpegmatch       = lpeg.match
local C, Ct, P, R, S  = lpeg.C, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S
local Cg              = lpeg.Cg

local data_to_arrays

do
  local colon    = P":"
  local lbrak    = P"["
  local rbrak    = P"]"
  local digits   = R"09"^1
  local eol      = P"\n\r" + P"\r\n" + P"\n" + P"\r"
  local ws       = S" \t\v"
  local optws    = ws^0
  local getnum   = C(digits) / tonumber * optws
  local start    = lbrak * optws * eol
  local stop     = optws * rbrak
  local line     = optws * digits * colon * optws
                 * getnum * getnum * getnum * getnum
                 * getnum * getnum * getnum * getnum
                 * eol
  local count    = optws * P"Total count:" * optws * getnum * eol
  local inner    = Ct(line^1 * count^-1)
--local inner    = Ct(line^1 * Cg(count, "count")^-1)
  local array    = start * inner * stop
  local extract  = Ct((array + 1)^0)

  data_to_arrays = function (data)
    return lpegmatch (extract, data)
  end
end

This actually works only if there are exactly eight integers on each line of the data block. Depending on how well formed your input is this may be a curse or a blessing ;-)
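If the rows cannot be relied on to hold exactly eight integers, one possible relaxation is to accept one or more numbers per row. The following is only a sketch of that variation, not a replacement tested against the real logs; the names `row`, `block` and `extract_blocks` are made up here for illustration.

```lua
local lpeg = require "lpeg"
local C, Ct, P, R, S = lpeg.C, lpeg.Ct, lpeg.P, lpeg.R, lpeg.S

local digits  = R"09"^1
local ws      = S" \t"^0
local eol     = P"\r\n" + P"\n" + P"\r"
local num     = (C(digits) / tonumber) * ws
-- a row is "<offset>: n n n ..." with at least one number after the colon
local row     = ws * digits * P":" * ws * num^1 * eol
local count   = ws * P"Total count:" * ws * num * eol
-- a data block: "[" on its own line, rows, optional count, closing "]"
local block   = P"[" * ws * eol * Ct(row^1 * count^-1) * ws * P"]"
-- skip any character that does not start a data block
local extract = Ct((block + 1)^0)

-- hypothetical helper: returns a list of arrays, one per data block
local function extract_blocks(text)
  return extract:match(text)
end
```

The trade-off is that a malformed row (for example one that was truncated) is now accepted silently instead of making the block fail to match.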

And a test file:

data = [[
some text
[    
some text
         [
                  0: 0 0 0 0 0 0 0 0 
                  8: 0 0 0 0 0 0 0 0 
                 16: 0 0 0 9 343 3938 9433 8756 
                 24: 6270 4472 3182 2503 1768 1140 836 496 
                 32: 326 273 349 269 144 121 94 82 
                 40: 64 80 66 59 56 47 50 46 
                 48: 64 35 42 53 42 40 41 34 
                 56: 35 41 39 39 47 30 30 39 
                 Total count: 12345
        ]
    some text
]
some text
[
 some text
   [
              0: 0 0 0 0 0 0 0 0 
              8: 0 0 0 0 0 0 0 0 
             16: 0 0 0 4 212 3079 8890 8941 
             24: 6177 4359 3625 2420 1639 974 594 438 
             32: 323 286 318 296 206 132 96 85 
             40: 65 73 62 53 47 55 49 52 
             48: 29 44 44 41 43 36 50 36 
             56: 40 30 29 40 35 30 25 31 
             64: 47 31 25 29 24 30 35 31 
             72: 28 31 17 37 35 30 20 33 
             80: 28 20 37 25 21 23 25 36 
             88: 27 35 22 23 15 24 34 28 
    ]
    some text
some text
]
]]

local arrays = data_to_arrays (data)

for n = 1, #arrays do
  local ar   = arrays[n]
  local size = #ar
  io.write (string.format ("[%d] = { --[[size: %d items]]\n  ", n, size))
  for i = 1, size do
    io.write (string.format ("%d,%s", ar[i], (i % 5 == 0) and "\n  " or " "))
  end
  if ar.count ~= nil then
    io.write (string.format ("\n  [\"count\"] = %d,", ar.count))
  end
  io.write (string.format ("\n}\n"))
end
Philipp Gesang
  • Hi @phg , yes, the input array is exact 8 integers per line. But this text file is more than 100 MB. How could I read the file in? I tried local assert(io.open(filepath)). It fails to read file as string.. Shall I read whole file as string? – Decula Oct 16 '13 at 20:56
  • ``f = io.open(filename, "r") if f then data = f:read"*all" f:close() end`` will read everything into memory. If that doesn’t work you may have to process the file in chunks. – Philipp Gesang Oct 16 '13 at 21:12
  • Hi @phg, yes, this reads the file content fine. But in the real text file the layout is `[some text[data array]some text]`, and your LPeg pattern doesn't work then. I failed to modify it to match this case. Could you help me with that? – Decula Oct 16 '13 at 21:41
  • @Decula Your updated example parses fine here. Can you post the part that doesn’t work? – Philipp Gesang Oct 16 '13 at 21:52
  • I just realize my data array has two different types. One has total count there..... I added ` local lower = R"az"^1 local upper = R"AZ"^1 local words = lower+upper` and change `local line = optws * digits * colon * optws * getnum * getnum * getnum * getnum * getnum * getnum * getnum * getnum * eol * letter`.. It still doesn't work. I want just add total count to the end of the array – Decula Oct 16 '13 at 22:31
  • @Decula Updated (also fixed a duplication). This will match the exact phrase “Total count:” and then extract the integer and append it. Though I suggest you use the commented alternative definition of the rule *count*. This puts the count not at the end of the array -- where it would be indistinguishable from all the other values -- but in a separate field *count* on the hash part of the table. – Philipp Gesang Oct 16 '13 at 23:08
  • It works great. And I will take commented definition. Where could I get more comprehensive tutorial for LPeg? The doc in the website is limited. – Decula Oct 17 '13 at 01:12
  • @Decula I don’t know of any tutorials. What I can recommend is reading LPEG code written by others to “see how it’s done”. After some practice you’ll see that the reference documentation is actually quite good -- as a reference, that is. – Philipp Gesang Oct 17 '13 at 11:47
  • Sorry, I updated sample log 1 again. I want to read the timestamp of each data array, so I can store the data in a database with three columns: timestamp, data array and total count. I added a definition `local timestamp = P"(%d+)/(%d+)/(%d+) (%d+):(%d+):(%d+)"` and redefined start, but my attempt failed. Please help. – Decula Oct 17 '13 at 18:30
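
For the points raised in the comments above (a file of more than 100 MB, plus the wish to keep the timestamp and the total count next to each array), one option that avoids loading everything into memory is to scan the input line by line. This is a rough sketch using only the standard string library rather than LPeg; `parse_lines` and the record fields `timestamp`, `values` and `count` are names invented here, and the timestamp pattern assumes the `YYYY/MM/DD hh:mm:ss,` layout shown in the question.

```lua
-- lines: any iterator that yields one line per call, e.g. io.lines(path)
local function parse_lines(lines)
  local records = {}
  local current      -- record being filled, or nil outside a data block
  local last_stamp   -- most recent timestamp seen so far

  local function finish()
    if current then
      records[#records + 1] = current
      current = nil
    end
  end

  for line in lines do
    local stamp = line:match("^(%d+/%d+/%d+ %d+:%d+:%d+),")
    local row   = line:match("^%s*%d+:%s*(.*)$")
    local count = line:match("^%s*Total count:%s*(%d+)")
    if stamp then
      last_stamp = stamp
    elseif row then
      -- first row of a block opens a new record
      current = current or { timestamp = last_stamp, values = {} }
      for n in row:gmatch("%d+") do
        current.values[#current.values + 1] = tonumber(n)
      end
    elseif count then
      if current then current.count = tonumber(count) end
    elseif line:match("^%s*%]") then
      finish()         -- a closing bracket ends the current block, if any
    end
  end
  finish()             -- in case the input ends inside a block
  return records
end

-- usage (the file name is hypothetical):
-- local records = parse_lines(io.lines("log.txt"))
```

Because the total count lands in its own field, it stays distinguishable from the row values, in the same spirit as the commented `Cg(count, "count")` alternative in the answer. Note that any line outside a block that happens to look like `<digits>: ...` would also be picked up, so the row test may need tightening for real logs.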

My pure Lua string library solution would be something like this:

-- file_content is assumed to hold the whole text as a string
local bracket_pattern = "%b[]"      -- matches a balanced [ ... ] group
local number_pattern  = "(%d+)%s+"  -- a number followed by whitespace (skips the "0:" row labels)
local output_array = {}             -- output: one array per [ ... ] group

for tmp_sub_str in file_content:gmatch(bracket_pattern) do --iterating through [string]
    local group = {}
    for tmp_number in tmp_sub_str:gmatch(number_pattern) do --iterating through numberWHITESPACE
        group[#group + 1] = tonumber(tmp_number)
    end
    output_array[#output_array + 1] = group --adding the finished [string] group
end

EDIT: This works properly with the updated file format as well.

Kamiccolo

Try this code, which does not use LPeg:

-- assume T contains the text
local a={}
local i=0
for b in T:gmatch("%b[]") do
        b=b:gsub("%d+:","")
        i=i+1
        local t={}
        local j=0
        for n in b:gmatch("%d+") do
                j=j+1; t[j]=tonumber(n)
        end
        a[i]=t
end
lhf
  • Hi @lhf , Actually, the text file is [some text[data array]some text]; %b[] will capture outside []. How to capture inside data array? – Decula Oct 16 '13 at 20:53
  • %b[] works great... But I really want to learn LPeg for other cases. – Decula Oct 16 '13 at 21:46
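
On the nested `[some text[data array]some text]` case raised in the comments: `%b[]` returns the outermost balanced pair, so the surrounding text comes along with the data. One way to restrict the match to the inner data array, sketched here with the made-up helper `inner_arrays`, is to match only bracket pairs that contain no further brackets. This assumes the data arrays are always the innermost bracket level.

```lua
-- T is the text, as in the answer above
local function inner_arrays(T)
  local a = {}
  -- "%[([^%[%]]*)%]" matches "[" ... "]" with no brackets in between,
  -- so only the innermost level is captured
  for b in T:gmatch("%[([^%[%]]*)%]") do
    b = b:gsub("%d+:", "")        -- drop the "0:", "8:", ... row labels
    local t = {}
    for n in b:gmatch("%d+") do
      t[#t + 1] = tonumber(n)
    end
    a[#a + 1] = t
  end
  return a
end
```

A side effect is that digits in the text between the outer and inner brackets are no longer collected, which the plain `%b[]` loop would have included.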

phg already provided a nice LPeg solution for your question, but here's another one using LPeg's re module. Its syntax is closer to BNF and its operators are more regex-like, so this solution may be easier to grok.

local re = require 're'

function dump(t)
  io.write '{'
  for _, v in ipairs(t) do
    io.write(v, ',')
  end
  io.write '}\n'
end

local textformat = [[
  data_in   <-  block+
  block     <-  text '[' block_content ']'
  block_content <- {| data_arr |} / (block / text)*
  data_arr  <- (text ':' nums whitesp)+
  text      <- whitesp [%w' ']+ whitesp
  nums      <- (' '+ {digits} -> tonumber)+
  digits    <- %d+
  whitesp   <- %s*
]]
local parser = re.compile(textformat, {tonumber = tonumber})
local arr1, arr2 = parser:match(data)

dump(arr1)
dump(arr2)

Each block of data array gets captured into a separate table and returned as one of the outputs by match.

With data being the same input as above, two blocks are matched and captured and so 2 tables are returned. Inspecting these two tables gives:

{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,343,3938,9433,8756,6270,4472,3182,2503,1768,1140,836,496,326,273,349,269,144,121,94,82,64,80,66,59,56,47,50,46,64,35,42,53,42,40,41,34,35,41,39,39,47,30,30,39,12345,}
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,212,3079,8890,8941,6177,4359,3625,2420,1639,974,594,438,323,286,318,296,206,132,96,85,65,73,62,53,47,55,49,52,29,44,44,41,43,36,50,36,40,30,29,40,35,30,25,31,47,31,25,29,24,30,35,31,28,31,17,37,35,30,20,33,28,20,37,25,21,23,25,36,27,35,22,23,15,24,34,28,}

greatwolf
  • I found [this](http://fanf.livejournal.com/97572.html) deal with timestamp with BNF style LPeg... But I failed to implement it. – Decula Oct 17 '13 at 18:46
  • @Decula Note the above grammar is only a rough approximation of how to parse the input, based on eyeballing the input from your original question. Since you know better the extent of the format being parsed, you should refine the grammar to better match it. – greatwolf Oct 17 '13 at 20:44
  • Hi @greatwolf , I'm still struggling with BNF syntax... Actually we have tons of strange logs, images to process.. And we already have a Googler as contractor in Milpitas. We really need a C,C++&Lua expert like you. I found you answered most of my questions. If you interested, please drop me a [email](http://zhusiyao@gmail.com) – Decula Oct 17 '13 at 20:58

I know this is a late reply, but a much smaller grammar is enough. The following pattern finds an opening [ and captures every number that is not followed by : until a closing ] is reached, then repeats the whole block until nothing more matches.

local patt = re.compile([=[
    data    <- {| block |}+
    block   <- ('[' ((%d+ ':') / { %d+ } -> int / [^]%d]+)+ ']') / ([^[]+ block)
]=], { int = tonumber })

You can capture all recovered arrays at once in a table with something like this:

local a = { patt:match[=[ ... ]=] }
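
As a rough illustration of how that call might be used, the sketch below runs the same pattern over a small made-up sample and collects every returned table. The `sample` string is invented for this example and is not the asker's real log.

```lua
local re = require "re"

local patt = re.compile([=[
    data    <- {| block |}+
    block   <- ('[' ((%d+ ':') / { %d+ } -> int / [^]%d]+)+ ']') / ([^[]+ block)
]=], { int = tonumber })

-- invented sample with the nested [ text [ data ] text ] layout
local sample = [=[
some text
[
 some text
   [
      0: 0 0 9 343
      8: 64 80 66 59
      Total count: 12345
   ]
 some text
]
]=]

-- match returns one table per data block; wrap them all in a list
local arrays = { patt:match(sample) }
for i, arr in ipairs(arrays) do
  print(i, table.concat(arr, " "))
end
```

Since `[^]%d]+` also consumes a nested opening `[`, the block ends at the first closing `]` it meets, which is the inner one in this layout. Any digits in the text between the outer `[` and the inner data would be captured too, so this relies on that text being digit-free.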
wqw