
I'm dealing with a messy CSV file that I'm trying to load. `readLines` seems to do the job if I hardcode the row number:

readLines(file_path, n = 31)

What I need is to make the `n` (or `skip`) argument variable, so that my function is more robust.

I need n to correspond to the position of either:

  1. a cell containing a particular string, e.g. Data, or
  2. an empty row.

Not both at the same time; I will use separate calls.

What would be the potential options to achieve this? I can think of `which`, `is.na`, or `grep`, but I don't know how to use them in this particular case.

I know how to clean the file after reading it all in, but I want to avoid that step if possible, by reading only a portion of the file.
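For reference, this is roughly the read-everything-then-filter approach I want to avoid (just a sketch of what I do now):

# current approach: read the whole file, then keep only the part before the "Data" row
all_lines <- readLines(file_path)
header_part <- all_lines[seq_len(grep("Data", all_lines)[1] - 1)]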

Can you think of a solution?

My data is the output of an ETG-4000 fNIRS.

Here's the entire file, as output by `dput()`:

messy_data <- c("Header,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "File Version,1.08,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Patient Information,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"ID,someID,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "Name,someName,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Comment,someComment,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Age,23,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "Sex,Male,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Analyze Information,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"AnalyzeMode,Continuous,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Pre Time[s],20,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Post Time[s],20,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Recovery Time[s],40,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Base Time[s],20,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Fitting Degree,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"HPF[Hz],No Filter,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"LPF[Hz],No Filter,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Moving Average[s],5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Measure Information,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Date,17/12/2016 12:15,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Mode,3x3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "Wave[nm],695,830,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Wave Length,CH1(699.2),CH1(828.2),CH2(697.2),CH2(826.7),CH3(699.2),CH3(828.2),CH4(697.5),CH4(827.8),CH5(697.2),CH5(826.7),CH6(697.5),CH6(827.8),CH7(697.5),CH7(827.8),CH8(698.8),CH8(828.7),CH9(697.5),CH9(827.8),CH10(698.7),CH10(830.2),CH11(698.8),CH11(828.7),CH12(698.7),CH12(830.2),CH13(698.3),CH13(825.7),CH14(697.5),CH14(826.6),CH15(698.3),CH15(825.7),CH16(699.1),CH16(825.9),CH17(697.5),CH17(826.6),CH18(699.1),CH18(825.9),CH19(699.1),CH19(825.9),CH20(698.7),CH20(825.2),CH21(699.1),CH21(825.9),CH22(697.7),CH22(825.7),CH23(698.7),CH23(825.2),CH24(697.7),CH24(825.7)", 
"Analog Gain,6.980392,6.980392,6.980392,6.980392,24.235294,24.235294,6.980392,6.980392,18.745098,18.745098,24.235294,24.235294,18.745098,18.745098,24.235294,24.235294,531.764706,531.764706,18.745098,18.745098,531.764706,531.764706,531.764706,531.764706,42.823529,42.823529,42.823529,42.823529,34.352941,34.352941,42.823529,42.823529,8.54902,8.54902,34.352941,34.352941,8.54902,8.54902,34.352941,34.352941,6.039216,6.039216,8.54902,8.54902,6.039216,6.039216,6.039216,6.039216", 
"Digital Gain,7.67,4.19,7,4.41,7.48,3.02,9.94,5.87,5.05,2.62,8.09,3.83,9.9,5.47,55.48,19.09,9.47,3.27,46.93,19.65,18.88,5.08,41.32,10.19,1.54,0.57,0.39,0.16,1.46,0.37,0.11,0.06,1.2,0.52,0.24,0.08,0.26,0.18,0.27,0.07,0.11,0.06,0.08,0.07,1.17,0.44,0.27,0.21", 
"Sampling Period[s],0.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"StimType,STIM,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Stim Time[s],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"A,45,B,100,C,15,D,15,E,15,F,15,G,15,H,15,I,15,J,15,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Repeat Count,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Exception Ch,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,", 
",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Data,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "Probe1(Total),CH1,CH2,CH3,CH4,CH5,CH6,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH17,CH18,CH19,CH20,CH21,CH22,CH23,CH24,Mark,Time,BodyMovement,RemovalMark,PreScan,,,,,,,,,,,,,,,,,,,"
)
epo3
    It is always good to include a [reproducible example](http://stackoverflow.com/questions/5963269). For now, here is an example of a similar problem: http://stackoverflow.com/questions/37663246/extract-data-between-a-pattern-from-a-text-file-in-r/37665045#37665045 – Jaap Jan 26 '17 at 11:24
    I thought that the question was generic enough that it doesn't need the example csv. I'll attach an output of a `dput()` in a moment. The question that you attached doesn't really answer my question. It only uses `readLines` to load the whole file, and filters it later. – epo3 Jan 26 '17 at 11:28

2 Answers


I think this is most likely a bad idea, in that it's more likely to slow down the process than to speed it up. I can see, though, that if you've got a very large file, a large portion of which can be skipped by doing this, there could be a benefit.

library( readr )
line <- 0L
input <- "start"
while( !grepl( "Data", input ) & input != "" ) {
    line <- line + 1L
    input <- read_lines( file_path, skip = line - 1L, n_max = 1L )
}
line

We read one line at a time. For each line, we check for the text "Data" or a blank line. If either condition is fulfilled, we stop reading, which leaves us with line, a value telling us the first line not to be read in. This way you can then call something like:

df <- read_lines( file_path, n_max = line - 1L )

UPDATE: adding an option to test and read concurrently, as per @konvas's suggestion.

read_with_condition <- function( file, lines.guess = 100L ) {
    line <- 1L
    output <- vector( mode = "character", length = lines.guess )
    input <- "start"
    while( !grepl( "Data", input ) & input != "" ) {
        input <- readr::read_lines( file, skip = line - 1L, n_max = 1L )
        output[line] <- input
        line <- line + 1L
    }
    # discard any unwanted space in the output vector
    # this will also discard the last line to be read in (which failed the test)
    output <- output[ seq_len( line - 2L ) ]
    cat( paste0( "Stopped reading at line ", line - 1L, ".\n" ) )
    return( output )
}

new <- read_with_condition( file_path, lines.guess = 100L )

So here we are testing the input condition and writing the input line to an object at the same time. You can preallocate space in the output vector with lines.guess (a good guess should speed up the processing; be generous rather than conservative here), and any excess will be cleaned up at the end. Note this is a function, so the last line (new <- ...) shows how to call it.
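To try this out, you could for example write the messy_data vector from the question to a temporary file and run the function on that:

# write the example data to a temporary file, then run the function on it
tmp <- tempfile( fileext = ".csv" )
writeLines( messy_data, tmp )
header <- read_with_condition( tmp, lines.guess = 100L )
length( header )  # number of lines kept before the "Data" row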

rosscova
  • The first part finding `line` is perfect. However, using `read_csv()` results in a df that only has two columns. This is correct for the first 22 rows but the later part of my file has more columns (it's messy, I warned you ;) ). I will use `line` with `read_lines` instead of `read_csv`. Thanks. – epo3 Jan 26 '17 at 14:41

readr comes with a function read_lines_chunked which facilitates reading large files, but does not have an option to break out of the function when a condition is met.

I can see three possibilities to achieve your goal:

1) Read the whole file, keep only the desired rows - I realise this is probably not an option for you, otherwise you wouldn't be posting the question :)

lines <- readr::read_lines(file_path)
lines <- lines[seq(1, grep("Data", lines)[1] - 1)]
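The empty-row condition from the question works the same way; a sketch, assuming a "blank" row comes through as either an empty string or a comma-only line:

# variant: keep everything before the first blank (or comma-only) row
all_lines <- readr::read_lines(file_path)
first_blank <- grep("^,*$", all_lines)[1]
lines <- all_lines[seq_len(first_blank - 1)]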

2) Do a first pass over the file to find n and then a second pass to read up to that value. One way to do this is @rosscova's answer, another would be to use some external tool like gnu grep (a sketch of this follows below), and a third way would be to use read_lines_chunked from readr, like this:

n <- tryCatch(
    readr::read_lines_chunked(
        file = file_path, 
        callback = readr::DataFrameCallback$new(
            function(x, pos) {
                if (grepl("Data", x)) stop(pos - 1)
            }
        ), 
        chunk_size = 1
    ), 
    error = function(e) as.numeric(e$message)
) 
lines <- readLines(file_path, n = n)
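For completeness, the gnu grep route mentioned above could look roughly like this on a Unix-like system (a sketch; it assumes grep is available on the PATH):

# ask system grep for the (1-based) number of the first line containing "Data"
hit <- system2("grep", c("-n", "-m", "1", "Data", shQuote(file_path)), stdout = TRUE)
n <- as.integer(sub(":.*", "", hit))
lines <- readLines(file_path, n = n - 1)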

3) Go through the file only once, saving each line until you meet the condition. To do this you can modify @rosscova's script accordingly (to save "input" to a variable) or again use read_lines_chunked:

lines <- character(1e6) # pre-allocate some space, depending on how 
                        # many lines you are expecting to get

# Define a callback function to read a line and save it; if it meets
# the condition, it breaks by throwing an error
cb <- function(x, pos) {
    if (grepl("Data", x)) {
        # condition met, save only lines up to the current one and break
        lines <<- lines[seq(pos - 1)]
        stop(paste("Stopped reading on line", pos))
    }
    lines[[pos]] <<- x # condition not met yet, save the current line
}

# now call the above in read_lines_chunked
# need to wrap in tryCatch to handle the error 
tryCatch(
    readr::read_lines_chunked(
        file = file_path, 
        callback = readr::DataFrameCallback$new(cb), 
        chunk_size = 1
    ), 
    error = identity
)

In general this involves some bad practice, including the use of <<-, so use with care!

All of the above can be done with data.table::fread as well, which is supposed to be faster than readr.
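For example, once n has been found, the final read could be done with fread instead (a rough sketch, not tested on this particular file; fill = TRUE pads the ragged rows):

# read only the first n - 1 lines with data.table, padding rows that have fewer columns
library(data.table)
dt <- fread(file_path, nrows = n - 1, header = FALSE, fill = TRUE)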

Method 1 will definitely be the fastest for small files.

Would be great if you could benchmark some of these on your large files and let us know which is the fastest!
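A rough template for such a benchmark, assuming the microbenchmark package is installed (add the other methods as extra named expressions):

library(microbenchmark)
microbenchmark(
    whole_file = {
        l <- readr::read_lines(file_path)
        l[seq(1, grep("Data", l)[1] - 1)]
    },
    times = 10
)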

konvas
    "you can modify @rosscova 's script accordingly", good idea on writing to the object alongside the input condition. I'll add a version to my answer. I'd rather avoid involving `<<-` and `tryCatch` though. – rosscova Jan 26 '17 at 22:38