
I have a folder of files in .csv format. They have blank lines in them that are necessary (a blank line indicates the absence of a measure from a LiDAR unit, which is meaningful and needs to stay in). But occasionally the first row is empty, which throws off the code and the package, and everything aborts.

Right now I have to open each .csv and see if the first line is empty.

I would like to do one of the following, but am at a loss how:

1) write code that quickly scans through all of the files in the directory and tells me which ones have an empty first line

2) be able to skip the empty lines that occur only at the beginning (the number varies; sometimes more than one line is empty)

3) have code that cycles through all of the .csv files and inserts a dummy first line of numbers so the files all import without a problem.

Thanks!

Jeff
  • What have you tried so far? Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Sep 27 '17 at 16:39
  • Rather than read each file with `read.csv` you could use `readLines` to read in, say, 10 lines, count how many blank lines are at the start, and then use `read.csv` telling it to skip the appropriate number of lines. – Andrew Gustar Sep 27 '17 at 16:50
  • Just saw @AndrewGustar's comment; that's what the code below does. – Ben B-L Sep 27 '17 at 17:06
  • strip.white eliminates all of the blank lines later in the file, which I need to keep. Since the question refers to importing multiple .csv files, I was unsure of how to actually put in an example, as it is more of a theoretical question. Many apologies there. – Jeff Sep 27 '17 at 17:16

2 Answers


Here's a bit of code that does options 1 and 2 above. I'm not sure why you'd want to insert dummy line(s) given the ability to do 1 and 2; it's straightforward to do (a sketch of a safer variant follows the output below), but it's usually not a good idea to modify raw data files.

# Create some test files: no leading blank line, one, and two
cat("x,y", "1,2", sep="\n", file = "blank0.csv")
cat("", "x,y", "1,2", sep="\n", file = "blank1.csv")
cat("", "", "x,y", "1,2", sep="\n", file = "blank2.csv")


files <- list.files(pattern = "\\.csv$", full.names = TRUE)

for(i in seq_along(files)) {
  # Read the raw lines and count the leading blank lines
  filedata <- readLines(files[i])
  lines_to_skip <- min(which(filedata != "")) - 1
  # Report the file and how many lines need skipping
  cat(i, files[i], lines_to_skip, "\n")
  # Import, skipping only the leading blanks
  # (pass blank.lines.skip = FALSE here if the blank lines further down
  #  the file should come through as rows of NA)
  x <- read.csv(files[i], skip = lines_to_skip)
}

This prints

1 ./blank0.csv 0 
2 ./blank1.csv 1 
3 ./blank2.csv 2 

and reads in each dataset correctly.
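
If you do end up wanting something along the lines of option 3, here is a minimal sketch of a safer variant: rather than inserting dummy numbers into the raw files, it writes copies with the leading blank lines stripped into a separate folder (the folder name "cleaned" is just an example), so the originals stay untouched.

# Sketch only: copy each .csv into a "cleaned" subfolder with its
# leading blank lines removed; blank lines later in the file are kept
dir.create("cleaned", showWarnings = FALSE)

for(f in list.files(pattern = "\\.csv$")) {
  filedata <- readLines(f)
  lines_to_skip <- min(which(filedata != "")) - 1
  writeLines(filedata[(lines_to_skip + 1):length(filedata)],
             file.path("cleaned", f))
}

read.csv can then read the files in "cleaned" with no skip argument at all.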

Ben B-L
  • Brilliant. Completely error checks everything in the console. At some point I am going to sort it out to put the output in a data frame or something that only reads out those with lines to skip (i.e. the ones missing rows). Thanks! – Jeff Sep 27 '17 at 17:25
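
Following up on that comment, here is a minimal sketch (the name skip_summary is just illustrative) that collects the per-file counts into a data frame and keeps only the files that actually start with blank lines:

files <- list.files(pattern = "\\.csv$", full.names = TRUE)

# One row per file: its name and the number of leading blank lines
skip_summary <- data.frame(
  file = files,
  lines_to_skip = sapply(files, function(f) min(which(readLines(f) != "")) - 1,
                         USE.NAMES = FALSE)
)

# Keep only the problem files, i.e. those with at least one leading blank line
skip_summary[skip_summary$lines_to_skip > 0, ]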

I believe that the two functions that follow can do what you need.
First, a function to determine which files have a blank second line.

second_blank <- function(path = ".", pattern = "\\.csv"){
    # List the candidate files and read the second line of each one
    fls <- list.files(path = path, pattern = pattern)
    second <- sapply(fls, function(f) readLines(file.path(path, f), n = 2)[2])
    # A second line that is empty, or contains only commas, counts as blank
    which(nchar(gsub(",", "", second)) == 0)
}

Then, a function to read in those files, one at a time. Note that I assume that the first line is the column header and that at least the second line is blank. There is a dots argument, `...`, for you to pass other arguments to `read.table`, such as `stringsAsFactors = FALSE`, or `blank.lines.skip = FALSE` if the blank lines further down the file should be kept as rows of `NA`.

skip_blank <- function(file, ...){
    # Read the header (assumed to be the first line of the file)
    header <- readLines(file, n = 1)
    header <- strsplit(header, ",")[[1]]
    # Count the lines to skip: the header plus the blank (or comma-only)
    # lines that follow it
    count <- 1L
    while(TRUE){
        txt <- scan(file, what = "character", sep = "\n", skip = count,
                    nlines = 1, blank.lines.skip = FALSE, quiet = TRUE)
        if(length(txt) > 0 && nchar(gsub(",", "", txt)) > 0) break
        count <- count + 1L
    }
    # Read the data below the blank lines (header = FALSE so that the first
    # data row is not consumed as column names), then restore the real names
    dat <- read.table(file, skip = count, header = FALSE, sep = ",",
                      dec = ".", fill = TRUE, ...)
    names(dat) <- header
    dat
}

Now, an example usage.

second_blank(pattern = "csv")  # a first run as an example usage
inx <- second_blank()          # this will be needed later

fl_names <- list.files(pattern = "\\.csv")  # get all the CSV files

df_list <- lapply(fl_names[inx], skip_blank)  # read the problem ones
names(df_list) <- fl_names[inx]               # tidy up the result list
df_list
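
If it helps, a small extension of that example (the names ok_names, ok_list and all_data are just illustrative) reads the remaining files, the ones without a blank second line, with plain read.csv and combines both sets into a single named list:

# Read the non-problem files directly, then combine both sets into one
# named list, restoring the original file order
ok_names <- setdiff(fl_names, fl_names[inx])
ok_list  <- lapply(ok_names, read.csv)
names(ok_list) <- ok_names

all_data <- c(df_list, ok_list)
all_data <- all_data[fl_names]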
Rui Barradas