Read multiple CSV into R based on where header begins and compile

Question

I'm trying to read in several CSVs with headers that begin on different rows and then map them into one data frame. I tried the code provided here, but I couldn't get the function to work.

Read CSV into R based on where header begins

Here are two example DFs:

file1 <- structure(list(X..Text = c("# Text", "#", "agency_cd", "5s", 
"USGS", "USGS"), X = c("", "", "site_no", "15s", "4294000", "4294000"
), X.1 = c("", "", "datetime", "20d", "6/24/13 0:00", "6/24/13 0:15"
), X.2 = c("", "", "tz_cd", "6s", "EDT", "EDT"), X.3 = c("", 
"", "Gage height", "14n", "1.63", "1.59"), X.4 = c("", "", " Discharge", 
"14n", "1310", "1250")), class = "data.frame", row.names = c(NA, 
-6L))

file2 <- structure(list(X..Text = c("# Text", "# Text", "#", "agency_cd", 
"5s", "USGS", "USGS"), X = c("", "", "", "site_no", "15s", "4294002", 
"4294002"), X.1 = c("", "", "", "datetime", "20d", "6/24/13 0:00", 
"6/24/13 0:15"), X.2 = c("", "", "", "tz_cd", "6s", "EDT", "EDT"
), X.3 = c("", "", "", "Gage height", "14n", "1.63", "1.59"), 
X.4 = c("", "", "", " Discharge", "14n", "1310", "1250")), class = 
"data.frame", row.names = c(NA, 
-7L))

I would like to use a similar solution to the related question I asked above, though I also need to skip the line after the header (header row = row that starts with "agency_cd"), and then do something similar to this to bind all the CSVs into one data frame with the file names in a column:

# Path to the data
data_path <- "Data/folder1/folder2"

# Bind all files together to form one data frame
discharge <-

  # Find all file names ending in CSV in all subfolders
  dir(data_path, pattern = "*.csv", recursive = TRUE) %>% 

  # Create a dataframe holding the file names
  data_frame(filename = .) %>% 

  # Read in all CSV files into a new data frame, 
  # Create a new column with the filenames
  mutate(file_contents = map(filename, ~ read_csv(file.path(data_path, .), col_types = cols(.default = "c")))
    ) %>% 

  # Unpack the list-columns to make a useful data frame
  unnest()

If using the example function provided in the related question above: A) I can't get the header_begins line to give me a vector, and B) I don't know how to then incorporate the function in the read_csv function above.

As a start I tried this using the solution to the related question:

# Function
detect_header_line <- function(file_names, column_name) {
    header_begins <- NULL
    for(i in 1:length(file_names)){
      lines_read <- readLines(file_names[i], warn=F)
      header_begins[i] <- grep(column_name, lines_read)
    }
   }

# Path to the data
data_path <- "Data/RACC_2012-2016/discharge"

# Get all CSV file names
file_names = dir(data_path, pattern = "*.csv", recursive = TRUE)

# Get beginning rows of each CSV file
header_begins <- detect_header_line(file.path(data_path, file_names), 'agency_cd')

But the header_begins vector was empty. And if I can fix that, I still need help getting that incorporated into my code above.

Any help is greatly appreciated!

G. Grothendieck · Answer 1 · 2019-01-16T18:51:11.353

1

Using file1 shown in the question convert that to lines of text in Lines1 and then read that using read.table as shown and similarly for file2.

Lines1 <- capture.output(write.table(file1, stdout(), row.names = FALSE, quote = FALSE))
ix <- grep("agency", Lines1) # line number of header
DF1 <- read.table(text = Lines1[-c(seq_len(ix-1), ix+1)], header = TRUE)

giving:

> DF1
  agency_cd site_no datetime tz_cd Gage height Discharge
1      USGS 4294000  6/24/13  0:00  EDT   1.63      1310
2      USGS 4294000  6/24/13  0:15  EDT   1.59      1250

Updated

fixed.

edited Jan 16 '19 at 18:51

answered Jan 16 '19 at 18:45

G. Grothendieck

254,981
17
203
341

Any idea how to loop this with a list of file names? – D Kincaid Jan 16 '19 at 19:56
`lapply(filenames, function(filename) { Lines <- readLines(filename); ix <- grep("agency", Lines); read.table(text = Lines1[-c(seq_len(ix-1), ix+1)], header = TRUE)} )` – G. Grothendieck Jan 16 '19 at 20:25
I get this error when I run your slightly modified code: lapply(file.path(data_path, file_names), function(filename) { + Lines <- readLines(filename); + ix <- grep("agency", Lines); + read.table(text = Lines[-c(seq_len(ix-1), ix+1)], header = TRUE) + } ) Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 2 elements Called from: scan(file = file, what = what, sep = sep, quote = quote, dec = dec, nmax = nrows, skip = 0, . . . – D Kincaid Jan 16 '19 at 22:05

Dave2e · Answer 2 · 2019-01-17T00:07:37.663

Here is a base R solution which puts the process of finding the header row and then reading the file into a loop to process directory of files.

    #define column names
    #columnnames<-c("agency_cd","site_no", "datetime", "tz_cd", "Gage height", "Discharge")

    #find files that match pattern
    fname<-dir( pattern = "file[0-9]\\.csv")

    #loop and read all files
    dfs<-lapply(fname, function(f) {
    #find header row
    headerline<-grep("agency_cd", readLines(f))
    #read data with header row and following row
    #by reading the header row bind will align the columns
    df<- read.csv(f, skip=headerline-1, stringsAsFactors = FALSE)
})

finalanswer<-do.call(rbind, dfs)

> finalanswer
#  agency_cd site_no     datetime tz_cd Gage.height Discharge
#        5s     15s          20d    6s         14n       14n
#      USGS 4294000 6/24/13 0:00   EDT        1.63      1310
#      USGS 4294000 6/24/13 0:15   EDT        1.59      1250
#        5s     15s          20d    6s         14n       14n
#      USGS 4294002 6/24/13 0:00   EDT        1.63      1310
#      USGS 4294002 6/24/13 0:15   EDT        1.59      1250

Now one needs to remove the rows without the USGS and then convert the columns from character to numeric.

Note the "\." in the dir function, the dot has a special meaning in regular expressions. The dot means any character. For the dot to mean just a period then escape it with the double \ in R.

Thanks! This could work. The issue is that there are a lot of files and I can't be sure the columns are always in the same order. For example, I noticed in some files the columns "Gage height" and "Discharge" are swapped, i.e., "Discharge" is first. Is there any way to ensure that when I compile the dfs that the appropriate columns match up? — D Kincaid, Jan 16 '19 at 19:37
Using read.csv(f, skip=headerline-1, stringsAsFactors = FALSE) definitely got rid of some errors that were thrown during the do.call(rbind, dfs). But, as suspected, because "Gage Height" and "Discharge" columns are not in the same order all the time, Gage Height values sometimes end up in the Discharge column and vice versa. — D Kincaid, Jan 16 '19 at 21:37
@DKincaid, Sorry, needed to removed the line which renamed the columns. The above code was tested and should solve the problem. — Dave2e, Jan 17 '19 at 00:08

score 0 · Accepted Answer · answered Jan 17 '19 at 17:19

I found 2 solutions. This first one uses most @Dave2e's solution, but instead of using do.call(rbind, dfs) to bind all the dfs into one, I used dplyr::bind_rows(). The do.call(rbind, dfs) didn't work because my header columns ended up having slightly different names sometimes, whic resulted in this error: Error in match.names(clabs, names(xi)) : names do not match previous names. dplyr::bind_rows() was more flexible with different column names. I also use readr::read_csv in place of read.csv for personal preferences.

# First solution using most of @Dave2e's solution
library(tidyverse)

# Path to the data
data_path <- "Data/RACC_2012-2016/discharge"
# Get all CSV file names
file_names = dir(data_path, pattern = "*.csv", recursive = TRUE)

# Loop and read all files
dfs <- lapply(file.path(data_path, file_names), function(f) {
  # Find header row
  headerline <- grep("agency_cd", readLines(f))
  # Read data with header row and following row
  # by reading the header row bind will align the columns
  df <- read_csv(f, col_types = cols(.default = "c"), skip = headerline-1)
}) %>% 
  # Bind all into one data frame
  bind_rows() %>% 
  # Filters the row below the header row that doesn't contain data
  dplyr::filter(agency_cd != "5s") %>% 
  # Combine "Gage Height" and "Gage height" columns into one
  # First rename the columns to make them easier to call
  rename(Gage_height = "Gage Height", Gage_height2 = "Gage height") %>% 
  mutate(Gage_height = ifelse(is.na(Gage_height), Gage_height2, Gage_height)) %>% select(-Gage_height2)

The second solution accomplishes the same thing as solution 1, except it also allowed me to add the original file names as a column in the final data frame. Instead of lapply as above I use purrr::map. And I also use the package fs to deal with file paths.

# Second solution

library(tidverse)
library(fs)

# Path to the data
data_path <- "Data/RACC_2012-2016/discharge"

# Bind all files together to form one data frame
discharge <-
  # Find all file names ending in CSV in all subfolders
  fs::dir_ls(data_path, regexp = "*.csv", recursive = TRUE) %>% 
  # Create a dataframe holding the file names
  data_frame(filename = .) %>% 
  # Read in all CSV files into a new data frame, 
         # Create a new column with the filenames
  mutate(file_contents = map(filename, 
                             # Here we append path to the data before the file name & force all columns to be as character
                             # because the typecasting was causing problems
                             # We use skip = grep("agency_cd", readLines(.))-1)) to find header row
                             ~ read_csv(., col_types = cols(.default = "c"), skip = grep("agency_cd", readLines(.))-1))
        ) %>% 
  # Unpack the list-columns to make a useful data frame
  unnest() %>% 
  # Filters the row below the header row that doesn't contain data
  dplyr::filter(agency_cd != "5s") %>% 
  # Combine "Gage Height" and "Gage height" columns into one
  # First rename the columns to make them easier to call
  rename(Gage_height = "Gage Height", Gage_height2 = "Gage height") %>% 
  mutate(Gage_height = ifelse(is.na(Gage_height), Gage_height2, Gage_height)) %>% select(-Gage_height2)

Thanks to everyone for their help! I also got help from: https://serialmentor.com/blog/2016/6/13/reading-and-combining-many-tidy-data-files-in-R and How to import multiple .csv files at once?

Read multiple CSV into R based on where header begins and compile

3 Answers3

Updated