R: Reading specific columns from txt files with slightly different column headers (differing spaces) and binding them?

Question

I have many txt files that contain the same type of numerical data in columns separated by ;. Some files have column headers with spaces and some don't (they were created by different people). Some also have extra columns that I don't want.

e.g. one file might have a header like:

ASomeName; BSomeName; C(someName%)

whereas another file header might be

A Some Name; B Some Name; C(someName%); D some name

How can I clean the spaces out of the names before I call a "read" command?

# These are the files I have

filenames <- list.files(pattern = "*.txt", recursive = TRUE, full.names = TRUE) %>% as_tibble()

# These are the columns I would like:

colSelect <- c("Date", "Time", "Timestamp", "PM2_5(ug/m3)", "PM10(ug/m3)", "PM01(ug/m3)", "Temperature(C)", "Humidity(%RH)", "CO2(ppm)")

# This is how I read them if they have the same columns

ldf <- vroom::vroom(filenames, col_select = colSelect, delim = ";", id = "sensor") %>% janitor::clean_names()

Clean Headers script

I've written a destructive script that reads each file in full, cleans the spaces out of the header, deletes the file, and re-writes it under the same name (vroom sometimes complained that it could not open X thousand files). Not an efficient way of doing things.

cleanHeaders<-function(filename){
  d<-vroom::vroom(filename,delim=";")%>%janitor::clean_names()
  #print(head(d))
  if (file.exists(filename)) {
    #Delete file if it exists
    file.remove(filename)
  }
  vroom::vroom_write(d,filename,delim = ";")
}

lapply(filenames,cleanHeaders)

pheymanss · Accepted Answer · 2021-04-05T14:54:15.997

fread's select parameter accepts integer column indexes. If the desired columns are always in the same position, your job is done.

colIndexes = c(1,3,4,7,9,18,21)
data = lapply(filenames, fread, select = colIndexes)

I imagine vroom also has this capability, but since you are already selecting only the columns you want, I don't think lazily evaluating your character columns would help much, so I advise you to stick with data.table.

For a more robust solution though, since you have no control over the structure of the tables: you can read one row of each file, capture and clean the column names, and then match them against a clean version of your colSelect vector.
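
To see why cleaning first makes the match possible, here is a small sketch using the two header styles from the question. It assumes the janitor package is installed; the vectors header_a and header_b are made up for illustration and are not read from real files.

```r
library(janitor)

# Header fields as they might come out of two files (hypothetical examples
# modelled on the question): same columns, different spacing, one extra column.
header_a <- c("ASomeName", "BSomeName", "C(someName%)")
header_b <- c("A Some Name", "B Some Name", "C(someName%)", "D some name")

clean_a <- make_clean_names(header_a)
clean_b <- make_clean_names(header_b)

# After cleaning, the shared columns carry the same names in both vectors,
# so match() can report where each column of file A sits in file B.
idx_in_b <- match(clean_a, clean_b)
print(idx_in_b)
```

The same idea applies per file: clean the header, clean the wanted names, and match one against the other.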

library(data.table)
library(janitor)
library(purrr)

# pattern is a regular expression, so anchor on the .txt extension
filenames <- list.files(pattern = "\\.txt$",
                        recursive = TRUE,
                        full.names = TRUE)

# read the first row of data to capture and clean the column names
clean_col_names <- function(filename){
  colnames(janitor::clean_names(fread(filename, nrows = 1)))
}

clean_column_names <- map(.x = filenames, 
                          .f = clean_col_names)

# clean the colSelect vector
colSelect <- janitor::make_clean_names(c("Date",
                                         "Time",
                                         "Timestamp" ,
                                         "PM2_5(ug/m3)",
                                         "PM10(ug/m3)",
                                         "PM01(ug/m3)",
                                         "Temperature(C)",
                                         "Humidity(%RH)",
                                         "CO2(ppm)"))

# match each set of column names against the clean colSelect
# (match() returns NA for any wanted column that a file does not have)
select_indices <- map(.x = clean_column_names, 
                      .f = function(cols) match(colSelect, cols))

# use map2 to read only the matched column indexes for each file
data <- purrr::map2(.x = filenames, 
                    .y = select_indices, 
                    ~fread(input = .x, select = .y))

(purrr can easily be replaced here with plain lapply calls; I opted for purrr because of its cleaner formula notation.)
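
To end up with one bound table that also records which file each row came from (what vroom's id argument did in the question), one option is to rename each table to the cleaned names and bind with data.table's rbindlist. This is a minimal sketch under stated assumptions, not a drop-in replacement: the two temporary files, the three-column wanted vector, and the helper read_selected are all hypothetical and exist only to make the example runnable.

```r
library(data.table)
library(janitor)

# Two hypothetical files with differently spaced headers (illustration only).
dir <- tempfile("sensors")
dir.create(dir)
files <- file.path(dir, c("a.txt", "b.txt"))
writeLines(c("Date;Time;CO2(ppm)", "2021-04-01;10:00;400"), files[1])
writeLines(c("Date; Time; Extra col; CO2 (ppm)", "2021-04-02;11:00;9;410"),
           files[2])

wanted <- make_clean_names(c("Date", "Time", "CO2(ppm)"))

# Hypothetical helper: read only the wanted columns of one file and give
# them the cleaned names, so every table ends up with identical columns.
read_selected <- function(file, wanted) {
  header <- names(fread(file = file, sep = ";", header = TRUE, nrows = 0))
  idx <- match(wanted, make_clean_names(header))
  # match() returns NA for a wanted column the file lacks; fail early here.
  if (anyNA(idx)) stop("Missing columns in ", file)
  dt <- fread(file = file, sep = ";", header = TRUE, select = idx)
  setnames(dt, old = header[idx], new = wanted)
  setcolorder(dt, wanted)
  dt
}

tables <- lapply(files, read_selected, wanted = wanted)
names(tables) <- basename(files)

# idcol turns the list names into a column, much like vroom's id argument.
combined <- rbindlist(tables, idcol = "sensor")
print(combined)
```

Stopping on a missing column is a design choice for the sketch; skipping the file or filling the column with NA would be equally reasonable depending on the data.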

Thank you very much for your time and effort. One of the benefits of vroom was that I could easily ask it to create a column with the name of the file the data came from. Is that something purrr does? — HCAI, Apr 01 '21 at 21:08
You can directly replace the fread call with vroom, and add the corresponding parameters into the formula expression. I got too accustomed to using purrr + data.table, so I would do it by directly adding a data.table mutate statement right inside the formula notation: `~fread(input = .x, select = .y)[, filename = .x]` — pheymanss, Apr 02 '21 at 15:53
Thank you. I'm finding that it says that `[, filename = .x]` is an unused argument. When I replace the whole of fread for `~vroom(file = .x,col_select = .y, id="file")` it complains of unknown columns. Strange, no? — HCAI, Apr 04 '21 at 13:02
Oh, sorry about that, it was an error in my syntax. data.table's mutate operations are done with the walrus operator `:=` instead of the equals sign `=`. So it should be `[, filename := .x]`. Regarding vroom, I am not familiar with it unfortunately. — pheymanss, Apr 05 '21 at 14:50
Thank you. It still doesn't work for me. So what I've done instead is make a data frame with nested lists for each file: `tibble(sensor=filenames) %>% mutate(sensor=stringr::str_match(sensor, "/\\s*(.*?)\\s*/")[,2],.keep="unused") %>% mutate(file_contents=purrr::map2(.x = filenames, .y = select_indices, ~fread(input = .x, select = .y)) ) %>% unnest(cols = file_contents)` — HCAI, Apr 06 '21 at 08:46
