1

I am trying to merge multiple text files in one data file, but the first few lines and the last couple of lines don't have the same data-structure as the rest of my file. I want to combine multiple files with this type of structure. Doing so, I also would like to interpolate the timestamp given at each end of a file over the entire dataset.

First problems arise when trying to import the data, what I have tried so far:

file_list <- list.files()
for (file in file_list) {
  # if the merged dataset doesn't exist, create it
  if (!exists('dataset')) {
    dataset <- read.table(file, sep = ';', skip = 6, nrow = length(readLines(file)) - 4 -6)
  }
  # if the merged dataset does exist, append to it
  if (exists('dataset')) {
    temp_dataset <- read.table(file, sep = ';', skip = 6, nrow = length(readLines(file)) - 4 - 6)
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
}

But then I got the following error message:

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

Does anyone know how to do this?

shaalboom
  • 11
  • 2
  • Is the number of lines identical in all files? – Roland May 25 '18 at 11:07
  • You're reading only one the first file here, as soon as the dataset is created remaining files are skipped. I see no reason to give that back to a variable in anyway and don't remembre list.files() accepting a block to execute so I doubt this is really what is causing the error at all. – Tensibai May 25 '18 at 11:49
  • (well it may works on a random because of variables defined elsewhere and which you didn't show here) – Tensibai May 25 '18 at 11:50
  • @Roland, the number of lines differs from file to file – shaalboom May 25 '18 at 11:51
  • the error means your columns do not have same name in all file. Either they need to be rename (if all columns should have same name) or you need to do a merge not rbind (if some datasets have additional columns). In all cases, growing the final dataset like you're doing is nto good, you should import everything in a list (`l_files <- lapply(list.files, read.table, ....)`) and then merge everything afterwards. – Cath May 25 '18 at 12:35
  • 1
    see https://stackoverflow.com/q/2209258/4137985 for example (there are many other SO questions on the subject) – Cath May 25 '18 at 12:39

3 Answers3

1

A tidyverse option:

Steps

  • Read files by skipping the first row and bind datasets. I find
    purrr::map(read_csv) and purrr::reduce(bind_rows) is better than for loop.

  • Add a logical column called is_metadata.

  • Fill the NA values with in is_metadata column.

  • Filter the rows

One solution with tidyverse:

library(tidyverse)
library(stringr)
file_list <- list.files()
file_list %>% 
                map(read_csv, skip = 1) %>% 
                reduce(bind_rows) %>% 
                rename(X1 = `mmho/cm`) %>% # rename the column for simplicity
                mutate(is_metadata = if_else(
                                condition = str_detect(X1, "Profile|turned"),
                                true = X1,
                                false = NA_character_)) %>% # set the same column class of X1
                tidyr::fill(is_metadata, .direction = "up") %>% 
                filter(!is.na(Celcius)) %>% 
                # additional processing for timestamp
                # ...
                # ...
Imran Kocabiyik
  • 419
  • 5
  • 15
0
    dataset <- read.table(
          file, 
          sep = ',', 
          skip = 6, 
          nrow = length(readLines(file)) - 4 - 6
          )
0

Below the complete code on how we managed to merge all the datafiles and get the timestamps on there:

library(data.table)

file_list <- list.files()                       #list all the files in the directory
datalist <- vector('list', length(file_list))   #create datalist
timelist <- vector('list', length(file_list))   #create timelist

#create for-loop
#read data
#read time
#separate time
#define start- and endtime
#define time difference
#define time interval
#create vector with timestamps
#add timestamps to existing datafile

for (i in 1:length(file_list)) {
  datalist[[i]] <- read.table(file_list[i], sep = ',', skip = 6, nrow = length(readLines(file_list[i]))-4-6, stringsAsFactors = FALSE)
  timelist[[i]] <- read.table(file_list[i], sep = ',', skip = length(readLines(file_list[i]))-2, stringsAsFactors = FALSE)
  timelist[[i]][1,1] <- gsub('  CTD turned on at ', '', timelist[[i]][1,1])
  timelist[[i]][2,1] <- gsub('  CTD turned off at ', '', timelist[[i]][2,1])
  starttime <- strptime(timelist[[i]][1,1], format = '%m/%d/%Y %H:%M:%S')
  endtime <- strptime(timelist[[i]][2,1], format = '%m/%d/%Y %H:%M:%S')
  diff <- difftime(endtime, starttime, unit = 's')
  int <- diff/(nrow(datalist[[i]]) - 1)
  extratime <- seq(starttime, endtime, by = int)
  datalist[[i]] <- cbind(datalist[[i]], extratime)
  print(i)
}

dt1 <- rbindlist(datalist)
dt2 <- as.data.frame(dt1)
colnames(dt2) <- c('mmho/cm', 'celcius', 'dbars', 'hz', 'datetime')

write.table(dt2, 'M1traveller.csv', quote = FALSE, sep = ',', row.names = FALSE)
shaalboom
  • 11
  • 2