0

I have multiple text files consisting of 18 or 20 columns. I would like to bind all files together, but in order to do so, I have to remove the first two columns in the files consisting of 20 columns (the two columns are date and time).

I cannot find out how to solve the issue (I am only a novice in R) "numbers of columns of arguments do not match". Thus, I would like to identify if the first two rows of the file are called date and time and then remove those columns. Here is the code I'm working on:

file_list <- list.files()

for (file in file_list){
    if (!exists("dataset")){
        dataset <- read.table(file, header=TRUE, sep="\t", stringsAsFactors=FALSE)
    if (colnames(dataset)[1] == "date" & colnames(dataset)[2] == "time"){
        dataset$date <- NULL
        dataset$time <- NULL
    }
}

if (exists("dataset")){
    temp_dataset <-read.table(file, header=TRUE, sep="\t", stringsAsFactors=FALSE)
    dataset<-rbind(dataset, temp_dataset)
    rm(temp_dataset)
    }

}

Thanks!

Juul
  • 7
  • 4
  • 1
    One alternative (if you like `dplyr`) is to use `dplyr::bind_rows`, as it matches columns by name and allows additional columns (which will be `NA`ed in the smaller tables). – r2evans Jul 07 '16 at 18:39
  • I think this should help you. http://stackoverflow.com/questions/5788117/only-read-limited-number-of-columns-in-r – user5249203 Jul 07 '16 at 18:40
  • Do you know before loading the text file if it will have one-too-many columns? If so, @user5249203's comment should address it. – r2evans Jul 07 '16 at 18:41
  • Did any of the answers resolve your problem? If so, it's customary on SO to accept an answer by checking the checkmark to the left of the preferred answer. – r2evans Jul 13 '16 at 14:26

3 Answers3

1

As @user5249203 commented, if you know based on the filename (or something else) before loading that a file has too many columns, then you can skip columns programmatically. If not, continue.

I'm going to assume that you are reading in your files using something like this:

fnames <- list.files(pattern = "*.csv", path = "some/dir")
# replace `read.csv` with whichever function you're using to read in the data
alldata <- sapply(fnames, read.csv, stringsAsFactors = FALSE, simplify = FALSE)

Lacking any files to read like that, I'll generate a fake alldata list:

set.seed(42)
fnames <- paste0("mtcars", 1:5)
alldata <- sapply(fnames, function(fn) {
  if (runif(1) < 0.7) mtcars[,-1] else mtcars
})
# should have 3 with 11 columns, 2 with 10 columns
sapply(alldata, ncol)
# mtcars1 mtcars2 mtcars3 mtcars4 mtcars5 
#      11      11      10      11      10 

No surprise, we can't rbind them using base R:

do.call("rbind", alldata)
# Error in rbind(deparse.level, ...) : 
#   numbers of columns of arguments do not match

dplyr

We can however use dplyr::bind_rows, though it will retain the unwanted column, causing the value of that column in the narrower tables to be NA:

library(dplyr)
str( bind_rows(alldata) )
# 'data.frame': 160 obs. of  11 variables:
#  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#  $ disp: num  160 160 108 258 360 ...
#  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec: num  16.5 17 18.6 19.4 17 ...
#  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

What you don't see in this str summarization is that some of the mpg variables are NA:

table(is.na(bind_rows(alldata)$mpg))
# FALSE  TRUE 
#    96    64 

(Remove it if desired.)

Base R

(Assuming you chose to not use dplyr). Start from here with your actual list of alldata:

numColumnsWanted <- 10    # you want this to be 18, I think
alldata2 <- lapply(alldata, function(dat) {
  # this grabs the *last* 'numColumnsWanted' columns
  if (ncol(dat) > numColumnsWanted) dat[, 1 + ncol(dat) - numColumnsWanted:1] else dat
})

Verify that the data.frames are all the same size. (You probably should also verify the column names:

sapply(alldata2, ncol)
# mtcars1 mtcars2 mtcars3 mtcars4 mtcars5 
#      10      10      10      10      10 

Now you should be able to rbind them safely:

str( do.call("rbind", alldata2) )
# 'data.frame': 160 obs. of  10 variables:
#  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#  $ disp: num  160 160 108 258 360 ...
#  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec: num  16.5 17 18.6 19.4 17 ...
#  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

($mpg is not present in this solution.)

r2evans
  • 141,215
  • 6
  • 77
  • 149
0

Consider using lapply() with an inverted grep() on the column names to remove date and time. Below works regardless where these two columns are positioned or if they exist in the smaller sets.

dfList <- lapply(file_list, function(f) {
                    df <- read.table(f, header=TRUE, sep="\t", stringsAsFactors=FALSE)
                    df <- df[grep("(date|time)", names(df), invert = TRUE)]
                 })

finaldf <- do.call(rbind, dfList)

Alternatively, the not matching regex pattern without using invert = TRUE:

dfList <- lapply(file_list, function(f) {
                    df <- read.table(f, header=TRUE, sep="\t", stringsAsFactors=FALSE)
                    df <- df[grep("[^(date|time)]", names(df))]
                 }) 

finaldf <- do.call(rbind, dfList)
Parfait
  • 104,375
  • 17
  • 94
  • 125
0

Thanks for the suggestions!

One solution that did work for me was replacing

dataset<-rbind(dataset, temp_dataset)

by

dataset<-rbind.fill(dataset, temp_dataset)

Missing data was replaced by NA's and I could easily delete the incomplete columns.

Juul
  • 7
  • 4