0

I have searched a lot and none of the similar sounding posts are the same as mine. I am struggling for last few days to get this apparently simple job done.

I have created a simple example to demonstrate the problem:

Let us assume there are 4 csv files, r.trials1-jan1, ..., rtrials1-jan4

This is the directory.

dir(pattern = "r.trial*")
[1] "r.trials-jan1.csv" "r.trials-jan2.csv" "r.trials-jan3.csv" "r.trials-jan4.csv"

Now I store all 4 files in a vector filenames, as follows.

filenames <- list.files(pattern = "r.trial*")

So far so good. Now comes the challenge.

wind_data <- lapply(filenames, read.csv)
combined_data <- rbind(wind_data)

The combined_data object does not appear in a single large data frame. Instead, is broken into as many lists as the number of csv files...This is not what I want.

I can get the files into one large data frame successfully if I do the read.csv one by one.

Like this one

x1 <- read.csv(filenames[1])
x2 <- read.csv(filenames[2])

and then doing an rbind

x12 <- rbind(x1, r2)

See the difference in the data structure between x1, x12 and combined_data here:

str(x1)
'data.frame':   24 obs. of  5 variables:
 $ date    : Factor w/ 24 levels "1-Jan-2017 0:00:00",..: 1 12 17 18 19 20 21 22 23 24 ...
 $ pressure: num  2.541 4.729 7.569 0.784 1.526 ...
 $ temp    : num  30.8 12.3 53 45.7 18.2 ...
 $ speed   : num  296.9 104.68 8.18 260.2 40.23 ...
 $ dia     : num  920 664 806 427 824 ...`

The above is one single csv file imported into a df.

str(x12)
'data.frame':   48 obs. of  5 variables:
 $ date    : Factor w/ 48 levels "1-Jan-2017 0:00:00",..: 1 12 17 18 19 20 21 22 23 24 ...
 $ pressure: num  2.541 4.729 7.569 0.784 1.526 ...
 $ temp    : num  30.8 12.3 53 45.7 18.2 ...
 $ speed   : num  296.9 104.68 8.18 260.2 40.23 ...
 $ dia     : num  920 664 806 427 824 ...`

The above is two csvs combined one by one.

But with a large number of files, the above approach becomes very tedious. Hence the lapply() function was used to get all csv into one file.

And here is the structure output of combined_data.

str(combined_data)
    List of 4
     $ :'data.frame':   24 obs. of  5 variables:
      ..$ date    : Factor w/ 24 levels "1-Jan-2017 0:00:00",..: 1 12 17 18 19 20 21 22 23 24 ...
      ..$ pressure: num [1:24] 2.869 7.881 0.908 4.616 2.719 ...
      ..$ temp    : num [1:24] 14 61.4 52.7 97.5 99 ...
      ..$ speed   : num [1:24] 267.9 36.4 231.7 299.5 203 ...
      ..$ dia     : num [1:24] 880 932 514 661 580 ...
     $ :'data.frame':   24 obs. of  5 variables:
      ..$ date    : Factor w/ 24 levels "2-Jan-2017 0:00:00",..: 1 12 17 18 19 20 21 22 23 24 ...
      ..$ pressure: num [1:24] 4.96 9.57 0.34 5.18 7.34 ...
      ..$ temp    : num [1:24] 26.5 74.5 76.8 52.8 68.2 ...
      ..$ speed   : num [1:24] 238.3 37 16.4 30.8 12.2 ...
      ..$ dia     : num [1:24] 163 548 161 631 437 ...
     $ :'data.frame':   24 obs. of  5 variables:
      ..$ date    : Factor w/ 24 levels "3-Jan-2017 0:00:00",..: 1 12 17 18 19 20 21 22 23 24 ...
      ..$ pressure: num [1:24] 9.79 7.01 5.7 2.46 2.46 ...
      ..$ temp    : num [1:24] 76.8 11.9 30.6 16.2 90.9 ...
      ..$ speed   : num [1:24] 208.6 240 270.1 46.4 224.5 ...
      ..$ dia     : num [1:24] 50.6 374.9 265.2 816 315.5 ...
     $ :'data.frame':   24 obs. of  5 variables:
      ..$ date    : Factor w/ 24 levels "4-Jan-2017 0:00:00",..: 1 12 17 18 19 20 21 22 23 24 ...
      ..$ pressure: num [1:24] 0.761 3.384 8.696 3.355 9.007 ...
      ..$ temp    : num [1:24] 42.9 94 4.7 44.9 74 ...
      ..$ speed   : num [1:24] 199.73 223.39 128.77 56.29 6.64 ...
      ..$ dia     : num [1:24] 832 764 389 293 686 ...
     - attr(*, "dim")= int [1:2] 1 4
     - attr(*, "dimnames")=List of 2
      ..$ : chr "wind_data"
      ..$ : NULL`

So my questions are

  1. What are other ways to convert many csv files into one large data frame and not the list of dataframes?

  2. Why is the lapply read.csv not combining all csv into one?


Uwe
  • 41,420
  • 11
  • 90
  • 134
Lazarus Thurston
  • 1,197
  • 15
  • 33
  • More likely than not the data column came in as character data. Run `str(combined_data)` to see if that column is character. If so, just convert it to numerical `combined_data$vel<-as.numeric(combined_data$vel)` should do it. Frequently CSV files are all character, even the numbers and dates,s you need to convert them back upon importing them. – sconfluentus Apr 05 '17 at 21:27
  • Apologies if this appeared to be a duplicate question. But I can't find the answer. If you are kind enought to attach a link to the exact duplicate question I will appreciate. – Lazarus Thurston Apr 06 '17 at 13:23
  • It may not be a duplicate...there is not enough here to say. Try running the `str()` and post the results – sconfluentus Apr 06 '17 at 18:11
  • and if you can show a bit of data, just a few lines...use` dput(x12)`... – sconfluentus Apr 06 '17 at 18:17
  • Apologies for the delay in coming back and thanks @bethany for the guidance in reposting details. I have edited the entire question. Please read and let me know if anyone has a clear solution. – Lazarus Thurston Apr 07 '17 at 20:04
  • The question was that they would not row bind. It seems that your dates are coming in as factors, each frame has different levels, this might be problematic. if you use `as.POSIXct(file$date)` (those are made up file names to get the dates into the same format, you might be able to open all of them and simply `rbind`. See if fixing the format of dates works. If it jams up, try `as.POSIXct(as.character(file$date))`...it is implicit in posix but fails at times – sconfluentus Apr 07 '17 at 20:21
  • The reason your `lapply` created a list of frames is because `lapply` iterates to a list...so it appended each frame into a list. – sconfluentus Apr 07 '17 at 20:21
  • Thanks @bethanyP but I got an error in both the functions. `Error in as.POSIXlt.character(as.character(x), ...) : character string is not in a standard unambiguous format` – Lazarus Thurston Apr 08 '17 at 11:52
  • We can look at factor to date connversion later also. Right now the big challenge is how to get the data into one full data.frame. – Lazarus Thurston Apr 08 '17 at 11:56
  • `as.POSIXct(x1$date, format ="%d-%b-%Y %H:%M:%S)"` this tells it what format the day, month, year, hours, minutes and seconds are in as well as their separators. You can see this here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html – sconfluentus Apr 08 '17 at 14:55
  • the reason it WONT go into data frames is because the factor levels do not agree in each frame. Once they do, you will be able to row bind them. – sconfluentus Apr 08 '17 at 14:56
  • thanks @bethany. You gave enough inspiration to find my solution. The _as.POSIXct_ was giving an error to left it. Then I tried this. `read_all <- lapply(filenames,read.csv,stringsAsFactor=FALSE)` While it imported all files into `read_all`, the structure of read_all still remains a _list of dataframes. I fiddled around a bit and then this piece of code is **successfully** importing all the files into one data frame x. `filenames<-list.files(pattern="r.trial*") x<-NULL for(name in filenames){ x<-rbind(x,read.csv(name,stringsAsFactors=F)) } print(x)' Yaay!! – Lazarus Thurston Apr 08 '17 at 17:00
  • You're pretty near with your approach but `rbind` doesn't expect of list of objects, but the objects themselves (see `?rbind`). Please, try `library(data.table); rbindlist(lapply(filenames, fread)` instead. `rbindlist` expects a list of data.frame objects as input. – Uwe Apr 09 '17 at 11:46
  • You need to rbind the individual elements not entire list. Use do.call: `do.call(rbind, wave_data)` which is equivalent to many items passed in rbind: `rbind(df1, df2, df3, ...others in wave_data list)` – Parfait Apr 09 '17 at 15:31
  • hmmm.. I have made a note of `do.call` and data.table library. Will try these approaches soon. At the moment my `rbind` inside a **for- loop** is producing the desired results. Ofcourse it may not be the most optimum solution. Thanks @Parfait and @UweB – Lazarus Thurston Apr 09 '17 at 16:26

0 Answers0