
I am new to R. I have a directory containing many data frame files that share the same structure (3 columns separated by ","): date-and-time stamp, V2, V3.

I need to calculate the mean, median, SD, variance and skewness of columns V2 and V3 for every file, and then save these 10 columns in a separate file. Each column (for example, V2 median) should contain the medians of the V2 columns of all files. I need your help with that.

Thanks

rafa.pereira
  • What have you tried so far? You can use `.SD` in data.table to apply an aggregation function to all columns in a dataset – Gaurav Taneja May 01 '16 at 14:17

2 Answers


If I understand your question correctly, here is a fast solution using `data.table`.

# Load libraries
library(data.table)
library(moments)  # for skewness()
library(readr)

# Get a list of `.csv` files in your folder
filenames <- list.files("C:/your/folder", pattern = "*.csv", full.names = TRUE)

# Load and bind all data sets into one single data.table
df <- rbindlist(lapply(filenames, fread))

# Run the summary statistics for V2 and V3
output <- df[, .(  V2_mean   = mean(V2)
                 , V2_median = median(V2)
                 , V2_SD     = sd(V2)
                 , V2_var    = var(V2)
                 , V2_skw    = skewness(V2)
                 , V3_mean   = mean(V3)
                 , V3_median = median(V3)
                 , V3_SD     = sd(V3)
                 , V3_var    = var(V3)
                 , V3_skw    = skewness(V3)) ]

# Save the output to a different file
write_csv(output, "output.csv")
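Note that the code above pools the rows of all files together before summarising. If instead you want one row of statistics per file, a minimal sketch is to pass `idcol` to `rbindlist()` and group with `by` (the in-memory tables `dt1`/`dt2` below are hypothetical stand-ins for your CSV files):

```r
library(data.table)
library(moments)  # for skewness()

# Hypothetical stand-ins for two loaded CSV files
set.seed(1)
dt1 <- data.table(V1 = "2016-05-01 00:00", V2 = rnorm(10), V3 = rnorm(10))
dt2 <- data.table(V1 = "2016-05-02 00:00", V2 = rnorm(10), V3 = rnorm(10))

# idcol adds a column recording which list element each row came from
df <- rbindlist(list(file1 = dt1, file2 = dt2), idcol = "file")

# Same summary as above, but computed separately per file
output <- df[, .(V2_mean = mean(V2), V2_median = median(V2),
                 V2_SD = sd(V2), V2_var = var(V2), V2_skw = skewness(V2),
                 V3_mean = mean(V3), V3_median = median(V3),
                 V3_SD = sd(V3), V3_var = var(V3), V3_skw = skewness(V3)),
             by = file]
```

With `lapply(filenames, fread)` you would name the list elements after `filenames` to get the file name in the `file` column.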
    `data.table` recently has `fwrite` implemented (and runs in parallel). Check [this post](http://stackoverflow.com/questions/10505605/speeding-up-the-performance-of-write-table/36465497#36465497) for a benchmark (~17x faster than `write.csv`) and [this blog](http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/) for a nice explanation. And `write_csv` seems slower than `write.csv`. – Arun May 01 '16 at 19:57
    Hi @Arun, I've used `fwrite{data.table}` but didn't use it in my answer because I believe it is still in the development version of the `data.table` package. Apart from that, `fwrite` is the cherry on top of the cake, making data.table the fastest package for data manipulation. – rafa.pereira May 01 '16 at 20:56
    Rafael, thanks. The comment was mostly to inform other users who read this post about `fwrite`. – Arun May 01 '16 at 20:59
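For readers of the thread above: `fwrite()` is a drop-in replacement for the `write_csv()` call in the answer. A minimal sketch (the file path is illustrative; a temporary directory is used here so the example is self-contained):

```r
library(data.table)

# Any data.frame/data.table works as input
out <- data.table(x = 1:3, y = c("a", "b", "c"))

# fwrite() writes CSV much faster than write.csv()/write_csv()
path <- file.path(tempdir(), "output.csv")
fwrite(out, path)
```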

One more method:

# This library is needed to compute skewness
library(e1071)

# Set the directory where your files are present as the working directory
setwd("Directory where your files are present")

# Store the file names (note: the dot must be escaped in the regex pattern)
files <- list.files(path = ".", pattern = "\\.csv$", ignore.case = TRUE)

file <- NULL
for (i in seq_along(files)) {
  current_file <- read.table(files[i], header = TRUE, sep = ",")

  v2_mean     <- sapply(current_file[2], mean, na.rm = TRUE)
  v2_median   <- sapply(current_file[2], median, na.rm = TRUE)
  v2_SD       <- sapply(current_file[2], sd, na.rm = TRUE)
  v2_variance <- sapply(current_file[2], var, na.rm = TRUE)
  v2_Skew     <- sapply(current_file[2], skewness, na.rm = TRUE)

  v3_mean     <- sapply(current_file[3], mean, na.rm = TRUE)
  v3_median   <- sapply(current_file[3], median, na.rm = TRUE)
  v3_SD       <- sapply(current_file[3], sd, na.rm = TRUE)
  v3_variance <- sapply(current_file[3], var, na.rm = TRUE)
  v3_Skew     <- sapply(current_file[3], skewness, na.rm = TRUE)

  # Append one row of statistics per file
  file <- rbind.data.frame(file, c(v2_mean, v2_median, v2_SD, v2_variance, v2_Skew,
                                   v3_mean, v3_median, v3_SD, v3_variance, v3_Skew))
}
names(file) <- c("v2_mean", "v2_median", "v2_SD", "v2_variance", "v2_Skew",
                 "v3_mean", "v3_median", "v3_SD", "v3_variance", "v3_Skew")

# The final file will be saved in the working directory
write.csv(file, "file_stats.csv")
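A hedged variant of the loop above: growing a data frame with `rbind()` inside a `for` loop is slow for many files; `lapply()` plus a single `do.call(rbind, ...)` avoids that. The helper name `summarise_file` is hypothetical, not part of the original answer:

```r
library(e1071)  # for skewness()

# Hypothetical helper: one row of the 10 statistics for a single file
summarise_file <- function(f) {
  d <- read.table(f, header = TRUE, sep = ",")
  stats <- function(x) c(mean     = mean(x, na.rm = TRUE),
                         median   = median(x, na.rm = TRUE),
                         SD       = sd(x, na.rm = TRUE),
                         variance = var(x, na.rm = TRUE),
                         Skew     = skewness(x, na.rm = TRUE))
  c(v2 = stats(d[[2]]), v3 = stats(d[[3]]))
}

files  <- list.files(path = ".", pattern = "\\.csv$", ignore.case = TRUE)
result <- as.data.frame(do.call(rbind, lapply(files, summarise_file)))
```

The resulting column names come out as `v2.mean`, `v2.median`, ..., `v3.Skew`, so the separate `names()` assignment from the loop version is no longer needed.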
Gopal