
I'm trying to analyze data acquired from experimental tests in which several variables were recorded. I've imported a dataframe into R and I want to compute some summary statistics from it. In particular, I want to fill an empty dataframe that has the same variable names as the imported dataframe, with statistical features like mean, median, mode, max, min and quantiles as rows for each variable. The input dataframes are roughly 60 columns x 250k rows each.

I've already managed to do this using apply as in the following lines of code for a single input file.

df[1,] <- apply(mydata,2,mean,na.rm=TRUE)
df[2,] <- apply(mydata,2,sd,na.rm=TRUE)
...

Now I need to do this in a for loop for a number of input files mydata_1, mydata_2, mydata_3, ... in order to build several summary statistics dataframes, one for each input file. I've tried several approaches with apply and assign, but I can't manage to access each row of interest in the output dataframes while cycling over the input files. I would like to do something like the code below (I know that this code does not work, it's just to give an idea of what I want to do). The output df dataframes are already defined and empty.

for (xx in 1:number_of_mydata_files) {
df_xx[1,]<-apply(mydata_xx,2,mean,na.rm=T)
df_xx[2,]<-apply(mydata_xx,2,sd,na.rm=T)
...
}

I can't remember the exact error message this code gives, but it does not run at all.

I'm quite a beginner in R, so I don't have much experience with the language. Is there a way to do this? Are there other functions that could be used instead of apply and assign?

EDIT:

I add here a simple table description that represents the input dataframes I’m using. Sorry for the poor data visualization right here. Basically the input dataframes I’m using are .csv imported files, looking like tables with the first row being the column description, aka the name of the measured variable, and the following rows being the acquired data. I have 250 000 acquisitions for each variable in each file, and I have something like 5-8 files like this being my input.

Current [A] | Force [N] | Elongation [%] | ...
—————————————————————————————————————

Value_a_1 | Value_b_1 | Value_c_1 | ...

I just want to obtain a data frame like this as an output, with the same variable names, but with statistical values as rows. For example, the first row, instead of being the first values acquired for each variable, would be the mean of the 250k acquisitions for each variable. The second row would be the standard deviation, the third the variance and so on. I’ve managed to build empty dataframes for the output summary statistics, with just the columns and no rows yet. I just want to fill them and do this iteratively in a for loop.

NelsonGon
GioP

1 Answer


Not sure what your data looks like but you can do the following where lst represents your list of data frames.

lst <- list(iris[,-5],mtcars,airquality)
# For each data frame in the list, compute the mean and sd of every column
lapply(seq_along(lst), 
       function(i) sapply(lst[[i]],function(x)
         data.frame(Mean=mean(x,na.rm=TRUE),
                    sd=sd(x,na.rm=TRUE))))

Or as suggested by @G. Grothendieck simply:

lapply(lst, sapply, function(x) 
data.frame(Mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
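If you want the layout described in the question (one row per statistic, one column per variable), a minimal sketch could look like the following. Here `summarise_df` is a hypothetical helper name, not a function from any package, and the sketch assumes every column is numeric.

```r
# Hypothetical helper: one row per statistic, one column per variable.
# Assumes every column of `d` is numeric.
summarise_df <- function(d) {
  stats <- sapply(d, function(col) {
    c(mean   = mean(col, na.rm = TRUE),
      sd     = sd(col, na.rm = TRUE),
      median = median(col, na.rm = TRUE),
      min    = min(col, na.rm = TRUE),
      max    = max(col, na.rm = TRUE))
  })
  # sapply returns a matrix (statistics x variables); convert it to a data frame
  as.data.frame(stats)
}

# Built-in data sets stand in for the real input data frames
lst <- list(iris = iris[, -5], mtcars = mtcars, airquality = airquality)

# One summary data frame per input, kept together in a named list
summaries <- lapply(lst, summarise_df)
summaries$mtcars["mean", "mpg"]
```

Keeping the summaries in a named list avoids having to create separate df_1, df_2, ... objects with assign; each one is available as `summaries[[i]]` or by name.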

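If all your files are in the same directory, set it as the working directory and use list.files() to get the file names to read. If the data frames are already loaded as mydata_1, mydata_2, ..., you can instead collect them into a list with something like mget(ls(pattern = "^mydata_")).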
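As a rough sketch of the file-based route: the example below first writes two small .csv files to a temporary folder so that it runs on its own. The folder name, the file names and the `^mydata_` pattern are assumptions made for illustration; replace them with your own.

```r
# Create two small example files so the sketch runs on its own.
# In practice `data_dir` would be the folder holding your real .csv files.
data_dir <- file.path(tempdir(), "summary_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(mtcars[, 1:3], file.path(data_dir, "mydata_1.csv"), row.names = FALSE)
write.csv(mtcars[, 1:3] * 2, file.path(data_dir, "mydata_2.csv"), row.names = FALSE)

# Find the files and read each one into a list element
files <- list.files(data_dir, pattern = "^mydata_.*\\.csv$", full.names = TRUE)
lst <- lapply(files, read.csv)
names(lst) <- basename(files)

# Column-wise mean and sd for every file; each result is a
# numeric matrix with rows "Mean" and "sd"
result <- lapply(lst, sapply, function(x)
  c(Mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
```

Note that read.csv returns a plain data.frame, whereas readr::read_csv returns a tibble.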

If they share the same column names, you can rbind the result into a single data set.
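One possible way to do that is sketched below, using two slices of mtcars as stand-ins for the input files. The names `run_1`, `run_2`, `source` and `statistic` are made up for the example.

```r
# Two stand-in inputs that share the same column names
lst <- list(run_1 = mtcars[1:16, 1:3], run_2 = mtcars[17:32, 1:3])

# One small data frame per input, with columns identifying source and statistic
per_file <- lapply(names(lst), function(nm) {
  d <- lst[[nm]]
  stats <- rbind(mean = sapply(d, mean, na.rm = TRUE),
                 sd   = sapply(d, sd, na.rm = TRUE))
  data.frame(source = nm, statistic = rownames(stats), stats, row.names = NULL)
})

# Stack everything into a single data set
combined <- do.call(rbind, per_file)
rownames(combined) <- NULL
combined
```

The extra `source` column keeps track of which input each row came from once the results are stacked.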

NelsonGon
  • Thank you for the answer! I'll try it as soon as I can! – GioP Jul 06 '19 at 12:20
  • Hi @NelsonGon! Actually this answer does not work because mean and sd are working on each dataframe contained in the list lst but not acting on each column of that dataframe. So the output is like [[1]] [,1] Mean NA sd NA with warning “argument is not numeric or logical: returning NA”. – GioP Jul 08 '19 at 06:50
  • Could you explain why? There was no sample data so not sure what is wrong. Use `dput` to provide data. Also add what code you ran to your question. – NelsonGon Jul 08 '19 at 06:51
  • Yes sorry, I’ve updated the comment and the question itself with a sample of the input dataframes. – GioP Jul 08 '19 at 06:54
  • Mean and sd cannot be used for non numeric data. Filter that out first. – NelsonGon Jul 08 '19 at 06:55
  • Actually it’s not a matter of non numeric data. If I apply the mean and sd to each of those dataframes separately, without using any for loop, I can easily get what I need, as shown in the initial code example. Also, I can’t show you the dput of the data. – GioP Jul 08 '19 at 07:03
  • If you can't show the data, I'm afraid there is no way for me to know what's wrong since the code given in the solution above works with an inbuilt data set that was based on what you described. I also don't get why you insist on using a for loop. There is almost never a need for that in R. I have no idea what code you ran and on what data. Did you run the code in this answer and ascertain that: **Actually this answer does not work because mean and sd are working on each dataframe contained in the list lst but not acting on each column of that dataframe** – NelsonGon Jul 08 '19 at 07:05
  • Yes, sorry. It’s not that I want to use a for loop at any cost, I’m just trying to get the right result. I know that your answer works fine with the inbuilt dataframes. Maybe the problem is in the class of the object I’m using. The inbuilt dataframe returns a class [1] “data.frame” while the one I’m using is different [1] “tbl_df” “tbl” “data.frame”. I’ll try to post the dput this evening, I’m on my phone right now. – GioP Jul 08 '19 at 07:14
  • Mention me when you add the dput, I can not really know without looking at the data. – NelsonGon Jul 08 '19 at 07:15
  • Hi @NelsonGon! Actually it worked just fine after I used ‘as.data.frame’ on my input files! – GioP Jul 08 '19 at 09:45
  • Glad to know it helped. – NelsonGon Jul 08 '19 at 12:03