I have a list of files in CSV format, for example:
20150507.csv
a,10
b,20
c,30
20150506.csv
a,100
b,20
c,1
and so on. I have a text file containing variable names:
list.txt
a
b
c
d
I need to import the data in such a way that I have values for the variables in the list.txt file:
a: 10, 100,...
b: 20, 20, ...
c: 30, 1, ...
Values of the elements (from list.txt) are looked up across all CSV files whose names match the date pattern; after collecting all the values, I need to compute the mean and SD of each variable and mark observations that are more than 2 SD away from the mean as outliers.
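The 2-SD rule above can be sketched in a few lines of R; the numbers here are made-up stand-ins for one variable's collected values, not data from the actual files:

```r
# Toy values standing in for one variable's collected numbers (hypothetical)
vals <- c(10, 100, 12, 11, 9, 10, 11)
m <- mean(vals)
s <- sd(vals)
# An observation is an outlier if it lies more than 2 SD from the mean
outliers <- vals[abs(vals - m) > 2 * s]
outliers
```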
Currently I am using bash commands to create statistics for each element of list.txt and then load the data into R:
for i in $(cut -d, -f1 list.txt); do echo "$i"; grep "^$i," 2015* | cut -d: -f2 > "/tmp/$i.stat"; done
and then use a for loop in R to find the outliers:
myfunction <- function(filename) {
  df <- read.csv(filename, header = FALSE)
  names(df) <- c("symbol", "num")
  x <- df[abs(df$num - mean(df$num)) > 2 * sd(df$num), ]
  outl <- (nrow(x) / nrow(df)) * 100
  if (outl > 1) {
    cat(filename, "\n")
    cat("outliers=", outl, "\n\n")
  }
}

files <- list.files(path = "/tmp/", pattern = "\\.stat$")
for (i in seq_along(files)) {
  myfunction(paste("/tmp/", files[i], sep = ""))
}
However, I would like to do the whole process in R only, rather than creating multiple files through bash and then reading them with a for loop. I read the grep manual for R, but an option for searching across multiple files does not show up there.
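For reference, the whole pipeline can stay inside R. The sketch below writes the toy files from the question into a temporary directory so it is self-contained and runnable; in practice `dir` would simply be the real data directory:

```r
# Sketch of the whole pipeline in R only (no bash, no /tmp/*.stat files).
# First recreate the question's toy files in a temporary directory.
dir <- tempdir()
writeLines(c("a,100", "b,20", "c,1"),  file.path(dir, "20150506.csv"))
writeLines(c("a,10",  "b,20", "c,30"), file.path(dir, "20150507.csv"))
writeLines(c("a", "b", "c", "d"),      file.path(dir, "list.txt"))

# Read every date-named file into one data frame
files <- list.files(path = dir, pattern = "^2015.*\\.csv$", full.names = TRUE)
dat <- do.call(rbind, lapply(files, read.csv, header = FALSE,
                             col.names = c("symbol", "num")))

# Keep only the symbols listed in list.txt
wanted <- readLines(file.path(dir, "list.txt"))
dat <- dat[dat$symbol %in% wanted, ]

# Flag per-symbol outliers (> 2 SD from that symbol's mean); with only
# two observations per symbol in the toy data, nothing gets flagged
flagged <- do.call(rbind, lapply(split(dat, dat$symbol), function(d) {
  d$outlier <- abs(d$num - mean(d$num)) > 2 * sd(d$num)
  d
}))
flagged[flagged$outlier, ]
```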
Edit:
Instead of using a for loop, I have now used:
df <- do.call(rbind, lapply(list.files(path="/tmp/",pattern = "*.stat"), read.csv,header=FALSE))
which looks like a better way.
However,
df2 <- do.call(rbind, lapply(list.files(path="/orignal/dir/", pattern = "2015*.abc.csv"), read.csv, header = FALSE))
does not make R understand that I am searching for all files whose names start with 2015 and end with abc.csv.
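One likely cause, going by the ?list.files documentation: the pattern argument is interpreted as a regular expression, not a shell glob, so "2015*.abc.csv" means "201 followed by zero or more 5s", not "anything between 2015 and .abc.csv". A small sketch of the fix (the directory path is the one from the question):

```r
# list.files() treats pattern as a regular expression, not a shell glob.
# Convert the glob with glob2rx(), or write the regex by hand, e.g.
# something along the lines of "^2015.*\\.abc\\.csv$".
pat <- glob2rx("2015*.abc.csv")
files <- list.files(path = "/orignal/dir/", pattern = pat)
```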