
I have a list of files in CSV format, for example:

20150507.csv

a,10 
b,20 
c,30 

20150506.csv

a,100 
b,20 
c,1 

and so on. I have a text file containing variable names:

list.txt

a 
b 
c 
d 

I need to import the data in such a way that I have values for the variables in the list.txt file:

a: 10, 100,... 
b: 20, 20, ... 
c: 30, 1, ... 

For each element in list.txt, I look up its value in every CSV file whose name matches the date pattern. Once all the values are collected, I need to compute the mean and SD of each variable and mark observations that are more than 2 SD away from the mean as outliers.
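To make the outlier rule concrete, here is a minimal sketch in base R. It assumes the values are already in one data frame with columns `symbol` and `num`; the numbers are made up for illustration and are not taken from the real files.

```r
# Hypothetical values for illustration only.
obs <- data.frame(
  symbol = rep(c("a", "b"), each = 10),
  num    = c(10, 11, 9, 10, 12, 10, 11, 9, 10, 100,  # "a": 100 is extreme
             20, 21, 19, 20, 21, 20, 21, 19, 20, 20) # "b": nothing extreme
)

# Per-symbol mean and SD, repeated on every row of that symbol.
obs$mean <- ave(obs$num, obs$symbol, FUN = mean)
obs$sd   <- ave(obs$num, obs$symbol, FUN = sd)

# Flag rows more than 2 SD away from their symbol's mean.
obs$outlier <- abs(obs$num - obs$mean) > 2 * obs$sd

print(obs[obs$outlier, ])
```

Note that with only a handful of observations per symbol, a single extreme value inflates the SD enough that the 2 SD rule may not flag it, so the rule is more useful once many dated files have been collected.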

Currently I use bash commands to create a .stat file for each element of list.txt, and then load the data in R.

for i in $(cut -d, -f1 list.txt); do echo "$i"; grep "^$i," 2015* | cut -d: -f2 > "/tmp/$i.stat"; done

and then use a for loop in R to find outliers:

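# Report the share of outliers in one .stat file.
myfunction <- function(filename) {
    df <- read.csv(filename, header = FALSE)
    names(df) <- c("symbol", "num")

    # Rows more than 2 SD away from the mean are treated as outliers.
    x <- df[abs(df$num - mean(df$num)) > 2 * sd(df$num), ]
    outl <- (nrow(x) / nrow(df)) * 100
    if (outl > 1) {
        cat(filename, "\n")
        cat("outliers=", outl)
        cat("\n\n")
    }
}

# full.names = TRUE returns complete paths, so no paste() is needed.
files <- list.files(path = "/tmp/", pattern = "\\.stat$", full.names = TRUE)

for (f in files) {
    myfunction(f)
}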

However, I would like to do the whole process in R, rather than creating multiple files through bash and then reading them in a for loop.
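One possible way to skip the intermediate .stat files is to read every dated CSV directly and group the values by symbol in memory. The sketch below is only an illustration under assumptions: it writes two sample files into a temporary directory (the directory name `csv_demo` and the column names `symbol` and `num` are hypothetical), and it assumes the files are named with eight digits followed by `.csv`.

```r
# Build a throwaway directory with sample files mirroring the examples above.
dir <- file.path(tempdir(), "csv_demo")
dir.create(dir, showWarnings = FALSE)
writeLines(c("a,10", "b,20", "c,30"), file.path(dir, "20150507.csv"))
writeLines(c("a,100", "b,20", "c,1"), file.path(dir, "20150506.csv"))
writeLines(c("a", "b", "c", "d"), file.path(dir, "list.txt"))

# Symbols of interest, one per line.
symbols <- trimws(readLines(file.path(dir, "list.txt")))

# list.files() takes a regular expression; full.names = TRUE returns
# paths that read.csv() can open regardless of the working directory.
files <- list.files(dir, pattern = "^2015[0-9]{4}\\.csv$", full.names = TRUE)

# Read every file, tag each row with its source file, and stack the rows.
all_rows <- do.call(rbind, lapply(files, function(f) {
  d <- read.csv(f, header = FALSE, col.names = c("symbol", "num"),
                stringsAsFactors = FALSE)
  d$file <- basename(f)
  d
}))

# Keep only the listed symbols, then collect the values per symbol.
all_rows <- all_rows[all_rows$symbol %in% symbols, ]
values <- split(all_rows$num, all_rows$symbol)
print(values)
```

A symbol that appears in list.txt but in none of the files (here `d`) simply has no entry in `values`. Keeping the `file` column makes it possible to trace a flagged value back to the date it came from.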

I read this grep manual for R, but it does not show any option for searching across multiple files.

Edit:

Instead of using for, I have now used:

df <- do.call(rbind, lapply(list.files(path = "/tmp/", pattern = "\\.stat$", full.names = TRUE), read.csv, header = FALSE))

which looks like a better way.

However, the following does not make R understand that I am searching for all files whose names start with 2015 and end with abc.csv:

df2 <- do.call(rbind, lapply(list.files(path="/orignal/dir/",pattern = "2015*.abc.csv"), read.csv,header=FALSE))
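A likely cause is that the `pattern` argument of `list.files()` is a regular expression, not a shell glob, so `*` and `.` do not mean what they mean in bash. The sketch below uses `grepl()` on a hypothetical vector of file names to show two ways of expressing the intended match; the same pattern strings could be passed to `list.files(pattern = ...)`.

```r
# Hypothetical file names for illustration.
files <- c("20150506.abc.csv", "20150507.abc.csv",
           "20150507.xyz.csv", "20140101.abc.csv", "notes.txt")

# As a regex, "2015*" means "201" followed by zero or more "5"s, "." means
# any single character, and nothing is anchored to the start or end of the
# name. That is different from the shell glob 2015*.abc.csv.

# Option 1: write the regular expression directly.
rx <- "^2015.*\\.abc\\.csv$"
print(files[grepl(rx, files)])

# Option 2: let utils::glob2rx() translate the shell-style wildcard.
rx2 <- glob2rx("2015*.abc.csv")
print(files[grepl(rx2, files)])
```

When reading the matched files afterwards, `full.names = TRUE` in `list.files()` is usually needed as well, since otherwise only bare file names are returned and `read.csv()` looks for them in the working directory.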

pythonRcpp
  • As you have managed to import all CSVs into 1 dataframe, your problem sounds like a duplicate of this post: [Collapse text by group in data frame](http://stackoverflow.com/questions/22756372/collapse-text-by-group-in-data-frame) – zx8754 May 07 '15 at 08:10
  • @zx8754, but the problem of creating multiple stat files still remains. I have added more to the question. @fgnu I added more details; I hope that makes it clearer. Also, I couldn't find what is wrong with the function, could you please give a hint? – pythonRcpp May 07 '15 at 08:15
  • There is no need for a hint -- that is not how you define a function in R. – tchakravarty May 07 '15 at 08:16
  • df2 <- do.call(rbind, lapply(list.files(path="/orignal/dir/",pattern = "2015*.abc.csv"), read.csv,header=FALSE)) does not interpret the pattern the way bash does; it searches all files instead. @fgnu yes, it is syntactically wrong, I'll correct it – pythonRcpp May 07 '15 at 08:21
  • Is the number and order of columns variable or fixed? – Roman Luštrik May 07 '15 at 10:43
  • @RomanLuštrik the number of columns is the same in all date.abc.csv files (it's 2); the number of rows may vary. My current problem is that I am unable to specify particular files to list.files – pythonRcpp May 07 '15 at 11:25
  • I think it would be best if you broke your question into two or three separate, specific questions. Also, please provide a reproducible example (it's easy to generate the data/files using built-in functions). – Roman Luštrik May 07 '15 at 11:50
