
I am trying to use each row in a data frame as the inputs to a function that processes some data and then writes the output to a csv file, as in the following example:

myfunction <- function(X, Y, Z){
  data <- read.csv("mydata.csv")
  subsetedData <- subset(data, x == X & y == Y & z == Z, select = x:z)
  write.csv(subsetedData, file = "mycsvfile.csv")
}

apply(myXYZdata, MARGIN = 1, function(x1, x2, x3) myfunction(X, Y, Z))

I want to subset based on every row in the data frame myXYZdata. However, this does not appear to work, or I am not fully understanding the correct usage of apply.

I know this can be done using a loop but would prefer not to do it that way.

Edit:

The purpose of this is that I have a large data file which I want to subset based on combinations of variables found in my data frame "myXYZdata" and store the results in new data files.

The large data file I want to subset has the following format:

  date                    x y z count
1 2015-08-20 00:00:00.000 a d h    56
2 2015-08-26 00:00:00.000 b e h     4
3 2015-08-18 00:00:00.000 b f i     8
4 2015-09-03 00:00:00.000 c e l    32
5 2015-08-12 00:00:00.000 a g l     3
Stephen Saidani
  • possible duplicate of [R - how to call apply-like function on each row of dataframe with multiple arguments from each row of the df](http://stackoverflow.com/questions/15059076/r-how-to-call-apply-like-function-on-each-row-of-dataframe-with-multiple-argum) – Dhawal Kapil Sep 09 '15 at 08:47
  • `apply(myXYZdata, MARGIN = 1, function(x) myfunction(x$X, x$Y, x$Z))` should do. – Tensibai Sep 09 '15 at 08:52
  • Surely this is an [XY problem (What is an XY problem?)](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Sounds like you're trying to index into a very large DB file, with logical indexing, using a compound index/multiindex. Or you could use a hash. Can you restate your question with more context? What size are these items? – smci Sep 09 '15 at 09:35
  • Yes, you are probably correct; this does look like an XY problem. Sorry about that. I have edited my original post to include what I am trying to do. – Stephen Saidani Sep 09 '15 at 14:47
  • *"I want to subset a large (Gb?) data file based on combinations of variables found in my data and store the results in new data files."* **This really smells like multiindexing into a huge disk-backed DB file to get views. Look into HDFS, R bigmemory vs ff etc.** – smci Sep 10 '15 at 05:35
  • Thanks. No, it's not gigabytes in size. The reason I am doing this is the number of possible views. If it were large enough I would have used HDFS. Hadoop/HDFS is the solution I will probably use if I take a larger date range, but I have found a solution which works for now. Thanks for the help. – Stephen Saidani Sep 10 '15 at 14:37

2 Answers


I believe it's easier to pass a whole row as the argument to your function.

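myfunction <- function(row){
  # row is one row of myXYZdata, passed by apply as a vector (X, Y, Z)
  data <- read.csv("mydata.csv")
  # note ==, not =, for the comparisons
  subsetedData <- subset(data, x == row[1] & y == row[2] & z == row[3], select = x:z)
  write.csv(subsetedData, file = "mycsvfile.csv")
}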

apply(myXYZdata[,c("X","Y","Z")], MARGIN = 1, myfunction)
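
To make the row-wise call concrete, here is a minimal in-memory sketch. It is not the asker's actual setup: `big` is a hypothetical stand-in for mydata.csv built from the sample rows in the question, the two combinations in `myXYZdata` are made up, and `subset_row` is an illustrative helper name. Keep in mind that apply first coerces the data frame to a matrix, so each row arrives as a named character vector.

```r
# Hypothetical stand-in for the large file (sample rows from the question).
big <- data.frame(
  x = c("a", "b", "b", "c", "a"),
  y = c("d", "e", "f", "e", "g"),
  z = c("h", "h", "i", "l", "l"),
  count = c(56, 4, 8, 32, 3),
  stringsAsFactors = FALSE
)

# Hypothetical table of combinations to look up.
myXYZdata <- data.frame(
  X = c("a", "b"),
  Y = c("d", "f"),
  Z = c("h", "i"),
  stringsAsFactors = FALSE
)

# Takes one row of myXYZdata (a named character vector) plus the data
# to filter, and returns the matching rows.
subset_row <- function(row, data) {
  data[data$x == row[["X"]] & data$y == row[["Y"]] & data$z == row[["Z"]], ]
}

# Extra arguments after FUN (here data = big) are passed on to subset_row.
results <- apply(myXYZdata[, c("X", "Y", "Z")], MARGIN = 1, subset_row, data = big)
```

Because each call returns a data frame, `results` is a list with one data frame per row of `myXYZdata`. Reading the file once outside the function, rather than on every call as above, also avoids re-parsing it for each combination.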
Wannes Rosiers

What about using mapply (multi-variable apply):

mapply(myfunction, myXYZdata$X, myXYZdata$Y, myXYZdata$Z, fnms)

You will need to create a vector of file names (fnms) so that each entry is written to a different file and then change myfunction so that it takes an argument for the file name.
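
A rough sketch of what that could look like, under some assumptions: `big` is a hypothetical in-memory stand-in for mydata.csv, the file-naming scheme and the `write_subset` helper are made up for illustration, and files go to a temporary directory so nothing real is overwritten.

```r
# Hypothetical stand-in for the large file (sample rows from the question).
big <- data.frame(
  x = c("a", "b", "b", "c", "a"),
  y = c("d", "e", "f", "e", "g"),
  z = c("h", "h", "i", "l", "l"),
  count = c(56, 4, 8, 32, 3),
  stringsAsFactors = FALSE
)

# Hypothetical table of combinations to look up.
myXYZdata <- data.frame(
  X = c("a", "b"),
  Y = c("d", "f"),
  Z = c("h", "i"),
  stringsAsFactors = FALSE
)

# One output file per combination (the naming scheme is an assumption).
fnms <- file.path(
  tempdir(),
  paste0("subset_", myXYZdata$X, "_", myXYZdata$Y, "_", myXYZdata$Z, ".csv")
)

# Writes the rows matching one (X, Y, Z) combination to `file`
# and returns how many rows were written.
write_subset <- function(X, Y, Z, file, data) {
  hits <- data[data$x == X & data$y == Y & data$z == Z, ]
  write.csv(hits, file = file, row.names = FALSE)
  nrow(hits)
}

# mapply walks the four vectors in parallel; MoreArgs holds arguments
# that stay the same for every call.
n_written <- mapply(
  write_subset,
  myXYZdata$X, myXYZdata$Y, myXYZdata$Z, fnms,
  MoreArgs = list(data = big)
)
```

Unlike apply, mapply hands each column over with its own type, so numeric or date columns in `myXYZdata` would not be coerced to character.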

Alternatively, append each result to a single file from within myfunction. Note that write.csv ignores the append argument (with a warning), so use write.table(..., sep = ",", append = TRUE) instead, and write the column names only on the first call. Be aware that successive runs of the code will then not overwrite the file; you could precede the writes with if(file.exists("mycsvfile.csv")) file.remove("mycsvfile.csv").
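
A sketch of the single-file variant, written with write.table because that function honours `append`. The data, the `append_subset` helper and the temporary output path are all assumptions made for illustration.

```r
# Hypothetical stand-in for the large file (sample rows from the question).
big <- data.frame(
  x = c("a", "b", "b", "c", "a"),
  y = c("d", "e", "f", "e", "g"),
  z = c("h", "h", "i", "l", "l"),
  count = c(56, 4, 8, 32, 3),
  stringsAsFactors = FALSE
)

# Hypothetical table of combinations to look up.
myXYZdata <- data.frame(
  X = c("a", "b"),
  Y = c("d", "f"),
  Z = c("h", "i"),
  stringsAsFactors = FALSE
)

# Temporary output path; remove leftovers from an earlier run first.
out_file <- file.path(tempdir(), "mycsvfile.csv")
if (file.exists(out_file)) file.remove(out_file)

# Appends the rows matching one combination to `file`. The header is
# written only when the file does not exist yet.
append_subset <- function(X, Y, Z, data, file) {
  hits <- data[data$x == X & data$y == Y & data$z == Z, ]
  first <- !file.exists(file)
  write.table(hits, file = file, sep = ",", qmethod = "double",
              row.names = FALSE, col.names = first, append = !first)
  invisible(nrow(hits))
}

invisible(mapply(
  append_subset,
  myXYZdata$X, myXYZdata$Y, myXYZdata$Z,
  MoreArgs = list(data = big, file = out_file)
))

# Read the combined file back to check the result.
combined <- read.csv(out_file)
```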

CJB