205

I have a dataframe, and for each row in that dataframe I have to do some complicated lookups and append some data to a file.

The dataFrame contains scientific results for selected wells from 96-well plates used in biological research, so I want to do something like:

for (well in dataFrame) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}

In my procedural world, I'd do something like:

for (row in dataFrame) {
    #look up stuff using data from the row
    #write stuff to the file
}

What is the "R way" to do this?

jogo
Carl Coryell-Martin
  • What is your question here? A data.frame is a two-dimensional object and looping over the rows is a perfectly normal way of doing things as rows are commonly sets of 'observations' of the 'variables' in each column. – Dirk Eddelbuettel Nov 09 '09 at 04:29
  • 21
    what I end up doing is: for (index in 1:nrow(dataFrame)) { row = dataFrame[index, ]; # do stuff with the row } which never seemed very pretty to me. – Carl Coryell-Martin Nov 09 '09 at 05:33
  • 1
    Does getWellID call a database or anything? Otherwise, Jonathan is probably right and you could vectorize this. – Shane Nov 09 '09 at 14:44

9 Answers

129

You can use the by() function:

by(dataFrame, seq_len(nrow(dataFrame)), function(row) dostuff)

But iterating over the rows directly like this is rarely what you want to do; you should try to vectorize instead. Can I ask what the actual work in the loop is doing?
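As a rough sketch of what the `function(row)` placeholder might look like when filled in: each call receives a one-row data frame, so columns can be read by name with `$`. The data frame contents below are made up for illustration; only the column names come from the question.

```r
# Hypothetical data frame using the column names from the question
dataFrame <- data.frame(
  name   = c("H1", "H2"),
  plate  = c("plate67", "plate68"),
  value1 = c(1.5, 2.5),
  value2 = c(10, 20),
  stringsAsFactors = FALSE
)

# by() calls the function once per group; grouping on the row index
# means each call sees a one-row data frame
lines <- by(dataFrame, seq_len(nrow(dataFrame)), function(row) {
  paste(row$name, row$plate, row$value1, row$value2, sep = ",")
})

# Drop the "by" class and dimension attributes to get a plain character vector
result <- as.vector(lines)
print(result)
```

Because the rows are accessed by column name, this keeps working if the columns are reordered.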

Ken Williams
Jonathan Chang
  • 6
    this will not work well if the data frame has 0 rows because `1:0` is not empty – sds Apr 21 '13 at 17:08
  • 11
    Easy fix for the 0 row case is to use [seq_len()](http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html), insert `seq_len(nrow(dataFrame))` in place of `1:nrow(dataFrame)`. – Jim Jun 10 '14 at 16:42
  • 16
    How do you actually implement (row)? Is it dataframe$column? dataframe[somevariableNamehere]? How do you actually say its a row. The pseudocode "function(row) dostuff" how would that actually look? – uh_big_mike_boi Apr 07 '16 at 11:00
  • 2
    @Mike, change `dostuff` in this answer to `str(row)` You'll see multiple lines printed in the console beginning with _" 'data.frame': 1 obs of x variables."_ But be careful, changing `dostuff` to `row` does not return a data.frame object for the outer function as a whole. Instead it returns a list of one row data-frames. – pwilcox May 01 '17 at 15:22
  • 1
    Not everything should be vectorized. But in this case it would make sense I guess. – stephanmg Sep 11 '20 at 08:50
  • I fixed the issue noted by `sds` and `Jim` with an edit. – Ken Williams Dec 14 '20 at 23:40
109

You can try this, using the apply() function:

> d
  name plate value1 value2
1    A    P1      1    100
2    B    P2      2    200
3    C    P3      3    300

> f <- function(x, output) {
 wellName <- x[1]
 plateName <- x[2]
 wellID <- 1
 print(paste(wellID, x[3], x[4], sep=","))
 cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

> apply(d, 1, f, output = 'outputfile')
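One thing to keep in mind with apply() is that the data frame is first converted to a matrix, so with mixed column types each row arrives as a named character vector. The sketch below, on made-up data, indexes by column name rather than position and converts the numeric fields back before doing arithmetic on them.

```r
# Hypothetical data frame shaped like the example above
d <- data.frame(
  name   = c("A", "B"),
  plate  = c("P1", "P2"),
  value1 = 1:2,
  value2 = c(100, 200),
  stringsAsFactors = FALSE
)

out <- apply(d, 1, function(x) {
  # x is a character vector here, so numeric fields need as.numeric()
  total <- as.numeric(x[["value1"]]) + as.numeric(x[["value2"]])
  paste(x[["name"]], total, sep = ",")
})

print(out)
```

Indexing by name avoids breakage when the column order changes, but the type conversion is still something the caller has to handle.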
Uli Köhler
knguyen
  • 81
    Be careful, as the dataframe is converted to a matrix, and what you end up with (`x`) is a vector. This is why the above example has to use numeric indexes; the by() approach gives you a data.frame, which makes your code more robust. – Darren Cook Dec 19 '11 at 05:20
  • 1
    did not work for me. The apply function treated every x given to f as a character value and not a row. – Zahy Aug 10 '14 at 07:36
  • 4
    Note too that you can refer to the columns by name. So: `wellName <- x[1]` could also be `wellName <- x["name"]`. – founddrama Sep 03 '14 at 11:02
  • 1
    When Darren mentioned robust, he meant something like shifting the orders of the columns. This answer would not work whereas the one with by() would still work. – ABCD Jan 04 '16 at 06:26
108

First, Jonathan's point about vectorizing is correct. If your getWellID() function is vectorized, then you can skip the loop and just use cat or write.csv:

write.csv(data.frame(wellid=getWellID(dataFrame$name, dataFrame$plate),
         value1=dataFrame$value1, value2=dataFrame$value2), file=outputFile)

If getWellID() isn't vectorized, then Jonathan's recommendation of using by or knguyen's suggestion of apply should work.
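Another option for a non-vectorized getWellID() is to wrap the call in mapply(), which invokes the function once per pair of elements and collects the results, so the single write.csv() call can stay. In this sketch the getWellID() body and the data are hypothetical stand-ins; the real lookup would go in its place.

```r
# Hypothetical scalar-only lookup standing in for the real getWellID()
getWellID <- function(wellName, plateName) {
  stopifnot(length(wellName) == 1, length(plateName) == 1)
  paste(plateName, wellName, sep = "_")
}

# Made-up data using the column names from the question
dataFrame <- data.frame(
  name   = c("H1", "A2"),
  plate  = c("plate67", "plate68"),
  value1 = c(1, 2),
  value2 = c(10, 20),
  stringsAsFactors = FALSE
)

# Call getWellID() once per (name, plate) pair
wellIDs <- mapply(getWellID, dataFrame$name, dataFrame$plate, USE.NAMES = FALSE)

result <- data.frame(wellid = wellIDs,
                     value1 = dataFrame$value1,
                     value2 = dataFrame$value2)

# Write everything in one call; a temporary file is used for the sketch
outputFile <- tempfile(fileext = ".csv")
write.csv(result, file = outputFile, row.names = FALSE)
```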

Otherwise, if you really want to use for, you can do something like this:

for(i in seq_len(nrow(dataFrame))) {
    row <- dataFrame[i,]
    # do stuff with row
}

You can also try to use the foreach package, although it requires you to become familiar with that syntax. Here's a simple example:

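library(foreach)
library(iterators)  # provides iter()
d <- data.frame(x=1:10, y=rnorm(10))
# %dopar% needs a registered parallel backend; without one it runs sequentially with a warning
s <- foreach(d=iter(d, by='row'), .combine=rbind) %dopar% d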

A final option is to use a function from the plyr package, whose calling convention is very similar to that of the apply function.

library(plyr)
ddply(dataFrame, .(x), function(x) {
  # do stuff
})
Shane
  • Shane, thank you. I'm not sure how to write a vectorized getWellID. What I need to do right now is to dig into an existing list of lists to look it up or pull it out of a database. – Carl Coryell-Martin Nov 09 '09 at 23:45
  • Feel free to post the getWellID question (i.e. can this function be vectorized?) separately, and I'm sure I (or someone else) will answer it. – Shane Nov 10 '09 at 01:30
  • 2
    Even if getWellID is not vectorized, I think you should go with this solution, and replace getWellId with `mapply(getWellId, well$name, well$plate)`. – Jonathan Chang Nov 10 '09 at 02:28
  • Even if you pull it from a database, you can pull them all at once and then filter the result in R; that will be faster than an iterative function. – Shane Nov 10 '09 at 03:13
  • +1 for `foreach` - I'm going to use the hell out of that one. – Josh Bode Jan 24 '13 at 06:52
30

I think the best way to do this with base R is:

for( i in rownames(df) )
   print(df[i, "column1"])

The advantage over the for( i in 1:nrow(df))-approach is that you do not get into trouble if df is empty and nrow(df)=0.
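A small sketch of the empty-data-frame problem, using a made-up zero-row data frame: `1:nrow(df)` counts down from 1 to 0, so the loop body would run twice, whereas `seq_len(nrow(df))` yields an empty sequence, just as `rownames(df)` does.

```r
# Hypothetical data frame with zero rows
empty <- data.frame(column1 = numeric(0))

print(1:nrow(empty))         # a two-element vector: 1, 0
print(seq_len(nrow(empty)))  # an empty integer vector
print(rownames(empty))       # an empty character vector

# Count how many times the loop body runs with seq_len()
count <- 0
for (i in seq_len(nrow(empty))) {
  count <- count + 1
}
print(count)
```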

Capt.Krusty
Funkwecker
21

I use this simple utility function:

rows = function(tab) lapply(
  seq_len(nrow(tab)),
  function(i) unclass(tab[i,,drop=F])
)

Or a faster, less clear form:

rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,"[",i))

This function just splits a data.frame into a list of rows. You can then run an ordinary "for" loop over this list:

tab = data.frame(x = 1:3, y=2:4, z=3:5)
for (A in rows(tab)) {
    print(A$x + A$y * A$z)
}        

Your code from the question will work with a minimal modification:

for (well in rows(dataFrame)) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile, append=TRUE, fill=TRUE)
}
11

I was curious about the time performance of the non-vectorised options. For this purpose, I used the function f defined by knguyen:

f <- function(x, output) {
  wellName <- x[1]
  plateName <- x[2]
  wellID <- 1
  print(paste(wellID, x[3], x[4], sep=","))
  cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

and a dataframe like the one in his example:

n = 100; #number of rows for the data frame
d <- data.frame( name = LETTERS[ sample.int( 25, n, replace=T ) ],
                  plate = paste0( "P", 1:n ),
                  value1 = 1:n,
                  value2 = (1:n)*10 )

I included two vectorised functions (for sure quicker than the others) in order to compare the cat() approach with a write.table() one...

library("ggplot2")
library( "microbenchmark" )
library( foreach )
library( iterators )

tm <- microbenchmark(S1 =
                       apply(d, 1, f, output = 'outputfile1'),
                     S2 = 
                       for(i in 1:nrow(d)) {
                         row <- d[i,]
                         # do stuff with row
                         f(row, 'outputfile2')
                       },
                     S3 = 
                       foreach(d1=iter(d, by='row'), .combine=rbind) %dopar% f(d1,"outputfile3"),
                     S4= {
                       print( paste(wellID=rep(1,n), d[,3], d[,4], sep=",") )
                       cat( paste(wellID=rep(1,n), d[,3], d[,4], sep=","), file= 'outputfile4', sep='\n',append=T, fill = F)                           
                     },
                     S5 = {
                       print( (paste(wellID=rep(1,n), d[,3], d[,4], sep=",")) )
                       write.table(data.frame(rep(1,n), d[,3], d[,4]), file='outputfile5', row.names=F, col.names=F, sep=",", append=T )
                     },
                     times=100L)
autoplot(tm)

The resulting image shows that apply gives the best performance for a non-vectorised version, whereas write.table() seems to outperform cat().

[Image: ForEachRunningTime]

Ferran E
7

You can use the by_row function from the package purrrlyr for this:

myfn <- function(row) {
  #row is a tibble with one row, and the same 
  #number of columns as the original df
  #If you'd rather it be a list, you can use as.list(row)
}

purrrlyr::by_row(df, myfn)

By default, the returned value from myfn is put into a new list column in the df called .out.

If this is the only output you desire, you could write purrrlyr::by_row(df, myfn)$.out

RobinL
2

Well, since you asked for the R equivalent of what other languages do, I tried this. It seems to work, though I haven't really looked at which technique is more efficient in R.

> myDf <- head(iris)
> myDf
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nRowsDf <- nrow(myDf)
> for(i in 1:nRowsDf){
+ print(myDf[i,4])
+ }
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.4

For the categorical columns, though, indexing a single cell like this returns a factor, which you can convert using as.character() if needed.

-1

You can do something similar for a list object:

data("mtcars")
rownames(mtcars)
data <- list(mtcars ,mtcars, mtcars, mtcars);data

out1 <- NULL 
for(i in seq_along(data)) { 
  out1[[i]] <- data[[i]][rownames(data[[i]]) != "Volvo 142E", ] } 
out1

Or for a data frame:

data("mtcars")
df <- mtcars
out1 <- NULL 
for(i in 1:nrow(df)) {
  row <- rownames(df[i,])
  # do stuff with row
  out1 <- df[rownames(df) != "Volvo 142E",]
  
}
out1 
Seyma Kalay