For each row in an R dataframe

Question

I have a dataframe, and for each row in that dataframe I have to do some complicated lookups and append some data to a file.

The dataFrame contains scientific results for selected wells from 96 well plates used in biological research so I want to do something like:

for (well in dataFrame) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}

In my procedural world, I'd do something like:

for (row in dataFrame) {
    #look up stuff using data from the row
    #write stuff to the file
}

What is the "R way" to do this?

What is your question here? A data.frame is a two-dimensional object and looping over the rows is a perfectly normal way of doing things as rows are commonly sets of 'observations' of the 'variables' in each column. — Dirk Eddelbuettel, Nov 09 '09 at 04:29
what I end up doing is: for (index in 1:nrow(dataFrame)) { row = dataFrame[index, ]; # do stuff with the row } which never seemed very pretty to me. — Carl Coryell-Martin, Nov 09 '09 at 05:33
Does getWellID call a database or anything? Otherwise, Jonathan is probably right and you could vectorize this. — Shane, Nov 09 '09 at 14:44

score 129 · Answer 1 · edited Dec 14 '20 at 23:39


You can use the by() function:

by(dataFrame, seq_len(nrow(dataFrame)), function(row) dostuff)

But iterating over the rows directly like this is rarely what you want to do; you should try to vectorize instead. Can I ask what the actual work in the loop is doing?
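To make the `function(row) dostuff` placeholder concrete, here is a minimal sketch on a small data frame shaped like the one in the question. The data values and the `makeLabel` helper are made up for illustration. Each `row` that `by()` passes in is a one-row data.frame, so columns can be read by name.

```r
# Toy data (assumed for illustration, not from the question)
d <- data.frame(name = c("A", "B", "C"),
                plate = c("P1", "P2", "P3"),
                value1 = 1:3,
                value2 = c(100, 200, 300),
                stringsAsFactors = FALSE)

# Hypothetical per-row function: `row` is a one-row data.frame
makeLabel <- function(row) paste(row$plate, row$name, sep = "_")

res <- by(d, seq_len(nrow(d)), makeLabel)

# by() returns an object of class "by"; as.vector() drops the extra attributes
rowLabels <- as.vector(res)
print(rowLabels)
```

Because the function reads `row$plate` and `row$name` rather than positions, it should keep working if the column order changes.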

answered Nov 09 '09 at 05:54 by Jonathan Chang · edited Dec 14 '20 at 23:39 by Ken Williams

6

this will not work well if the data frame has 0 rows because `1:0` is not empty – sds Apr 21 '13 at 17:08
11

Easy fix for the 0 row case is to use [seq_len()](http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html), insert `seq_len(nrow(dataFrame))` in place of `1:nrow(dataFrame)`. – Jim Jun 10 '14 at 16:42
16

How do you actually implement (row)? Is it dataframe$column? dataframe[somevariableNamehere]? How do you actually say its a row. The pseudocode "function(row) dostuff" how would that actually look? – uh_big_mike_boi Apr 07 '16 at 11:00
2

@Mike, change `dostuff` in this answer to `str(row)` You'll see multiple lines printed in the console beginning with _" 'data.frame': 1 obs of x variables."_ But be careful, changing `dostuff` to `row` does not return a data.frame object for the outer function as a whole. Instead it returns a list of one row data-frames. – pwilcox May 01 '17 at 15:22
1

Not everything should be vectorized. But in this case it would make sense I guess. – stephanmg Sep 11 '20 at 08:50
I fixed the issue noted by `sds` and `Jim` with an edit. – Ken Williams Dec 14 '20 at 23:40

score 109 · Accepted Answer · edited Jan 15 '14 at 00:26


You can try this, using the apply() function:

> d
  name plate value1 value2
1    A    P1      1    100
2    B    P2      2    200
3    C    P3      3    300

> f <- function(x, output) {
 wellName <- x[1]
 plateName <- x[2]
 wellID <- 1
 print(paste(wellID, x[3], x[4], sep=","))
 cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

> apply(d, 1, f, output = 'outputfile')

answered Nov 09 '09 at 14:02 by knguyen · edited Jan 15 '14 at 00:26 by Uli Köhler

81

Be careful, as the dataframe is converted to a matrix, and what you end up with (`x`) is a vector. This is why the above example has to use numeric indexes; the by() approach gives you a data.frame, which makes your code more robust. – Darren Cook Dec 19 '11 at 05:20
1

did not work for me. The apply function treated every x given to f as a character value and not a row. – Zahy Aug 10 '14 at 07:36
4

Note too that you can refer to the columns by name. So: `wellName <- x[1]` could also be `wellName <- x["name"]`. – founddrama Sep 03 '14 at 11:02
1

When Darren mentioned robust, he meant something like shifting the orders of the columns. This answer would not work whereas the one with by() would still work. – ABCD Jan 04 '16 at 06:26
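The coercion described in these comments can be checked directly. The sketch below uses toy data (assumed, mirroring the answer's example) and compares the type a function sees for `value1` under `apply()` with the type it sees when the data frame is indexed row by row.

```r
# Toy data (assumed for illustration)
d <- data.frame(name = c("A", "B", "C"),
                plate = c("P1", "P2", "P3"),
                value1 = 1:3,
                value2 = c(100, 200, 300),
                stringsAsFactors = FALSE)

# apply() first converts d to a matrix; with mixed column types
# that is a character matrix, so every element arrives as a string
seenByApply <- apply(d, 1, function(x) class(x[["value1"]]))

# Indexing the data frame row by row keeps each column's own type
seenByIndex <- vapply(seq_len(nrow(d)),
                      function(i) class(d[i, ]$value1),
                      character(1))

print(seenByApply)
print(seenByIndex)
```

If numeric values are needed inside an `apply()` callback, they would have to be converted back, for example with `as.numeric()`.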

Shane · Answer 3 · 2009-11-09T14:58:07.233

108

First, Jonathan's point about vectorizing is correct. If your getWellID() function is vectorized, then you can skip the loop and just use cat or write.csv:

write.csv(data.frame(wellid=getWellID(well$name, well$plate), 
         value1=well$value1, value2=well$value2), file=outputFile)

If getWellID() isn't vectorized, then Jonathan's recommendation of using by or knguyen's suggestion of apply should work.

Otherwise, if you really want to use for, you can do something like this:

for(i in seq_len(nrow(dataFrame))) {
    row <- dataFrame[i,]
    # do stuff with row
}

You can also try to use the foreach package, although it requires you to become familiar with that syntax. Here's a simple example:

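library(foreach)
library(iterators)  # provides iter()
d <- data.frame(x=1:10, y=rnorm(10))
# without a registered parallel backend, %dopar% runs sequentially (with a warning)
s <- foreach(row=iter(d, by='row'), .combine=rbind) %dopar% row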

A final option is to use a function out of the plyr package, in which case the convention will be very similar to the apply function.

library(plyr)
ddply(dataFrame, .(x), function(x) {
    # do stuff
})

answered Nov 09 '09 at 14:04 by Shane · edited Nov 09 '09 at 14:58

Shane, thank you. I'm not sure how to write a vectorized getWellID. What I need to do right now is to dig into an existing list of lists to look it up or pull it out of a database. – Carl Coryell-Martin Nov 09 '09 at 23:45
Feel free to post the getWellID question (i.e. can this function be vectorized?) separately, and I'm sure I (or someone else) will answer it. – Shane Nov 10 '09 at 01:30
2

Even if getWellID is not vectorized, I think you should go with this solution, and replace getWellId with `mapply(getWellId, well$name, well$plate)`. – Jonathan Chang Nov 10 '09 at 02:28
Even if you pull it from a database, you can pull them all at once and then filter the result in R; that will be faster than an iterative function. – Shane Nov 10 '09 at 03:13
+1 for `foreach` - I'm going to use the hell out of that one. – Josh Bode Jan 24 '13 at 06:52
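As a rough sketch of the `mapply` suggestion from the comments: the `getWellID` below is a hypothetical stand-in that looks up one well at a time in a nested list (the list contents and IDs are invented), so it is not vectorized on its own. `mapply` calls it once per row and collects the results, after which the whole table can be written in one step.

```r
# Hypothetical lookup structure: plate name -> well name -> ID (invented values)
wellIndex <- list(plate67 = list(H1 = 101L, H2 = 102L),
                  plate68 = list(A1 = 201L))

# Hypothetical, non-vectorized lookup: expects one well and one plate
getWellID <- function(wellName, plateName) wellIndex[[plateName]][[wellName]]

# Toy data (assumed for illustration)
dataFrame <- data.frame(name = c("H1", "H2", "A1"),
                        plate = c("plate67", "plate67", "plate68"),
                        value1 = c(0.1, 0.2, 0.3),
                        value2 = c(10, 20, 30),
                        stringsAsFactors = FALSE)

# mapply() walks the two columns in parallel, one call per row
wellid <- mapply(getWellID, dataFrame$name, dataFrame$plate, USE.NAMES = FALSE)

outputFile <- tempfile(fileext = ".csv")
write.csv(data.frame(wellid = wellid,
                     value1 = dataFrame$value1,
                     value2 = dataFrame$value2),
          file = outputFile, row.names = FALSE)
print(readLines(outputFile))
```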

score 30 · Answer 4 · edited Jul 11 '19 at 17:18


I think the best way to do this with basic R is:

for( i in rownames(df) )
   print(df[i, "column1"])

The advantage over the for( i in 1:nrow(df))-approach is that you do not get into trouble if df is empty and nrow(df)=0: in that case 1:0 evaluates to c(1, 0), so the loop body would still run twice.
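The empty-data-frame case can be checked with a small experiment. This is only a sketch: the empty `df` and the `countIterations` helper are made up for illustration, and the helper simply counts how many times a `for` loop would run over a given index.

```r
# Empty data frame (assumed for illustration)
df <- data.frame(column1 = numeric(0))

# Hypothetical helper: number of iterations a for loop makes over `index`
countIterations <- function(index) {
  n <- 0
  for (i in index) n <- n + 1
  n
}

viaColon    <- countIterations(1:nrow(df))          # 1:0 counts down: c(1, 0)
viaSeqLen   <- countIterations(seq_len(nrow(df)))   # integer(0)
viaRownames <- countIterations(rownames(df))        # character(0)

print(c(colon = viaColon, seq_len = viaSeqLen, rownames = viaRownames))
```

Both `seq_len(nrow(df))` and `rownames(df)` give a zero-length index here, so either avoids the problem.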

answered Jul 16 '17 at 16:07 by Funkwecker · edited Jul 11 '19 at 17:18 by Capt.Krusty

Ł Łaniewski-Wołłk · Answer 5 · 2017-04-12T19:33:30.563

21

I use this simple utility function:

rows = function(tab) lapply(
  seq_len(nrow(tab)),
  function(i) unclass(tab[i,,drop=F])
)

Or a faster, less clear form:

rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,"[",i))

This function just splits a data.frame to a list of rows. Then you can make a normal "for" over this list:

tab = data.frame(x = 1:3, y=2:4, z=3:5)
for (A in rows(tab)) {
    print(A$x + A$y * A$z)
}

Your code from the question will work with a minimal modification:

for (well in rows(dataFrame)) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}
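One detail worth checking when running a loop like this: `cat()` overwrites the file on every call unless `append = TRUE` is given, and it does not add a line break by itself. The sketch below is a self-contained variant under stated assumptions: `getWellID`, the lookup list, and the data values are invented stand-ins, and the output goes to a temporary file.

```r
# Row splitter from this answer
rows <- function(x) lapply(seq_len(nrow(x)), function(i) lapply(x, "[", i))

# Hypothetical stand-in for the real lookup (invented IDs)
wellIndex <- list(plate67 = list(H1 = 101L, H2 = 102L),
                  plate68 = list(A1 = 201L))
getWellID <- function(wellName, plateName) wellIndex[[plateName]][[wellName]]

# Toy data (assumed for illustration)
dataFrame <- data.frame(name = c("H1", "H2", "A1"),
                        plate = c("plate67", "plate67", "plate68"),
                        value1 = c(0.1, 0.2, 0.3),
                        value2 = c(10, 20, 30),
                        stringsAsFactors = FALSE)

outputFile <- tempfile(fileext = ".csv")
for (well in rows(dataFrame)) {
  wellID <- getWellID(well$name, well$plate)
  # append = TRUE keeps earlier lines; "\n" ends each record
  cat(paste(wellID, well$value1, well$value2, sep = ","), "\n",
      sep = "", file = outputFile, append = TRUE)
}

written <- readLines(outputFile)
print(written)
```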

answered Aug 27 '15 at 18:44 by Ł Łaniewski-Wołłk · edited Apr 12 '17 at 19:33

It's faster to access a straight list than a data.frame. – Ł Łaniewski-Wołłk May 15 '16 at 08:38
1

Just realized it's even faster to make the same thing with double lapply: rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,function(c) c[i])) – Ł Łaniewski-Wołłk May 15 '16 at 16:45
So the inner `lapply` iterates over the columns of the entire dataset `x`, giving each column the name `c`, and then extracting the `i`th entry from that column vector. Is this correct? – Aaron McDaid May 16 '16 at 12:03
Very nice! In my case, I had to convert from "factor" values to the underlying value: `wellName <- as.character(well$name)`. – Steve Pitchers Feb 03 '17 at 19:02

Ferran E · Answer 6 · 2015-07-14T14:13:43.250

I was curious about the time performance of the non-vectorised options. For this purpose, I have used the function f defined by knguyen:

f <- function(x, output) {
  wellName <- x[1]
  plateName <- x[2]
  wellID <- 1
  print(paste(wellID, x[3], x[4], sep=","))
  cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

and a dataframe like the one in his example:

n = 100; #number of rows for the data frame
d <- data.frame( name = LETTERS[ sample.int( 25, n, replace=T ) ],
                  plate = paste0( "P", 1:n ),
                  value1 = 1:n,
                  value2 = (1:n)*10 )

I included two vectorised functions (for sure quicker than the others) in order to compare the cat() approach with a write.table() one...

library("ggplot2")
library( "microbenchmark" )
library( foreach )
library( iterators )

tm <- microbenchmark(S1 =
                       apply(d, 1, f, output = 'outputfile1'),
                     S2 = 
                       for(i in 1:nrow(d)) {
                         row <- d[i,]
                         # do stuff with row
                         f(row, 'outputfile2')
                       },
                     S3 = 
                       foreach(d1=iter(d, by='row'), .combine=rbind) %dopar% f(d1,"outputfile3"),
                     S4= {
                       print( paste(wellID=rep(1,n), d[,3], d[,4], sep=",") )
                       cat( paste(wellID=rep(1,n), d[,3], d[,4], sep=","), file= 'outputfile4', sep='\n',append=T, fill = F)                           
                     },
                     S5 = {
                       print( (paste(wellID=rep(1,n), d[,3], d[,4], sep=",")) )
                       write.table(data.frame(rep(1,n), d[,3], d[,4]), file='outputfile5', row.names=F, col.names=F, sep=",", append=T )
                     },
                     times=100L)
autoplot(tm)

The resulting image (figure: ForEachRunningTime) shows that apply gives the best performance for a non-vectorised version, whereas write.table() seems to outperform cat().

RobinL · Answer 7 · 2017-06-03T19:26:01.190

You can use the by_row function from the package purrrlyr for this:

myfn <- function(row) {
  #row is a tibble with one row, and the same 
  #number of columns as the original df
  #If you'd rather it be a list, you can use as.list(row)
}

purrrlyr::by_row(df, myfn)

By default, the returned value from myfn is put into a new list column in the df called .out.

If this is the only output you desire, you could write purrrlyr::by_row(df, myfn)$.out

score 2 · Answer 8 · answered Feb 13 '15 at 15:04

Well, since you asked for R equivalent to other languages, I tried to do this. Seems to work though I haven't really looked at which technique is more efficient in R.

> myDf <- head(iris)
> myDf
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nRowsDf <- nrow(myDf)
> for(i in 1:nRowsDf){
+ print(myDf[i,4])
+ }
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.4

For the categorical columns though, it would fetch you a factor, which you could convert using as.character() if needed.
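A quick check of what indexing returns for the categorical column, using the same `head(iris)` data. This is only a sketch; the variable names are illustrative.

```r
myDf <- head(iris)

cell <- myDf[1, "Species"]      # a single cell from the categorical column
cellAsText <- as.character(cell)
wholeRow <- myDf[1, ]           # leaving out the column index keeps a data frame

print(class(cell))
print(cellAsText)
print(class(wholeRow))
```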

Seyma Kalay · Answer 9 · 2020-08-05T09:41:14.897

You can do something similar for a list of data frames:

data("mtcars")
rownames(mtcars)
data <- list(mtcars ,mtcars, mtcars, mtcars);data

out1 <- NULL 
for(i in seq_along(data)) { 
  out1[[i]] <- data[[i]][rownames(data[[i]]) != "Volvo 142E", ] } 
out1

Or a data frame,

data("mtcars")
df <- mtcars
out1 <- NULL 
for(i in 1:nrow(df)) {
  row <- rownames(df[i,])
  # do stuff with row
  out1 <- df[rownames(df) != "Volvo 142E",]
  
}
out1
