How to add a counting column based on values in other columns in R

Question

I have a relatively large dataset (16,000+ x ~31). In other words, it's large enough that I don't want to manipulate it line by line in Excel. The data is in this form:

block  site     day  X1   X2
1      1        1    0.4  5.1 
1      1        2    0.8  1.1
1      1        3    1.1  4.2
1      2        1    ...  ...
1      2        2
1      2        3
2      3        1
2      3        2
2      3        3
2      4        1
2      4        2
2      4        3

As you can see, the site count is continuous but I would like a column where the site number resets with each block. For example, I would like something like this below:

block  site     day  X1   X2    site2
1      1        1    0.4  5.1   1
1      1        2    0.8  1.1   1
1      1        3    1.1  4.2   1
1      2        1    ...  ...   2
1      2        2               2
1      2        3               2
2      3        1               1
2      3        2               1
2      3        3               1
2      4        1               2
2      4        2               2
2      4        3               2

I was thinking about using the R function rle but am not sure if it will work because of complications with day. Otherwise, I would try something like:

Data$site2 <- sequence(rle(Data$block)$lengths)

Does anyone have any suggestions for adding a column counting (sequence) the number of sites within each block? If it helps, there are the same number of days (263) recorded for each site but there are a different number of sites per block.

score 6 · Accepted Answer · answered Dec 18 '11 at 05:04

6

Here's a slightly clumsy solution using plyr and ddply:

ddply(df,.(block),transform,
                  site1 = rep(1:length(unique(site)),
                             times = rle(site)$lengths))

Or a slightly slicker version:

ddply(df,.(block),transform,site1 = as.integer(as.factor(site)))

There may be a clever way of doing this directly, though, using the various seq, sequence and rle functions, but my brain is a bit hazy at the moment. If you leave this open for a bit someone will likely come along with a slick non-plyr solution.

answered Dec 18 '11 at 05:04

joran

169,992
32
429
468

Thank you, this worked perfectly. I've seen plyr before but never used it. ddply is perfect, I was actually temped to break apart the matrix, use reshape(=wide) on days, apply the rle(site) and then try to reshape(=long). I don't know if it would work but I figured there were about 1000 easier ways to do it. I like the dpdply solution. thanks again. – djhocking Dec 18 '11 at 16:25
Nice one, just tried the ddply option on a similar problem and worked perfectly first time – rg255 Jun 12 '13 at 09:10

score 1 · Answer 2 · answered Dec 18 '11 at 06:04

1

Using tapply could work

# Make some fake data
dat <- data.frame(block = rep(1:3, each = 4), site = rep(1:6, each  = 2), val = rnorm(12))
# For each block reset the count
dat$site2 <- unlist(tapply(dat$site, dat$block, function(x){x - min(x) + 1}))

answered Dec 18 '11 at 06:04

Dason

60,663
9
131
148

This doesn't seem to work for my messy data. I forgot that site is not completely continuous because the dataloggers at some sites failed or went missing. So occasionally throughout the dataset site numbers are skipped, but what I need is a variable that just counts so I can cycle through with indexing in loop functions as part of my analysis. Maybe if I try what you suggest but with some clever function adding to the previous value in site2 rather than basing it on the original site number. – djhocking Dec 18 '11 at 16:00

score 0 · Answer 3 · answered Oct 28 '20 at 17:55

I just wanted to update with an answer using dplyr to implement the approach by @joran for people who find this now.

library(dplyr)

# create data
df <- data.frame(block = rep(1:3, each = 4), 
                 site = rep(1:6, each  = 2), 
                 day = rep(1:2, times = 6), 
                 x = rnorm(12))

df %>%
  group_by(block) %>%
  mutate(site2 = as.integer(as.factor(site)))

The resulting output is:

block  site   day     x    site2
<int> <int> <int>  <dbl>   <int>
1     1       1    0.762     1
1     1       2   -0.612     1
1     2       1    1.06      2
1     2       2   -0.168     2
2     3       1    1.09      1
2     3       2    1.38      1
2     4       1    1.69      2
2     4       2    0.414     2
3     5       1    0.208     1
3     5       2   -0.647     1
3     6       1   -1.01      2
3     6       2   -0.354     2

score 0 · Answer 4 · answered Dec 18 '11 at 12:25

Via ave:

df1 <- structure(list(block = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), 
    site = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4), day = c(1, 
    2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)), .Names = c("block", "site", 
"day"), row.names = c("2", "3", "4", "5", "6", "7", "8", "9", 
"10", "11", "12", "13"), class = "data.frame")

df1$site2 <- ave(df1$site,df1$block,FUN=function(x) match(x,sort(unique(x))))

How to add a counting column based on values in other columns in R

4 Answers4

Linked