5

I have a relatively large dataset (16,000+ x ~31). In other words, it's large enough that I don't want to manipulate it line by line in Excel. The data is in this form:

block  site     day  X1   X2
1      1        1    0.4  5.1 
1      1        2    0.8  1.1
1      1        3    1.1  4.2
1      2        1    ...  ...
1      2        2
1      2        3
2      3        1
2      3        2
2      3        3
2      4        1
2      4        2
2      4        3

As you can see, the site count is continuous but I would like a column where the site number resets with each block. For example, I would like something like this below:

block  site     day  X1   X2    site2
1      1        1    0.4  5.1   1
1      1        2    0.8  1.1   1
1      1        3    1.1  4.2   1
1      2        1    ...  ...   2
1      2        2               2
1      2        3               2
2      3        1               1
2      3        2               1
2      3        3               1
2      4        1               2
2      4        2               2
2      4        3               2

I was thinking about using the R function rle but am not sure if it will work because of complications with day. Otherwise, I would try something like:

Data$site2 <- sequence(rle(Data$block)$lengths)

Does anyone have any suggestions for adding a column counting (sequence) the number of sites within each block? If it helps, there are the same number of days (263) recorded for each site but there are a different number of sites per block.

Kara
  • 6,115
  • 16
  • 50
  • 57
djhocking
  • 1,072
  • 3
  • 16
  • 28

4 Answers4

6

Here's a slightly clumsy solution using plyr and ddply:

ddply(df,.(block),transform,
                  site1 = rep(1:length(unique(site)),
                             times = rle(site)$lengths))

Or a slightly slicker version:

ddply(df,.(block),transform,site1 = as.integer(as.factor(site)))

There may be a clever way of doing this directly, though, using the various seq, sequence and rle functions, but my brain is a bit hazy at the moment. If you leave this open for a bit someone will likely come along with a slick non-plyr solution.

joran
  • 169,992
  • 32
  • 429
  • 468
  • Thank you, this worked perfectly. I've seen plyr before but never used it. ddply is perfect, I was actually temped to break apart the matrix, use reshape(=wide) on days, apply the rle(site) and then try to reshape(=long). I don't know if it would work but I figured there were about 1000 easier ways to do it. I like the dpdply solution. thanks again. – djhocking Dec 18 '11 at 16:25
  • Nice one, just tried the ddply option on a similar problem and worked perfectly first time – rg255 Jun 12 '13 at 09:10
1

Using tapply could work

# Make some fake data
dat <- data.frame(block = rep(1:3, each = 4), site = rep(1:6, each  = 2), val = rnorm(12))
# For each block reset the count
dat$site2 <- unlist(tapply(dat$site, dat$block, function(x){x - min(x) + 1}))
Dason
  • 60,663
  • 9
  • 131
  • 148
  • This doesn't seem to work for my messy data. I forgot that site is not completely continuous because the dataloggers at some sites failed or went missing. So occasionally throughout the dataset site numbers are skipped, but what I need is a variable that just counts so I can cycle through with indexing in loop functions as part of my analysis. Maybe if I try what you suggest but with some clever function adding to the previous value in site2 rather than basing it on the original site number. – djhocking Dec 18 '11 at 16:00
0

I just wanted to update with an answer using dplyr to implement the approach by @joran for people who find this now.

library(dplyr)

# create data
df <- data.frame(block = rep(1:3, each = 4), 
                 site = rep(1:6, each  = 2), 
                 day = rep(1:2, times = 6), 
                 x = rnorm(12))

df %>%
  group_by(block) %>%
  mutate(site2 = as.integer(as.factor(site)))

The resulting output is:

block  site   day     x    site2
<int> <int> <int>  <dbl>   <int>
1     1       1    0.762     1
1     1       2   -0.612     1
1     2       1    1.06      2
1     2       2   -0.168     2
2     3       1    1.09      1
2     3       2    1.38      1
2     4       1    1.69      2
2     4       2    0.414     2
3     5       1    0.208     1
3     5       2   -0.647     1
3     6       1   -1.01      2
3     6       2   -0.354     2
djhocking
  • 1,072
  • 3
  • 16
  • 28
0

Via ave:

df1 <- structure(list(block = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), 
    site = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4), day = c(1, 
    2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)), .Names = c("block", "site", 
"day"), row.names = c("2", "3", "4", "5", "6", "7", "8", "9", 
"10", "11", "12", "13"), class = "data.frame")

df1$site2 <- ave(df1$site,df1$block,FUN=function(x) match(x,sort(unique(x))))
Gregory Demin
  • 4,596
  • 2
  • 20
  • 20