1

I'm currently stuck with my data frame and i would like to know how to do "subsets of subsets of subsets" Here is a part of my data frame:

    YEAR    RN      DATE    NAME            SITE           LONG SP                         SUMNB            NB100
1   2011    RNN027  15056   ESTAGNOL    RNN027-Estagnol 02  310 Anthocharis cardamines (Linnaeus, 1758) 1   0.3225806
2   2011    RNN027  15075   ESTAGNOL    RNN027-Estagnol 02  310 Anthocharis cardamines (Linnaeus, 1758) 1   0.3225806
3   2003    RNN027  12166   ESTAGNOL    RNN027-Estagnol 03  330 Anthocharis cardamines (Linnaeus, 1758) 2   0.6060606
4   2006    RNN027  13252   ESTAGNOL    RNN027-Estagnol 03  330 Anthocharis cardamines (Linnaeus, 1758) 2   0.6060606
5   2006    RNN027  13257   ESTAGNOL    RNN027-Estagnol 03  330 Anthocharis cardamines (Linnaeus, 1758) 2   0.6060606
6   2005    RNN027  12895   ESTAGNOL    RNN027-Estagnol 01  540 Anthocharis cardamines (Linnaeus, 1758) 2   0.3703704

My point is to compute a abundance factor for each species. To do that, i have to isolate every count date for every species, every year, and every site.

My first idea was to do multiple loops and subseting every step by the previous criteria:

DF --> Loop SITE ; subset of each SITE -->loop YEAR; subset of each YEAR -->loop SP; subset of each SPECIES--> dates of observations

The point of isolating these dates require further modifications (adding rows), but i need to be capable of rewriting the modified subsets afterwards and reconstruct a new dataframe.

I built my loops command:

LOOPSITE<-sort(unique(DF$SITE))
for(i in LOOPSITE){
  print(i)
  LOOPSITESUB<-subset(DF,grepl(i,SITE))
  LOOPYEAR<-sort(unique(LOOPSITESUB$YEAR))
  print(LOOPYEAR)

  for(j in LOOPYEAR){
    print(j)
    LOOPYEARSUB<-subset(LOOPSITESUB,grepl(j,YEAR))
    LOOPSP<-sort(unique(LOOPYEARSUB$SP))
    print(length(LOOPSP))
       for(k in LOOPSP){
         print(k)
         LOOPSPSUB<-subset(LOOPYEARSUB,grepl(k,SP))
         print(sum(LOOPYEARSUB$SUMNB))
         print(head(LOOPSPSUB))
    }      
  }
}

I am able to follow that my script is working with all these "print" commands, and it is working until i reach the species subseting. For an unknown reason, the last subsetting dont concern each species, but only some of them. Here is a part of the output for the last SITE and the last YEAR:

"RNN027-Estagnol 01"
...(I skipped all the sites)
"RNN027-Estagnol 06"
"2003"
...(I skipped all the years)
"2011"
[1] 22
[1] "Aricia agestis D., 1775"
[1] 107
   YEAR     RN       DATE      NOM               SITE LONG                      SP SUMNB     NB100
66 2011 RNN027 2011-04-21 ESTAGNOL RNN027-Estagnol 06  260 Aricia agestis D., 1775         1 0.3846154
67 2011 RNN027 2011-05-22 ESTAGNOL RNN027-Estagnol 06  260 Aricia agestis D., 1775     1 0.3846154
68 2011 RNN027 2011-08-05 ESTAGNOL RNN027-Estagnol 06  260 Aricia agestis D., 1775     2 0.7692308
[1] "Brintesia circe (Fabricius, 1775)"
[1] 107
[1] YEAR  RN    DATE  NOM   SITE  LONG  SP    SUMNB NB100
<0 rows> (or 0-length row.names)
[1] "Carcharodus alceae (Esper, 1780)"
[1] 107
[1] YEAR  RN    DATE  NOM   SITE  LONG  SP    SUMNB NB100
<0 rows> (or 0-length row.names)

It is working for "Aricia agestis D., 1775" but not for "Brintesia circe (Fabricius, 1775)". I verified on my dataframe, that second species have been observed at this time and place,and have the same format than the previous one...it should be working.

How many loops can i stack like this ? Is there another way to do that? (it would be convenient and faster). I'm aware of the "split" function, who basically dismont every group, but as i cant exploit every"chunk", it dont fit to my task. I am maybe wrong.

At the last step (after modifing all the subsets), i should be able to write each subset in a new dataframe to reconstruct a modified version of my input.

I'm am maybe on the wrongest way i possibly can go! I can provide further explanations if needed!

Thanks for your help!

EDIT:

I'll try to explain what i want to do. In order to calculate my abundance index, i need to add "blank" rows before and after each temporal "session" of observation. Basically, i try to obtain a subset for every combination of 3 differents factors (SITE, YEAR and SP).

Here is an example of the type of output i would like to obtain. For every SITE X/YEAR Y/SP Z possible combination:

 YEAR    RN      DATE    NAME            SITE           LONG SP  SUMNB NB100
----ADD A NEW ROW----DATE MINUS 7 DAYS-----------------------------------------------------------------------------------
1   Y    RNN027  15056   ESTAGNOL    RNN027-Estagnol X  310 SP Z  1   0.3225806
2   Y    RNN027  15075   ESTAGNOL    RNN027-Estagnol X  310 SP Z  1   0.3225806
3   Y    RNN027  12166   ESTAGNOL    RNN027-Estagnol X  330 SP Z  2   0.6060606
4   Y    RNN027  13252   ESTAGNOL    RNN027-Estagnol X  330 SP Z  2   0.6060606
5   Y    RNN027  13257   ESTAGNOL    RNN027-Estagnol X  330 SP Z  2   0.6060606
6   Y    RNN027  12895   ESTAGNOL    RNN027-Estagnol X  540 SP Z  2   0.3703704
----ADD A NEW ROW----DATE PLUS 7 DAYS-----------------------------------------------------------------------------------

Then i rewrite and compile every modified subset in a new DF.

EDIT 2: The use of "split(DF, list(DF$SITE, DF$YEAR, DF$SP))" crashed my computer, unless I dropped the unused values. I got exactly what I want, but how can I access and modify every subset ?

user2542995
  • 241
  • 2
  • 4
  • 11
  • Could you please add the output of `dput(head(DF))` to your question? – Sven Hohenstein Dec 06 '13 at 12:08
  • If I understand you correctly, it seems that you want to 'calculate something per group' - abundance per species, site and year. If so, you might look have a look [**here**](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega) and [**here**](http://lamages.blogspot.se/2012/01/say-it-in-r-with-by-apply-and-friends.html) for some ideas. And heaps of posts on this topic on SO. Cheers. – Henrik Dec 06 '13 at 12:21
  • For now, i dont want to calculate anything. In order to calculate my index without underestimating anything, I need to add a row before and after each subset. These rows will have a date set on 7 days before the first date and after the last date of observation. If i understand what is said in your first link (SO), i could apply that function with "tapply" but only for one factor right ? – user2542995 Dec 06 '13 at 14:04
  • To be honest I'm not clear with what you want, but is this of any use: `split(DF, list(DF$SITE, DF$YEAR, DF$SP), sep = " _ ")` and `lapply(split(DF, list(DF$SITE, DF$YEAR, DF$SP), sep = " _ "), function(x) x$DATE)`? Maybe you could add a desired output? – alexis_laz Dec 06 '13 at 14:24

2 Answers2

3

I suppose you are looking for aggregate.

aggregate(SUMNB ~ SITE + YEAR + SP, DF, sum)

#                 SITE YEAR                                      SP SUMNB
# 1 RNN027-Estagnol 03 2003 Anthocharis cardamines (Linnaeus, 1758)     2
# 2 RNN027-Estagnol 01 2005 Anthocharis cardamines (Linnaeus, 1758)     2
# 3 RNN027-Estagnol 03 2006 Anthocharis cardamines (Linnaeus, 1758)     4
# 4 RNN027-Estagnol 02 2011 Anthocharis cardamines (Linnaeus, 1758)     2

The command calculates the sum of all values in SUMNB for each combination of SITE, YEAR and SP.


Edit

Does the following code produce what your're looking for?

do.call(rbind, by(DF, DF[c("SITE", "YEAR", "SP")], FUN = function(x) {
  tmp <- x[c(1, seq(nrow(x)), nrow(x)), ]
  tmp$DATE[1] < tmp$DATE[1] - 7
  tmp$DATE[nrow(tmp)] <- tmp$DATE[nrow(tmp)] + 7
  return(tmp)
}))
Sven Hohenstein
  • 80,497
  • 17
  • 145
  • 168
  • In order to build my input DF, i already did aggregate. As explained in the other answer's comments, i need to add rows before and after each "chunk". Aggregate dont allow me to do that. – user2542995 Dec 06 '13 at 14:06
  • @user2542995 Which values do you want in the other columns of the additional rows? – Sven Hohenstein Dec 06 '13 at 18:28
  • Yes it is working! I was able to set the values of SUMNB and NB100 to zero as i wanted. Thanks a lot! – user2542995 Dec 10 '13 at 11:19
1

Based on your edits, I believe this could be useful:

set.seed(11)
DF <- data.frame(YEAR = sample(c(2001, 2003), 5, T),     #random data
                 SITE = sample(c("a", "b"), 5, T),
                 SP = sample(c("sp1", "sp2"), 5, T),
                 DATE = sample(12345:15678, 5))

res <- lapply(split(DF, list(DF$SITE, DF$YEAR, DF$SP)), 
                function(x) 
                {
                 if(nrow(x) > 0)
                  {
                   row1 <- x[1,]
                   names(row1) <- colnames(x)
                   row1["DATE"] <- x$DATE[1] - 7

                   rown <- x[nrow(x),]
                   names(rown) <- colnames(x)
                   rown["DATE"] <- x$DATE[nrow(x)] + 7

                   rbind(row1, x, rown)
                  } 
                })
DF2 <- do.call(rbind, res)
rownames(DF2) = seq_len(nrow(DF2))

DF
#  YEAR SITE  SP  DATE
#1 2001    b sp1 14257
#2 2001    a sp1 13950
#3 2003    a sp2 13446
#4 2001    b sp2 12870
#5 2001    a sp2 13943
DF2
#   YEAR SITE  SP  DATE
#1  2001    a sp1 13943
#2  2001    a sp1 13950
#3  2001    a sp1 13957
#4  2001    b sp1 14250
#5  2001    b sp1 14257
#6  2001    b sp1 14264
#7  2001    a sp2 13936
#8  2001    a sp2 13943
#9  2001    a sp2 13950
#10 2001    b sp2 12863
#11 2001    b sp2 12870
#12 2001    b sp2 12877
#13 2003    a sp2 13439
#14 2003    a sp2 13446
#15 2003    a sp2 13453
alexis_laz
  • 12,884
  • 4
  • 27
  • 37