
I have a large dataframe (300k + rows) based on animals positions (GPS).

Something like this:

idAnimal    date        elevation   sex   Distance  park     Presence
animal1     01-09-2018  2376        M     678       park1    1
animal1     01-09-2018  2402        M     1023      park1    1
animal1     01-09-2018  2366        M     933       park1    1
animal1     02-09-2018  2402        M     239       park1    1
animal1     02-09-2018  2428        M     423       park1    1
animal1     02-09-2018  2376        M     817       park1    1
animal1     02-09-2018  2354        M     1073      park1    1
animal1     03-09-2018  2337        M     210       park1    1
animal1     03-09-2018  2334        M     967       park1    1
animal1     03-09-2018  2406        M     242       park1    1
animal2     04-09-2018  2231        F     547       park1    0
animal2     04-09-2018  2343        F     506       park1    0
animal2     04-09-2018  2306        F     1190      park1    0
animal2     04-09-2018  2177        F     1219      park1    0
animal2     05-09-2018  2206        F     271       park1    0
animal2     05-09-2018  2318        F     142       park1    0
animal3     05-09-2018  2324        F     263       park2    1
animal3     05-09-2018  2259        F     996       park2    1
animal3     06-09-2018  2396        F     54        park2    1
animal3     06-09-2018  2436        F     1129      park2    1
animal3     06-09-2018  2380        F     811       park2    1

I created a subset that keeps one observation per day per animal (to avoid temporal autocorrelation):

data %>% group_by(idAnimal, date) %>% sample_n(size = 1)

What I want to do is create n such subsets (say, 100) under the same condition (one observation per day per animal) and then fit a binomial mixed model to each of them. The goal is to use all of my data, not just a part of it.
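For reference, the repeated sampling step could be sketched in base R as follows (a sketch only: `one_per_day` is an illustrative helper name, and `data` is assumed to be the full data frame shown above; the dplyr pipeline above does the same thing for a single pass):

```r
set.seed(1)  # reproducibility of the random draws

# Keep one random row per animal per day: split the frame into
# animal/day pieces and draw one row from each piece
one_per_day <- function(d) {
  pieces <- split(d, interaction(d$idAnimal, d$date, drop = TRUE))
  do.call(rbind, lapply(pieces, function(p) p[sample(nrow(p), 1), , drop = FALSE]))
}

# Repeat the sampling 100 times to obtain a list of subsets:
# subsets <- replicate(100, one_per_day(data), simplify = FALSE)
```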

What worries me most is how to run such a model across 100 data frames while keeping the estimates together:

glmer(Presence ~ Distance * sex + elevation * sex + (1 | park), family = binomial, data = data)

Thank you for any help.

Baffo
  • Is there a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? Besides, what are `Presence` and `park`, and how are they defined? They are not mentioned in the question. Include them in the example data too. – Matthew Hui Apr 18 '20 at 09:39
  • Your `date` isn't a date. It looks like a POSIXct so be careful when grouping to give "per day". You'll need to convert or create a date variable. – Edward Apr 18 '20 at 09:44
  • Wait.. sounds like you are going to randomly take 1 sample from each animal and run the model ... doesn't this ignore the time information you have and simply introduce more unaccounted errors into your model? – StupidWolf Apr 18 '20 at 10:13
  • you can use purrr to keep your models together , see this https://r4ds.had.co.nz/many-models.html – StupidWolf Apr 18 '20 at 10:14
  • @MatthewHui I just edited the example. Presence is a number: 0 or 1, where 1 is associated with a real observation (presence) and 0 with a fictitious observation (which represents absence). This was made to use a binomial family. Park is just the name of the park of origin for the animal. I'm sorry if the example is not reproducible but this is my first post. Plus I'm new in the world of statistics. – Baffo Apr 18 '20 at 18:23
  • @Edward You're right. I've already done that but forgot to report into the example (now updated). Sorry – Baffo Apr 18 '20 at 18:25
  • @StupidWolf Yes, I do ignore that information: my goal is to understand if the presence of animals is related to the distance from a disturbance point. I sincerely don't know if this can produce errors: I'm new to statistics. – Baffo Apr 18 '20 at 18:30

1 Answer


A solution; note, however, that the `glmer` call is not actually run below, because the `Presence` vector is not included in the sample data:

# Install packages if they are not already installed: necessary_packages => character vector
necessary_packages <- c("lme4")

# Create a vector containing the names of any packages needing installation:
# new_packages => character vector
new_packages <- necessary_packages[!(necessary_packages %in%
                                       installed.packages()[, "Package"])]

# If the vector has more than 0 values, install the new packages
# and their associated dependencies: varied => stdout
if (length(new_packages) > 0) {
  install.packages(new_packages, dependencies = TRUE)
}

# Initialise the packages in the session: bool => stdout
lapply(necessary_packages, require, character.only = TRUE)

# Append a grouping vector to df whose values are a concatenation of
# idAnimal and the date truncated to day: group => character vector
df$group <- paste(df$idAnimal,
                  gsub("\\s+|[[:punct:]]", "_",
                       gsub(" .*", "", df$date)),
                  sep = "_")

# Allocate some memory by building an empty list the same length 
# as the number of unique group values: df_list => list
df_list <- vector("list", length(unique(df$group)))

# Replicate the list in order to store the results of glm: 
# glm_res_list => list
glm_res_list <- df_list

# Store each animalID day subset: df_list => list 
df_list <- split(df, df$group)

# Run an lme4 glmer model on each list element, store the result
# in glm_res_list: glm_res_list => list
# Note: not run as Presence vector is not included in sample data
glm_res_list <- lapply(df_list, function(x){
  glmer(Presence ~ Distance * sex + elevation * sex + (1 | park),
         family = binomial,
         data = x)
  }
)
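On keeping the estimates together across the many fits: one simple, pragmatic sketch is to extract the fixed effects from each fitted model and average them. This is plain averaging, not a formal pooling rule such as Rubin's rules for multiple imputation; `pool_estimates` is an illustrative name, not part of lme4:

```r
# Average fixed-effect estimates across a list of named coefficient
# vectors (one vector per fitted model); plain column-wise mean
pool_estimates <- function(coef_list) {
  colMeans(do.call(rbind, coef_list))
}

# With the fitted models, this might be used as:
# pooled <- pool_estimates(lapply(glm_res_list, fixef))
```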

Data:

df <- structure(
  list(
    idAnimal = c(
      "animal1",
      "animal1",
      "animal1",
      "animal1",
      "animal1",
      "animal1",
      "animal1",
      "animal1",
      "animal1",
      "animal1",
      "animal2",
      "animal2",
      "animal2",
      "animal2",
      "animal2",
      "animal2",
      "animal3",
      "animal3",
      "animal3",
      "animal3",
      "animal3"
    ),
    date = structure(
      c(
        1535724000,
        1535767200,
        1535788800,
        1535810400,
        1535832000,
        1535853600,
        1535875200,
        1535896800,
        1535940000,
        1535961600,
        1535983200,
        1536004800,
        1536026400,
        1536048000,
        1536069600,
        1536091200,
        1536112800,
        1536134400,
        1536156000,
        1536177600,
        1536220800
      ),
      class = c("POSIXct",
                "POSIXt"),
      tzone = "Australia/Sydney"
    ),
    elevation = c(
      2376L,
      2402L,
      2366L,
      2402L,
      2428L,
      2376L,
      2354L,
      2337L,
      2334L,
      2406L,
      2231L,
      2343L,
      2306L,
      2177L,
      2206L,
      2318L,
      2324L,
      2259L,
      2396L,
      2436L,
      2380L
    ),
    sex = c(
      "M",
      "M",
      "M",
      "M",
      "M",
      "M",
      "M",
      "M",
      "M",
      "M",
      "F",
      "F",
      "F",
      "F",
      "F",
      "F",
      "F",
      "F",
      "F",
      "F",
      "F"
    ),
    Distance = c(
      678L,
      1023L,
      933L,
      239L,
      423L,
      817L,
      1073L,
      210L,
      967L,
      242L,
      547L,
      506L,
      1190L,
      1219L,
      271L,
      142L,
      263L,
      996L,
      54L,
      1129L,
      811L
    )
  ),
  row.names = c(NA,-21L),
  class = "data.frame"
)
hello_friend
  • I tested the code with a small part of the dataframe and it seems to work. Thank you a lot. One thing I don't understand is how to view the overall summary (with estimates and p-values) at the end of the process. I'm currently running the model with full data. – Baffo Apr 18 '20 at 18:36
  • @Baffo `lapply(glm_res_list, summary)`. Please upvote and accept my answer. – hello_friend Apr 19 '20 at 03:39