1

I'm looking at transitions from place to place. I will simplify my problem to make it clearer what I'm trying to do.

Suppose there are 3 places a person can go: A, B, C.

I currently have a data frame that captures transitions from place to place (it is allowed that sequential transitions can be to the same place; e.g., A to A).

Origin Dest Time 
     A    B  Mon
     B    C  Wed
     C    B  Fri

I also have created a distance matrix that captures distance from place to place.

   A  B  C
A  0  8 11
B  8  0  6
C 11  6  0

I also have a "popularity" data frame that captures how many times total any person went to A, B, C prior to some time.

  Popularity
B         47
C         32
A         25

What I'm looking to do is create a data frame (for each transition) with each row being a possible destination that the person could've went with the target variable coded 0 or 1 based on whether the person actually went to that destination. In other words, the data frame for the first transition in the above transitions data frame (A to B on Monday) should be:

 Origin  Dest  Went?  Dist  Time_Dest  Pop
      A     A      0     0      Mon_A   25
      A     B      1     8      Mon_B   47
      A     C      0    11      Mon_C   32

I want to iterate this, creating a different data frame for each transition (each row in the transitions data frame) and then append all of these data frames together to create a large data frame that will allow me to run a massive logistic regression to predict probabilities of next destination, given an origin and time.

Could someone possibly provide any insight as to how to write a function to do this? Or perhaps there is an easier way?

Thanks so much for any help.

jp334
  • 165
  • 1
  • 9
  • 1
    You can use `split()` on your first dataframe. This will create a list of dataframes where each dataframe corresponds to one of the possible pairwise transitions. The code would look something like: `splitdf <- split(df, list(df$Origin, df$Dest))` – Marcus Campbell Apr 08 '18 at 21:28

1 Answers1

1

Most of this is can be accomplished with a few merge statements. I'm using the tidyverse suite of packages to do the work, but you can very easily do this in base R. I'll point out the changes - but the biggest will be the use of temporary variables or nesting instead of pipes. The pipe command %>% is just going to call the next function in the chain with the previous result as the first argument.

library(tidyverse)

# generating your data
locations <- LETTERS[1:3]
n_locations <- length(locations)

# using base R, use the function expand.grid instead of crossing
location_combinations <- crossing(Origin = locations, Dest = locations)

dist_matrix <- matrix(0,nrow = n_locations, ncol = n_locations)
dist_matrix[lower.tri(dist_matrix)] <- c(8, 11, 6)
dist_matrix <- dist_matrix + t(dist_matrix)

transitions <- data_frame(
  Origin = locations,
  Dest = locations[c(2,3,2)],
  Time = c("Mon", "Wed", "Fri")
)

# Make "Dest" a vector instead of the rownames to work with it a little more easily.
popularity <- data_frame(
  Dest = locations,
  Popularity = as.integer(c(25, 47, 32))
)

# left_join can be replaced with "merge" using base R.  
# mutate can be replaced by defining/redefining each variable separately, or using the "within" command.   
tmp <- location_combinations %>%
  left_join(transitions, by = c("Origin", "Dest")) %>%
  left_join(popularity, by = "Dest") %>%
  mutate(
    Origin = as_factor(Origin),
    Dest = as_factor(Dest),
    `Went?` = !is.na(Time),
    Time_Dest = paste(Time, Dest, sep = "_"),
    index = (as.numeric(Origin)-1) * n_locations + as.numeric(Dest),
    Dist = dist_matrix[(as.numeric(Origin)-1) * length(locations) + as.numeric(Dest)]
  ) %>%
  select(-Time)
tmp

This gives you almost what you want. Two differences - first, I left Went? as a logical vector instead of 1/0. Multiply by 1 to fix this if needed for logistic regression. The other difference is the "Time_Dest" column, which doesn't have a date for an event that didn't happen. In other words, "instead of "Mon_A" for A to A, it sees "NA_A". If this is a big problem, I can almost certainly address this with another merge/join, so let me know if you need it and can't figure it out. (Hint - do a 2nd merge with Transitions data frame, but with by = origin).

To see partial work (and better understand pipes, you can run pieces of this code. For example, try

location_combinations %>%
  left_join(transitions, by = c("Origin", "Dest"))

Alright, so now you (more or less) have the entire data set in one spot. To split it, there are several options.

  1. You can use split to split it up by Origin. The code looks like

    list_of_dfs <- split(tmp, tmp$Origin)
    

    This produces exactly what you asked for, a list of data frames which can be analyzed separately.

  2. You can use group_by function in the dplyr package (part of tidyverse.) An example using this approach is at Linear Regression and group by in R. The caveat here is that the do function is/will be depreciated, so this isn't a solution that will work forever. I haven't needed it recently, so I'm not sure what the "new" solution is, but this, in combination with the broom package can almost certainly help you to organize your results. (See https://cran.r-project.org/web/packages/broom/vignettes/broom_and_dplyr.html).

Update to include all possible destinations

location_combinations %>%
  left_join(transitions, by = c("Origin", "Dest")) %>%
  left_join(transitions %>% select(Origin, Time), by = "Origin") %>%
  left_join(popularity, by = "Dest") %>%
  mutate(
    Origin = as_factor(Origin),
    Dest = as_factor(Dest),
    `Went?` = !is.na(Time.x),
    Time_Dest = paste(Time.y, Dest, sep = "_"),
    index = (as.numeric(Origin)-1) * n_locations + as.numeric(Dest),
    Dist = dist_matrix[(as.numeric(Origin)-1) * length(locations) + as.numeric(Dest)]
  ) %>%
  select(-Time.x, -Time.y, -index)
Melissa Key
  • 4,476
  • 12
  • 21
  • Thanks a lot for your help. I really appreciate it. – jp334 Apr 09 '18 at 00:26
  • Melissa, I'm having trouble replacing the "NA_Dest". It is important that that corresponds to the time and destination for all possible destinations the person could have went, even if he/she did not go there. I'm a bit confused how to do the second merge with the transitions df with by = "Origin". Could you explain? Thanks so much for any help. – jp334 Apr 09 '18 at 21:28
  • Look at the change - I've added another merge with the transition df, but ignoring destination completely. We now have two copies of ``Time``, and the merge will label these ``Time.x`` and ``Time.y`` automatically (``Time.x`` has NAs, ``Time.y`` does not), so we just need to make sure we use the right one for ``Went?`` and ``Time_Dest``. – Melissa Key Apr 09 '18 at 22:09
  • Thank you so much. I understand now. – jp334 Apr 10 '18 at 02:06