I have a question about the best way to set up the R {targets} package to track files and incrementally update a big dataset.
I've read other posts, including this one, but none answer this question.
To illustrate what I need to accomplish, consider the following reprex:
Different family members are traveling to different cities. Build a tibble to store this information:

```r
library(magrittr) # for %>%, used below

city_log <- tibble::tibble(
  city = c("new_york", "sf", "tokyo"),
  traveler = list(c("Bob", "Mary", "Johnny", "Jenny"),
                  c("Bob", "Mary", "Jenny"),
                  c("Johnny", "Jenny"))
)
```
The goal would be to take this city-based information and convert it to person-based information.
```r
traveler_log_full <- # a separate object, because I need to re-use traveler_log_full
  city_log %>%
  tidyr::unnest("traveler")

traveler_log <-
  traveler_log_full %>%
  dplyr::nest_by(traveler, .key = "cities") %>%
  dplyr::ungroup() %>%
  dplyr::mutate(num_cities = purrr::map_dbl(cities, ~ nrow(.x))) # number of cities visited per person
```
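With the toy data above, this should yield one row per traveler, roughly:

```r
traveler_log
#> # A tibble: 4 x 3
#>   traveler cities           num_cities
#>   <chr>    <list>                <dbl>
#> 1 Bob      <tibble [2 x 1]>          2
#> 2 Jenny    <tibble [3 x 1]>          3
#> 3 Johnny   <tibble [2 x 1]>          2
#> 4 Mary     <tibble [2 x 1]>          2
```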
The challenge: an updated dataset

The challenge is that this dataset will be updated often. I want to reuse the computation already stored in traveler_log_full to update it with the new rows, and then remake the final traveler_log with the summary stats:
```r
city_log_updated <- tibble::tibble(
  city = c("new_york", "sf", "tokyo", "paris"),
  traveler = list(c("Bob", "Mary", "Johnny", "Jenny"),
                  c("Bob", "Mary", "Jenny"),
                  c("Johnny", "Jenny"),
                  c("Bob", "Mary"))
)
```
I could filter out the old cities to get only the new ones:

```r
old_cities <- unique(traveler_log_full$city)

city_log_updated %>%
  dplyr::filter(!city %in% old_cities)
```
Given that I have 7.7M cities and 20,000 travelers, I do not want to recalculate traveler_log_full from scratch each time a new city_log_updated arrives.
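Outside of targets, the incremental update itself is simple. A minimal sketch (the helper name update_traveler_log_full is mine, and it assumes city uniquely identifies rows):

```r
# Hypothetical helper: unnest only the new cities and append them to the
# existing person-level log, avoiding a full recomputation.
update_traveler_log_full <- function(traveler_log_full, city_log_updated) {
  new_rows <- city_log_updated %>%
    dplyr::filter(!city %in% unique(traveler_log_full$city)) %>%
    tidyr::unnest("traveler")
  dplyr::bind_rows(traveler_log_full, new_rows)
}
```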
How can I set up R targets to carry out this task?
- I have read all the documentation on targets/targetopia.
- I do not want to use dynamic branching, because if the dynamic branches change, then I will have to regenerate all of the intermediate targets.
- I considered static branching via tar_map(), but there are no values that I would use for iteration.
- I think the ideal would be to manually break the big file (7.7M cities) into 10 smaller files (manually assigning an idx?) and map along those; see the sketch after this list.
- Then, when an updated dataset arrives, create a new file containing just the new cities.
- An added challenge is that city_log_updated is technically called city_log, the same name as the original. So if that file gets updated, targets will trigger the generation of all of the intermediate objects too.
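To make the chunking idea concrete, here is a minimal sketch of the pipeline I am imagining, not a working solution: the data/chunks/*.rds paths are hypothetical, and it assumes the 7.7M-city file has already been split into 10 .rds chunks on disk.

```r
# _targets.R -- a sketch of the chunked pipeline idea above.
library(targets)
library(tarchetypes)
tar_option_set(packages = c("dplyr", "tidyr", "purrr"))

chunks <- tar_map(
  values = list(idx = 1:10),
  # Track each chunk file; a branch reruns only if its file changes.
  tar_target(chunk_file, sprintf("data/chunks/city_log_%02d.rds", idx),
             format = "file"),
  tar_target(chunk_unnested,
             tidyr::unnest(readRDS(chunk_file), "traveler"))
)

list(
  chunks,
  # Row-bind the per-chunk results into the full person-level log ...
  tar_combine(traveler_log_full, chunks[["chunk_unnested"]],
              command = dplyr::bind_rows(!!!.x)),
  # ... and recompute the per-person summary from it.
  tar_target(traveler_log,
             traveler_log_full %>%
               dplyr::nest_by(traveler, .key = "cities") %>%
               dplyr::ungroup() %>%
               dplyr::mutate(num_cities = purrr::map_dbl(cities, nrow)))
)
```

The hope is that an update only touches the chunk file holding the new cities, so the other nine branches stay valid even though the incoming file is still called city_log. Is this a reasonable pattern, or is there a more idiomatic way?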
Thanks in advance for your help!