Merge many pairs of data together into separate data sets

Question

I have about 100 dataset pairs that need to be merged into single data sets, I have looked at posts that show how to merge many data sets together at once (e.g., here and here), but my issue is unique. My real-world data are stored on my hard drive and are similarly named (e.g., household2010, household2011, household2012 and person2010, person2011, person2012. They don't need to be loaded into the global environment.Example below.

Working data:

library(tidyverse)

household2010 <- tribble(
~id, ~var2, ~var3, ~var4, ~var5,
"1",   "1",   "1",   "a",   "d",
"2",   "2",   "2",   "b",   "e",
"3",   "3",   "3",   "c",   "f"
)

person2010 <- tribble(
~id, ~var6, ~var7,
"1",   "1",   "1",
"2",   "2",   "2",
"3",   "3",   "3",
"4",   "4",   "4"
)

household2011 <- tribble(
~id, ~var8, ~var9, ~var10,
"1",   "1",   "1",    "1",
"2",   "2",   "2",    "2",
"3",   "3",   "3",    "3",
"4",   "4",   "4",    "4"
)

person2011 <- tribble(
~id, ~var11, ~var12, ~var13,
"1",   "1",   "1",    "1",
"2",   "2",   "2",    "2",
"3",   "3",   "3",    "3",
"4",   "4",   "4",    "4",
"5",   "5",   "5",    "5"
 )

I need to merge household2010 with person2010 and create a new dataset called hhperson2010. I need to do this to household2011 and person2011 too. Individually I could do:

hhperson2010 <- left_join(household2010, person2010, by = "id")

hhperson2011 <- left_join(household2011, person2011, by = "id")

This gets clunky when I have over a 100 data pairs. Can I use lapply to have it go through a list of the data sets and merge? Something like:

dflist1 <- list(household2010, household2011)

dflist2 <- list(person2011,    person2011)

lapply(function(x) left_join(dflist, dflist2, by = "id")

Thanks @d.b., but this doesn't work as it creates one dataset. I need to the pairs to be merged into separate datasets. — scottsmith, Dec 26 '17 at 19:25
Why don't you do it at inhale. These 100 tables come from somewhere. Are you importing them from files? `lapply` there and never have 100 objects in the memory. Give us a little more idea of data provenance, so we can suggest smarter way of avoiding 100 variables in the global environment — dmi3kno, Dec 26 '17 at 19:41
The files are all RData objects stored on my hard drive (edited my example above). If I could do it all at once without actually loading them into the global environment, that would be great. — scottsmith, Dec 26 '17 at 19:47

pogibas · Accepted Answer · 2017-12-26T19:44:45.880

1

Maybe something like this:

years <- 2010:2011
result <- lapply(years, 
              function(x) left_join(get(paste0("household", x)), 
                                    get(paste0("person", x)), 
                                    "id"))

names(result) <- paste0("household", years)

edited Dec 26 '17 at 19:44

answered Dec 26 '17 at 19:18

pogibas

27,303
19
84
117

Thanks @PoGibas, but this saves an empty file. I wish the `save` command would accept something like you have. I prefer to save the files as `Rdata` rather than `RDS`. – scottsmith Dec 26 '17 at 19:31
I removed the need to `save` the datasets, if that makes it easier. – scottsmith Dec 26 '17 at 19:41
@scottsmith I updated my answer with saving result to list and naming list entries with wanted names. You can save this list as single object or extract entries with `result[["householdYEAR"]]` – pogibas Dec 26 '17 at 19:45
Thanks @PoGibas, this works! Following @dmi3kno's suggestion, is there a way to modify your code to call a list of files from my hard drive so I don't have to load `household201x` and `person201x` files into the global environment? Maybe using something like `files <- list.files(pattern = "household*.RData)` for both the `household` and `person` files and put that into the `function`? – scottsmith Dec 26 '17 at 20:00
@scottsmith 1. write a loop to iterate over one year at a time; 2. save all year data into a single tibble object. – pogibas Dec 26 '17 at 20:05
@scottsmith, are those bunch of .Rdata files? – dmi3kno Dec 26 '17 at 21:49

Vishal786btc · Answer 2 · 2017-12-27T15:40:44.943

0

just an alternate solution:
years <- c("2010", "2011", "2014") 

for (x in years){
  result <- merge(get(paste0("household", x)), get(paste0("person", x)), "id")
  names <- paste0("household", x)
  print(names)
  print(result)
}

you can make a choice between loop or lapply depending on your further processing if any. In case, you don't have any more to do with your dataset, I think lapply would just solve the purpose.

edited Dec 27 '17 at 15:40

answered Dec 26 '17 at 19:33

Vishal786btc

428
1
5
17

Thanks @Vishal786btc, but this I would still have to manually do `hhperson2010 = merge(x = df1, y = df2, by = "id", all.x = TRUE)` for each year, and I am trying to avoid that. – scottsmith Dec 26 '17 at 19:38
1

This is exactly want OP want to avoid – pogibas Dec 26 '17 at 19:47
1

oh okay. Thanks @PoGibas for pointing that. your answer looks good then. I'll fix mine. Thanks. – Vishal786btc Dec 27 '17 at 15:34

score 0 · Answer 3 · answered Dec 26 '17 at 22:54

Here's how I would do it with tidyverse and list-columns

library(dplyr)
library(tidyr)
library(purrr)

env2listcol <- function(rdata_file)
{
  e <- new.env()
  load(rdata_file, envir = e)
  # since you know that there's only 1 df in each environment
  as.list.environment(e)[[1]] 
}

# assuming files are stored in `input` folder
dir("input", full.names = T) %>% as_tibble() %>% 
  # split the path
  separate(value, into=c("dir", "file", "ext"), remove=FALSE) %>% 
  # get the category and the key in separate columns
  extract(file, into=c("key", "year"), regex="([a-z]+)(\\d+)") %>% 
  # file path by category by year, remove unnecessary columns
  spread(key, value) %>% select(-dir, -ext) %>% 
  # extract dataframes from environments, and join them
  mutate(household=map(household, env2listcol),
         person=map(person, env2listcol),
         joined=map2(household, person, left_join)) %>% 
  # rbind joined tables, although you could pull(joined) or imap over it
  unnest(joined)

#> # A tibble: 7 x 14
#>    year    id  var2  var3  var4  var5  var6  var7  var8  var9 var10 var11 var12 var13
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1  2010     1     1     1     a     d     1     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 2  2010     2     2     2     b     e     2     2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 3  2010     3     3     3     c     f     3     3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 4  2011     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     1     1     1     1     1     1
#> 5  2011     2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     2     2     2     2     2     2
#> 6  2011     3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     3     3     3     3     3     3
#> 7  2011     4  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     4     4     4     4     4     4

You decide what you want to do with it. You may write it back to R objects (please, please, please... use Rds instead). You may write it back into a single table (I believe it is MUCH easier to handle going forward). You can probably even export it as json.

Merge many pairs of data together into separate data sets

3 Answers3