-3

I have 2 different tables from 2 different samples, TABLE 1 and TABLE 2(with different number of rows or individuals). The 2 tables contain the same variables : AGE (quantitative), PROVINCE (qualitative with 11 choices), PLACE (urban or rural), AGE AT MARRIAGE, PARITY or number of children (quantitative).
From TABLE 1, I want to create a new variable (PAR_REF) in TABLE 2 defined as the average PARITY achieved by an individual from the same PROVINCE , the same PLACE, with the same AGE and with the same AGE AT MARRIAGE as a woman in TABLE 1. I mean, the values for the new variable I want to create(PAR_RF)in TABLE 2 must be the average PARITY for all individuals having the same characteristics (for the variables AGE, PROVINCE, PLACE, AGE AT MARRIAGE) in TABLE 1. This is what I want to do

enter image description here

What is the process?

Nimantha
  • 6,405
  • 6
  • 28
  • 69
  • 4
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 17 '21 at 19:16
  • @MrFlick It's a rather common task, and the description given looks enough to me. –  Jul 18 '21 at 07:51
  • Did you notice that your image does not show what you describe? Among the rows you group to compute the average, there are different values of the four variables that are supposed to be identical. –  Jul 18 '21 at 07:54
  • There was a mistake in the previous image. I have updated it. Thank you ! – user16470746 Jul 18 '21 at 08:00
  • There is still one on row 5. AGE AT MARRIAGE. And row 2 should be counted with the yellow group. –  Jul 18 '21 at 08:39
  • No, it should be counted in the grew group. I've just updated the image content – user16470746 Jul 18 '21 at 09:09
  • The new variable in TABLE 2, PAR_REF, can also be the PARITY (number of children) of the nearest neighbor in TABLE 1. I do not know what is easier between both methods – user16470746 Jul 18 '21 at 09:24

1 Answers1

0

On TABLE1, group_by those variables and summarize, then join the result with TABLE2.

However, for this to work, you need in TABLE1 at least one individual of the same province/place/age/age of marriage as every one found in TABLE2, otherwise the join will leave missing values.

library(tidyverse)

TABLE_REF <- TABLE1 %>%
  group_by(AGE, PROVINCE, PLACE, AGE_AT_MARRIAGE) %>%
  summarize(PAR_REF = mean(PARITY)) %>%
  ungroup()

TABLE2 %>% left_join(TABLE_REF, by = c("AGE", "PROVINCE", "PLACE", "AGE_AT_MARRIAGE"))
  • Jean-Claude Arbaut, when running the scripts, i get the following error message. Have I to create first an empty column (PAR_REF) in TABLE2 ? Error: Join columns must be present in data. x Problem with `AGE, PROVINCE, PLACE, AGE_AT_MARRIAGE`. Run `rlang::last_error()` to see where the error occurred. – user16470746 Jul 18 '21 at 08:55
  • @user16470746 My bad. the `by` argument is a vector of characters, with one string for each varaible name. I typed too quickly in RStudio, it added quotes around the whole group of variables, and of course there is no variable named "AGE, PROVINCE, PLACE, AGE_AT_MARRIAGE". I saw it, but I forgot to make the correction in my answer. –  Jul 18 '21 at 09:00
  • Jean-Claude Arbaut, THANK YOU SO MUCH. Everything is now OKEY – user16470746 Jul 18 '21 at 09:54