0

I have two datasets. I want to add the wealth variable of dataset-2 to dataset-1 beside the occupation variable/column. dataset-2 was collected from households head, one response from one household. However, dataset-1 was collected from all women from the household. For this reason, dataset-1 has more responses.

dataset-1: women dataset (total 8678 responses/rows)

 Women_id household_id BMI   Education Occupation
101 1 1   101 1        22.5  1         3
101 1 1   101 2        28.6  3         5
101 1 1   101 3        19.6  2         3
101 1 1   101 3        20.1  2         2
101 1 1   101 4        26.8  3         3

dataset-2: households dataset (total 6784 responses/rows)

household_id    wealth
101 1           2                         
101 2           1
101 3           2
101 4           4

I want to add wealth variable to dataset-1 based on household_id. I have tried the marge function of dplyr

joined_df <- merge(dataset_1, dataset_2, by.x = "household_id", all.x = TRUE, all.y = FALSE )

The two datasets added, however, the value of wealth shows NA.

Women_id household_id BMI   Education Occupation  Wealth
    101 1 1   101 1        22.5  1         3      NA
    101 1 1   101 2        28.6  3         5      NA
    101 1 1   101 3        19.6  2         3      NA
    101 1 1   101 3        20.1  2         2      NA
    101 1 1   101 4        26.8  3         3      NA

I want like follows

 Women_id household_id BMI   Education Occupation  Wealth
    101 1 1   101 1        22.5  1         3       2
    101 1 1   101 2        28.6  3         5       5
    101 1 1   101 3        19.6  2         3       2
    101 1 1   101 3        20.1  2         2       2
    101 1 1   101 4        26.8  3         3       4

1 Answers1

2

Since both datasets have same household_id column, you can do so

joined_df <- dplyr::left_join(dataset_1, dataset_2, by = "household_id")

Using merge() would be

joined_df <- merge(dataset_1, dataset_2, by = "household_id", all.x = TRUE)
Ben Toh
  • 742
  • 5
  • 9