1

I currently have a product table and an ingredient table, as follows:

Product (actual: 13 columns, truncated for ease of explanation):

Product_code  Product_name  Ingredient  Brand_name
A123          Product1      NA          BrandA
B123          Product2      NA          BrandB
C123          Product3      NA          BrandC

Ingredient (actual has 2 columns, as shown):

Product_code  Ingredient
A123          Ing1
A123          Ing2
B123          IngA
B123          IngB
B123          IngC

The aim is "fill in" the ingredient details for each product in the "Product" table such that it appears as such:

Combined table:

Product_code  Product_name  Ingredient  Brand_name
A123          Product1      Ing1        BrandA
A123          Product1      Ing2        BrandA
B123          Product2      IngA        BrandB
B123          Product2      IngB        BrandB
B123          Product2      IngC        BrandB
C123          Product3      NA          BrandC

As some products have one ingredient while others have multiple, I want to be able to duplicate new rows for products with multiple ingredients (all other column values to retain the same except for the "Ingredient" column). Products without data in the "Ingredients" table will remain as "NA".

I have tried using dplyr::right_join but it left me with "NA" values for all the other columns other than "Product_code" and "Ingredient". Hoping to get some advice on how to proceed with this. Thanks!

Desmond
  • 49
  • 4

2 Answers2

3

I think you are looking for full_join. It is also OK to remove the Ingredient column in the Product data frame as it does not provide information of the ingredient.

library(dplyr)

combined <- Ingrident %>%
  full_join(Product %>% select(-Ingredient), by = c("Product_code"))
combined
#   Product_code Ingredient Product_name Brand_name
# 1         A123       Ing1     Product1     BrandA
# 2         A123       Ing2     Product1     BrandA
# 3         B123       IngA     Product2     BrandB
# 4         B123       IngB     Product2     BrandB
# 5         B123       IngC     Product2     BrandB
# 6         C123       <NA>     Product3     BrandC

DATA

Product <- read.table(text = "Product_code  Product_name  Ingredient  Brand_name
A123          Product1      NA          BrandA
B123          Product2      NA          BrandB
C123          Product3      NA          BrandC",
                  header = TRUE, stringsAsFactors = FALSE)

Ingredient <- read.table(text = "Product_code  Ingredient
A123          Ing1
A123          Ing2
B123          IngA
B123          IngB
B123          IngC",
                        header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84
  • c <- merge(Ingredient,Product,by="Product_code",all=TRUE) – Prahlad Apr 07 '20 at 06:49
  • @Prahlad Thanks for the base R solution. Two suggestion. 1. Do not use an object name as `c` because `c` is a function in R and that causes confusions. 2. Remove the `Ingredient` column in the `Product` data frame before the merge otherwise there will be duplicated `Ingredient` column with `.x` and `.y` in the column names. – www Apr 07 '20 at 06:55
  • sd <- merge(Ingredient,Product[,c("Product_code","Product_name","Brand_name")],by="Product_code",all=TRUE) – Prahlad Apr 07 '20 at 07:01
  • @Prahlad Thanks. But `sd` is still a function in R. – www Apr 07 '20 at 07:08
  • 1
    my bad, it is standard deviation,,,d <- merge(Ingredient,Product[,c("Product_code","Product_name","Brand_name")],by="Product_code",all=TRUE) – Prahlad Apr 07 '20 at 07:25
0

If you don't want NAs, look at dplyr::inner_join. With right join you are asking as much rows as your ingredients dataframe. Non matching keys in the left dataframe will have NAs

linog
  • 5,786
  • 3
  • 14
  • 28
  • thanks @linog! I previously tried `inner_join` but it gave me less rows than my original "Product" table, that's why I considered `right_join`. I just tried `inner_join` again and it worked! Realised I had to `group_by` "Product_code" first before joining. Thanks alot again! – Desmond Apr 07 '20 at 06:42