Merging code values of categories to dataset in R

Question

I have a political donation dataset that holds industry categories in alphanumeric codes. A separate text document has a listing for how these alphanumeric codes translate into industry name, sector name, and category-of-industry name.

For instance, "A1200", is the Crop Production Industry, in the Agribusiness sector, in the Sugar cane industry category. I'd like to know how to pair the alphanumeric codes with their respective sector, industry, and category values in separate columns.

Right now, the code values dataset is in

    Catcode Catname     Catorder    Industry             Sector      
    A1200   Sugar cane  A01         Crop Production  Agribusiness

and this industry donation dataset:

Business name    Amount donated    Year   Category
Sarah Farms      1000              2010   A1200

The category dataset is about 444 rows and the donation set is about 1M rows. How do I feel out the donation dataset so it looks like this. Category would be the common name

    Catcode Catname     Catorder    Industry             Sector          Business name    Amount donated    Year   Category
    A1200   Sugar cane  A01         Crop Production  Agribusiness     Sarah Farms      1000              2010   A1200

I'm a little new to these forums, so if there was a better way to ask this question, please let me know. Thank you for the help!

Try the `merge()` function with the `by.x` and `by.y` arguments. See also http://stackoverflow.com/q/5963269/946850 for improving the question. — krlmlr, Dec 02 '14 at 02:51

score 2 · Accepted Answer · answered Dec 02 '14 at 03:52

If speed is a matter, you may want to use data.table or dplyr. Here, I modified your sample data a bit to provide some ideas.

df1 <- data.frame(Catcode = c("A1200", "B1500", "C1800"),
                  Catname = c("Sugar", "Salty", "Butter"),
                  Catorder = c("cane A01", "cane A01", "cane A01"),
                  Industry = c("Crop Production", "Crop Production", "Crop Production"),
                  Sector = c("Agribusiness", "Agribusiness", "Agribusiness"),
                  stringsAsFactors = FALSE)

#  Catcode Catname Catorder        Industry       Sector
#1   A1200   Sugar cane A01 Crop Production Agribusiness
#2   B1500   Salty cane A01 Crop Production Agribusiness
#3   C1800  Butter cane A01 Crop Production Agribusiness

df2 <- data.frame(BusinessName = c("Sarah Farms", "Ben Farms"),
                  AmountDonated = c(100, 200),
                  Year = c(2010, 2010),
                  Category = c("A1200", "B1500"),
                  stringsAsFactors = FALSE)

#  BusinessName AmountDonated Year Category
#1  Sarah Farms           100 2010    A1200
#2    Ben Farms           200 2010    B1500

library(dplyr)
library(data.table)

# 1) dplyr option
# Catcode C1800 will be dropped since it does not exist in both data frames.
inner_join(df1, df2, by = c("Catcode" = "Category"))

#      Catcode Catname Catorder        Industry       Sector BusinessName AmountDonated Year
#1   A1200   Sugar cane A01 Crop Production Agribusiness  Sarah Farms           100 2010
#2   B1500   Salty cane A01 Crop Production Agribusiness    Ben Farms           200 2010

# Catcide C1800 remains
left_join(df1, df2, by = c("Catcode" = "Category"))

#      Catcode Catname Catorder        Industry       Sector BusinessName AmountDonated Year
#1   A1200   Sugar cane A01 Crop Production Agribusiness  Sarah Farms           100 2010
#2   B1500   Salty cane A01 Crop Production Agribusiness    Ben Farms           200 2010
#3   C1800  Butter cane A01 Crop Production Agribusiness         <NA>            NA   NA

# 2) data.table option
# Convert data.frame to data.table
setDT(df1)
setDT(df2)

#Set columns for merge
setkey(df1, "Catcode")
setkey(df2, "Category")

df1[df2]

#   Catcode Catname Catorder        Industry       Sector BusinessName AmountDonated Year
#1:   A1200   Sugar cane A01 Crop Production Agribusiness  Sarah Farms           100 2010
#2:   B1500   Salty cane A01 Crop Production Agribusiness    Ben Farms           200 2010

df2[df1]
#   BusinessName AmountDonated Year Category Catname Catorder        Industry       Sector
#1:  Sarah Farms           100 2010    A1200   Sugar cane A01 Crop Production Agribusiness
#2:    Ben Farms           200 2010    B1500   Salty cane A01 Crop Production Agribusiness
#3:           NA            NA   NA    C1800  Butter cane A01 Crop Production Agribusiness

score 0 · Answer 2 · edited Dec 02 '14 at 03:17

0

I think you're asking how to query that.. aren't you?

SELECT * 
FROM
code values dataset(your table for this) a
LEFT JOIN industry donation dataset(your table for this) b 
ON a.CatCode = b.Category

edited Dec 02 '14 at 03:17

Barry

286,269
29
621
977

answered Dec 02 '14 at 02:57

Sniper30

59
1
1
2

score 0 · Answer 3 · answered Dec 02 '14 at 02:59

As krlmlr suggested:

> merge(df1, df2, by.x = "Catcode", by.y = "Category", all = T)
  Catcode    Catname Catorder        Industry       Sector Business_name Amount_donated Year
1   A1200 Sugar_cane      A01 Crop_Production Agribusiness   Sarah_Farms           1000 2010

But you should avoid whitespaces in the column names and the values. I replaced them with _

Merging code values of categories to dataset in R

3 Answers3