1

currently I have a data frame that is segmented based on state, so for example:

  STATE     
1 Connecticut
2 Maine
3 New York
4 Pennsylvania

I want to be able to segment these states based on region, and would want a column based on that:

  STATE        REGION
1 Connecticut  New England
2 Maine        New England
3 New York     Mid-Atlantic
4 Pennsylvania Mid-Atlantic

Currently, I have this:

ne <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont")
values <- c("New England")
df$REGION <- values[match(jails_cleaned$STATE, ne)]

midatlantic <- c("New Jersey", "New York", "Pennsylvania")
values2 <- c("Mid-Atlantic")
df$REGION <- values[match(df$STATE, midatlantic)]

but when running this, it only matches the New England values and not Mid-Atlantic.

Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140
Joseph Li
  • 11
  • 1

1 Answers1

0

Create a dataframe that you will merge to your jails_cleaned data

df_stats <- data.frame(
  'STATE' = c(
    c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont"),
    c("New Jersey", "New York", "Pennsylvania")),
  'REGION' = c(rep("New England",6), rep("Mid-Atlantic",3))
)

df_stats
          STATE       REGION
1   Connecticut  New England
2         Maine  New England
3 Massachusetts  New England
4 New Hampshire  New England
5  Rhode Island  New England
6       Vermont  New England
7    New Jersey Mid-Atlantic
8      New York Mid-Atlantic
9  Pennsylvania Mid-Atlantic

merge(jails_cleaned,df_stats, by = 'STATE')

Update

If you want to perform a left join, i.e. keep every rows of jails_cleaned, use all.x = TRUE argument

merge(jails_cleaned,df_stats, by = 'STATE', all.x = TRUE)
linog
  • 5,786
  • 3
  • 14
  • 28
  • When I take into account every region and merge it with my jails_cleaned dataframe that consists of 841 rows, the output only returns 98 rows of data, discounting certain states/regions – Joseph Li May 11 '20 at 20:28
  • That's because `merge` performs by default a inner join. Add `all.x = TRUE` to perform a left join (see update) – linog May 11 '20 at 20:46
  • Hi, sorry for all the questions! I initially performed the operations you listed above and created df_stats, a dataframe with 51 rows (for each state + DC, and the corresponding region). However, when I merge using a left join, many regions are output as NA values. – Joseph Li May 11 '20 at 21:01
  • You have something (or several things) that provoke mismatch. To investigate, you can do `jails_cleaned[!(jails_cleaned$STATE %in% df_stats$STATE), ]` and `df_stats[!(df_stats$STATE %in% jails_cleaned$STATE), ]` and look at the keys that don't match. For instance, problem with capital letters ? – linog May 11 '20 at 21:09
  • You can also perform an `anti-join` using `data.table`. It's more powerful but it's more advanced practice (see [here](https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html)) – linog May 11 '20 at 21:11