0

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.

What I'm trying to do is merge to datasets using two variables and keeping all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that will happen, as when I do full_join in dplyr or all = TRUE in merge is that I get double the number of rows.

Here's my data:

  Primary_State       Primary_County     n
     <fctr>               <fctr> <int>
1            AK                         12
2            AK       Aleutians West     1
3            AK            Anchorage   961
4            AK               Bethel     1
5            AK Fairbanks North Star   124
6            AK               Haines     1


  Primary_County Primary_State Population
1       Autauga             AL      55416
2       Baldwin             AL     208563
3       Barbour             AL      25965
4          Bibb             AL      22643
5        Blount             AL      57704
6       Bullock             AL      10362

So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!

EDIT: Adding code examples of what I've already described above.

This code (as well as left_join):

countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)

Produces this:

  Primary_State       Primary_County   n Population
1            AK                       12         NA
2            AK       Aleutians West   1         NA
3            AK            Anchorage 961         NA
4            AK               Bethel   1         NA
5            AK Fairbanks North Star 124         NA
6            AK               Haines   1         NA

This code:

countyPerCap <- right_join(countyLicense, countyPops)

Produces this:

  Primary_State Primary_County     n Population
      <chr>          <chr> <int>      <int>
1            AL       Autauga     NA      55416
2            AL       Baldwin     NA     208563
3            AL       Barbour     NA      25965
4            AL          Bibb     NA      22643
5            AL        Blount     NA      57704
6            AL       Bullock     NA      10362

Hope that's helpful.

EDIT: This is what happens with the following code:

countyPerCap <- merge(countyLicense, countyPops, all = TRUE)

  Primary_State  Primary_County   n Population
1            AK                  12         NA
2            AK Aleutians East   NA       3296
3            AK  Aleutians West   1         NA
4            AK Aleutians West   NA       5647
5            AK       Anchorage 961         NA
6            AK      Anchorage   NA     298192

It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?

ldlpdx
  • 61
  • 1
  • 13
  • Please see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Aurèle Nov 03 '17 at 14:55
  • Can you add some code of what you tried that failed? It'd help if you showed the result with the example you give, rather than the entire dataset. Also, I see a possible missing value in your first dataset example, which could be complicating things. – aosmith Nov 03 '17 at 14:59
  • @aosmith, I've added some code and output. It's what I had stated in my question, but providing code and output, per your request. – ldlpdx Nov 03 '17 at 16:23
  • 2
    This is the output you would expect given the 6 lines of data you've given from the two datasets - they share no state/county combinations. What do you think should fill the column when a combination of state/county isn't present? – aosmith Nov 03 '17 at 16:45
  • If missing combinations of state/county is what might be causing the problems you are running into, `anti_join` can be helpful for troubleshooting. It allows you to see what combinations are missing in one of the datasets but present in the other. – aosmith Nov 03 '17 at 16:59
  • @aosmith There are no missing combinations of the state/county data. Both datasets have all of the counties/states in the U.S. The problem is that doing either `merge` or some type of `join` operation is *erasing* data from the other variables, i.e., `n` and `Populations`. So what I need is a solution that keeps all of the data. – ldlpdx Nov 04 '17 at 10:10
  • @Neoromanzer Yes, and I've now posted above what happens. Basically it duplicates every record, retaining `n` in one row and `Population` in another. – ldlpdx Nov 04 '17 at 10:32

2 Answers2

1

We can give column names in merge by mentioning "by" in merge statement

merge(x,y, by=c(col1, col2 names))

in merge statement

0

I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)

trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops, 
                  by=c("Primary_State", "Primary_County"), copy=TRUE)

Those three lines did the trick. Thanks everyone!

ldlpdx
  • 61
  • 1
  • 13