0

So I have two data sets,

one is US Census data which is saved as data frame "US_Census_county" has columns Year,State,County,Census

and the other is Mass Shooting Data saved as data frame "GVA_MassShooting" which has Year,State,County,num_killed,num_injured.

I want to merge them within R and create a new data frame. I'm in progress of learning R on DataCamp this week so I am still very new to this.

I tried

merge(US_Census_county,GVA_MassShooting,all=TRUE)

but this created a dataframe where I think it was just one on top of the other as when i checked it was NAs for the extra columns

imy
  • 1

2 Answers2

0

See if this works for you:

library(dplyr)
inner_join(US_Census_county,GVA_MassShooting)
itsMeInMiami
  • 2,324
  • 1
  • 13
  • 34
  • Thanks for the suggestion. I had a go and it produced a blank table. – imy Jul 28 '20 at 21:00
  • That likely means that at least one of the matching column names has different data. Llikely the problem is with capitalization. Check the syntax for inner_join by typing `?inner_join` and you will see the syntax to specify what variables it should use to find matching records. Try to have it join on a single variable. Then add a second until suddenly it stops being able to make a match. – itsMeInMiami Jul 28 '20 at 21:10
  • got it to work now, thanks! – imy Jul 29 '20 at 10:46
0

Setting all = TRUE with merge() is equivalent to asking for a (full) outer join.

One reason that you might be seeing these data frames 'stacked' could be because merge() cannot uniquely identify rows. Is it possible that the shared columns between your data frames (e.g. Year, State, and County) do not uniquely identify rows? For example, could there be multiple shootings in a year, i.e. multiple rows in your GVA_MassShooting with the same Year-State-County combination?

As for the NAs, could their be slight differences in how Year, State, or County are represented between the two data.frames? For example, leading or trailing white space after string names (e.g. "OR" == "OR " will return FALSE) will cause a mismatch and result in NA values.

CzechInk
  • 443
  • 2
  • 13