
I'm having an odd issue with FBI crime data. Some cities/towns share a name within the same state, so the county is given as a way to tell them apart. For the years 2003-2017, roughly 1700 rows also have a county. However, when I try to join this dataset with another dataset, or even filter by county (for instance, COUNTY == "york county"), I get only six rows when I should be getting 48. I've made the values all lowercase, tried trimming (in case there was whitespace), and run as.character(), but I still get the same behavior. It's weird that it returns a handful of rows, but not all of them. Any ideas?

If I try running

data%>%filter(COUNTY=="adams county")

it will only return two values: conewago and cumberland.

I used the following code to separate the rows that have a county from those that don't (which have NA instead). Then I make sure the whitespace is removed.

crime.06_17.slice <- crime.06_17%>%arrange(COUNTY)%>%slice(1:1758)
crime.06_17.slice$COUNTY <- trimws(crime.06_17.slice$COUNTY, which = c("both"), whitespace = "[\t\r\n]")
structure(list(CITY = c("washington", "conewago", "conewago", 
"cumberland", "conewago", "cumberland", "liberty", "conewago", 
"liberty", "conewago", "cumberland", "liberty", "conewago", "cumberland", 
"liberty", "conewago", "cumberland", "liberty", "conewago", "cumberland", 
"conewago", "cumberland", "conewago", "cumberland", "conewago", 
"cumberland", "conewago", "cumberland", "liberty", "cumberland"
), COUNTY = c("  mercer county", " adams county", " adams county", 
" adams county", " adams county", " adams county", " adams county", 
" adams county", " adams county", " adams county", " adams county", 
" adams county", " adams county", " adams county", " adams county", 
" adams county", " adams county", " adams county", " adams county", 
" adams county", " adams county", " adams county", " adams county", 
" adams county", " adams county", " adams county", " adams county", 
" adams county", " adams county", " adams township"), CRIME_VIOLENT = c(8, 
6, 4, 4, 3, 1, 0, 3, 1, 3, 2, 2, 1, 1, 1, 8, 3, 0, 6, 3, 3, 2, 
4, 3, 5, 5, 5, 5, 0, 1), CRIME_PROPERTY = c(125, 64, 92, 35, 
98, 47, 4, 125, 29, 113, 43, 24, 90, 55, 15, 84, 66, 20, 89, 
52, 48, 49, 54, 53, 48, 38, 30, 41, 11, 23), CRIME_TOTAL = c(133, 
70, 96, 39, 101, 48, 4, 128, 30, 116, 45, 26, 91, 56, 16, 92, 
69, 20, 95, 55, 51, 51, 58, 56, 53, 43, 35, 46, 11, 24), year = c(2005, 
2006, 2007, 2007, 2008, 2008, 2008, 2009, 2009, 2010, 2010, 2010, 
2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2014, 2014, 2015, 
2015, 2016, 2016, 2017, 2017, 2017, 2009), STATE = c("new jersey", 
"pennsylvania", "pennsylvania", "pennsylvania", "pennsylvania", 
"pennsylvania", "pennsylvania", "pennsylvania", "pennsylvania", 
"pennsylvania", "pennsylvania", "pennsylvania", "pennsylvania", 
"pennsylvania", "pennsylvania", "pennsylvania", "pennsylvania", 
"pennsylvania", "pennsylvania", "pennsylvania", "pennsylvania", 
"pennsylvania", "pennsylvania", "pennsylvania", "pennsylvania", 
"pennsylvania", "pennsylvania", "pennsylvania", "pennsylvania", 
"pennsylvania")), row.names = c(NA, -30L), class = c("tbl_df", 
"tbl", "data.frame"))

James
  • "Any ideas?" would be easier to answer if you gave a [mcve]. See [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/4996248) for some tips how to do so. As it is, your question is impossible to answer since we have no useful knowledge about what your data looks like and just what you are doing with it. – John Coleman Aug 29 '19 at 17:23
  • Jimmy, if this is the same sample data you provided in [your previous question](https://stackoverflow.com/questions/56812571), please copy that block over here (and don't just link to it). It would also help to know what your other dataset (to be merged) looks like. – r2evans Aug 29 '19 at 17:30
  • Did you get any errors or warnings when you ran the commands? Can you show the actual commands you ran? If they started as factors, you may have been unsuccessful in your modifications, then you ran `as.character()` which would (too late) make the modifications possible. But something is weird in your data or your code, and we can't help you much unless you show us a little bit of both. – Gregor Thomas Aug 29 '19 at 17:40
  • Sorry for not including a reproducible example; I thought maybe this was an issue someone has run into before. Now it's included. – James Aug 29 '19 at 21:29
  • Thank you for posting some data. I voted to reopen. It probably is an issue that someone has run into. Probably all serious R users have run into a problem where filtering a dataframe in some way results in far fewer rows than anticipated. The problem is that without knowing more, it is impossible to say just why it is happening to you. There can be multiple reasons behind that sort of behavior. – John Coleman Aug 30 '19 at 13:21
  • Thanks, John. Yep, I see now why a reproducible example was important in this case. @Gregor, thanks much for your response. It solved my problem. A good life lesson that things aren't always what they appear to be. Perhaps it also points to another use of dput, as a diagnostic? Is there another way of "looking at the data to make sure it is what you think it is"? If you've found any webpages, etc. on best practices for diagnosing these kinds of issues, it would be really helpful if you could send them my way! – James Aug 30 '19 at 18:04
  • What I recommend in the answer is printing the unique data values to the console: `unique(data$COUNTY)` is a lot more focused and easier to read than a full `dput()`, making the issue readily apparent. I just mentioned the `dput` because that's what you shared, and so skimming the question the issue is visible. – Gregor Thomas Aug 30 '19 at 18:48
  • In general, *most* bugs can be conceptualized as a bad assumption. You assume your data is one way, and it's not. So you need to figure out what assumptions you are making and how to test them. Here, you correctly identified that whitespace was the problem, and you tried to fix it. The only issue is that you assumed your fix worked, when it didn't. You didn't verify the fix, and you continued to assume it worked, despite the error message. So the lessons are (a) don't ignore errors and (b) get more rigorous in testing your assumptions. – Gregor Thomas Aug 30 '19 at 18:51

1 Answer

If you look at the data in your dput, you can see that all your Adams County entries have a leading space: " adams county".

You should trim the whitespace. Since you say you've tried that, make sure you assign the modified (trimmed) result, and verify it. (N.B. when you verify something, look at the actual data to make sure it is what you think it is. In your post, you say "Then I make sure the white space is removed," but evidently that was not successful.)

data = mutate(data, COUNTY = trimws(COUNTY))
unique(data$COUNTY) # make sure this looks right

# now the filter will work as expected
data %>% filter(COUNTY == "adams county")
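
To make the "verify it" step concrete, here is a minimal base R sketch. It uses a handful of COUNTY values copied from the dput in the question; the names county and county_clean are made up for illustration, and the same checks should carry over to the full data frame.

```r
# A few COUNTY values copied from the question's dput (note the leading spaces).
county <- c("  mercer county", " adams county", " adams county", " adams township")

# Exact matching finds nothing because of the leading space.
sum(county == "adams county")          # 0

# Diagnostics that make the invisible padding visible.
unique(county)                         # printed quotes reveal the leading spaces
table(nchar(county))                   # "adams county" should be 12 characters, not 13

# Trim, ASSIGN the result, then verify before relying on it.
county_clean <- trimws(county)
stopifnot(!any(grepl("^[[:space:]]|[[:space:]]$", county_clean)))
sum(county_clean == "adams county")    # 2
```

The stopifnot() line turns the assumption "the whitespace is gone" into a check that fails loudly if the trim did not do what you expected.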

Why didn't your attempt work?

In R versions before 3.6.0, trimws takes two arguments, x and which. You give it three arguments, which causes an error:

trimws(data$COUNTY, which = c("both"), whitespace = "[\t\r\n]")
# Error in trimws(data$COUNTY, which = c("both"), whitespace = "[\t\r\n]") : 
#   unused argument (whitespace = "[\t\r\n]")

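When there's an error, the code does not execute. (This is different from a warning, where the code executes but tells you something seems like it might be wrong.) So, because you added the extra argument whitespace = "[\t\r\n]", your code did not run. If you delete that argument, the error will go away and your code will probably work just fine. (On R 3.6.0 and later, trimws does accept a whitespace argument, so the call runs without an error. However, the pattern "[\t\r\n]" leaves out the space character, so leading spaces are not removed. The fix is the same: drop the argument and use the default, "[ \t\r\n]".)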
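
If you are on R 3.6.0 or later (an assumption; the check on the first line stops the script otherwise), the call from the question runs without error but silently does nothing to plain spaces. A small sketch with two values from the dput shows the difference; x and kept are illustrative names.

```r
# Assumption: R >= 3.6.0, where trimws() has a `whitespace` argument.
stopifnot(getRversion() >= "3.6.0")

x <- c("  mercer county", " adams county")

# Pattern from the question: tab, carriage return and newline only.
# The space character is not in the class, so leading spaces survive.
kept <- trimws(x, which = "both", whitespace = "[\t\r\n]")
identical(kept, x)     # TRUE: nothing was trimmed

# The default pattern, "[ \t\r\n]", includes the space character.
trimws(x)              # "mercer county" "adams county"
```

Either way, the practical advice is the same: call trimws(COUNTY) with the defaults and check the result with unique().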

Gregor Thomas