Failed two methods to subset dataset with R, requesting assistance

Question

I am trying to subset some data in R (the open-source statistical programming language). I have tried two methods, and both fail: one returns a table with no data, and the other returns a table of all "NA" cells, though apparently with the correct dimensions.

The code below is commented and laid out in sections:

First, I create the list of zip codes I'll use to subset the data. It comes from a dataset I'll be using and is called "zipCodesOfData".
Next, I download the crime data I'll be subsetting and keep only the columns I need.
Last, in section three, I try both %in% and dplyr's filter() to filter the crime data against the zip codes.

Unfortunately, neither method works. I was hoping someone might be able to point out my mistakes or recommend a different subsetting method for the third section.

(As an aside, in section two, I try to turn the list into a data frame, but it does not work. I'm curious as to why, if anyone can shed light on this for me.)

Thanks for your time & assistance!

####
#### Section zero: references and dependencies
####
# r's "choroplethr" library creator's blog for reference:
# http://www.arilamstein.com/blog/2015/06/25/learn-to-map-census-data-in-r/
# http://stackoverflow.com/questions/30787877/making-a-zip-code-choropleth-in-r-using-ggplot2-and-ggmap
# 
# library(choroplethr)
# library(choroplethrMaps)
# library(ggplot2)
# # use the devtools package from CRAN to install choroplethrZip from github
# # install.packages("devtools")
# library(devtools)
# install_github('arilamstein/choroplethrZip')
# library(choroplethrZip)
# library(data.table)
# 
####
#### Section one: the data set providing the zipcode we'll use to subset the crime set
####
austin2014_data_raw <- fread('https://data.austintexas.gov/resource/hcnj-rei3.csv')
names(austin2014_data_raw)
nrow(austin2014_data_raw)
## clean up: make any blank cells in column ZipCode say "NA" instead -> source:  http://stackoverflow.com/questions/12763890/exclude-blank-and-na-in-r
austin2014_data_raw[austin2014_data_raw$ZipCode==""] <- NA
# keep only rows that do not have "NA"
austin2014_data <- na.omit(austin2014_data_raw)
nrow(austin2014_data) # now there's one less row.

# selecting the first column, which is ZipCode
zipCodesOfData <- austin2014_data[,1]
View(zipCodesOfData)
# Now we have the zipcodes we need: zipCodesOfData

####
#### Section two: Crime data
####
# Crime by zipcode: https://data.austintexas.gov/dataset/Annual-Crime-2014/7g8v-xxja
#   (visualized: https://data.austintexas.gov/dataset/Annual-Crime-2014/8mst-ed5t )
# https://data.austintexas.gov/resource/<insertResourceNameHere>.csv  w/ resource "7g8v-xxja"

austinCrime2014_data_raw <- fread('https://data.austintexas.gov/resource/7g8v-xxja.csv')
View(austinCrime2014_data_raw)
nrow(austinCrime2014_data_raw)

# First, let's remove the data we don't need
names(austinCrime2014_data_raw)
columnSelection_Crime <- c("GO Location Zip", "GO Highest Offense Desc", "Highest NIBRS/UCR Offense Description")
austinCrime2014_data_selected_columns <- subset(austinCrime2014_data_raw, select=columnSelection_Crime)
names(austinCrime2014_data_selected_columns)
nrow(austinCrime2014_data_selected_columns)


####
#### Section Three: The problem: I am unable to make subsets with the two following methods.
####
# Neither of these methods works:

# Attempt 1:

austinCrime2014_data_selected_columns <- austinCrime2014_data_selected_columns[austinCrime2014_data_selected_columns$`GO Location Zip` %in% zipCodesOfData , ]
View(austinCrime2014_data_selected_columns) # No data in the table

# Attempt 2:

# This initially told me an error:
# Then, I installed dplyr and the error went away.  
library(dplyr)
# However, it still doesn't create anything-- just an empty set w/ headers
austinCrime2014_data_selected_zips <- filter(austinCrime2014_data_selected_columns, `GO Location Zip` %in% zipCodesOfData)
View(austinCrime2014_data_selected_zips)

I edited out this section, after realizing it was unnecessary.

####
####  Bad section
####
nrow(austinCrime2014_data_selected_columns)

# Then, let's keep only the zipcodes we need
# doesn't work: austinCrime2014_data_selected_columns_df <- data.frame(austinCrime2014_data_selected_columns)
# typeof(austinCrime2014_data_selected_columns_df)

austinCrime<-do.call("rbind", austinCrime2014_data_selected_columns)
austinCrime_needsTranspose <-as.data.frame(austinCrime)
austinCrime <- t(austinCrime_needsTranspose)
typeof(austinCrime)
View(austinCrime)
names(austinCrime)

####
####  Bad section
####

`austinCrime_df` is a matrix – yeedle Apr 09 '17 at 23:21
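
A minimal sketch of what the comment above points at, using a made-up two-row table rather than the real crime data (the `crime` table and its values are assumptions for illustration). t() on a data frame returns a matrix, so names() is NULL afterwards. It also shows that typeof() reports "list" for any data frame, which may be why the earlier data.frame() conversion looked as if it had failed; is.data.frame() is the more direct check.

```r
# Hypothetical toy table standing in for the selected crime columns.
crime <- data.frame(zipcode = c("78701", "78704"),
                    offense = c("THEFT", "BURGLARY"))

# A data frame is stored as a list of columns, so typeof() says "list"
# even when the conversion to a data frame succeeded.
type_of_df <- typeof(crime)
is_df <- is.data.frame(crime)

# t() converts its input to a matrix before transposing.
transposed <- t(crime)
is_mat <- is.matrix(transposed)
matrix_names <- names(transposed)    # a matrix has dimnames, not names
matrix_rows <- rownames(transposed)  # the old column names become row names

print(c(type_of_df = type_of_df, is_df = is_df, is_mat = is_mat))
print(matrix_rows)
```

If a data frame is needed again after transposing, as.data.frame(transposed) converts the matrix back, although a transpose is not needed for subsetting by zip code at all.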

score 2 · Accepted Answer · answered Apr 09 '17 at 23:41

I think readr and dplyr can solve your problem. It's simple:

library(readr)
library(dplyr)

### SECTION 1

# Import data
austin2014_data_raw <- read_csv('https://data.austintexas.gov/resource/hcnj-rei3.csv', na = '')
glimpse(austin2014_data_raw)
nrow(austin2014_data_raw)

# Remove NAs
austin2014_data <- na.omit(austin2014_data_raw)
nrow(austin2014_data) # now there's one less row.

# Get zip codes
zipCodesOfData <- austin2014_data$`Zip Code`

### SECTION 2

# Import data
austinCrime2014_data_raw <- read_csv('https://data.austintexas.gov/resource/7g8v-xxja.csv', na = '')
glimpse(austinCrime2014_data_raw)
nrow(austinCrime2014_data_raw)

# Select and rename required columns
columnSelection_Crime <- c("GO Location Zip", "GO Highest Offense Desc", "Highest NIBRS/UCR Offense Description")
austinCrime_df <- select(austinCrime2014_data_raw, all_of(columnSelection_Crime))  # all_of() supersedes one_of() in current dplyr
names(austinCrime_df) <- c("zipcode", "highestOffenseDesc", "NIBRS_OffenseDesc")
glimpse(austinCrime_df)
nrow(austinCrime_df)

### SECTION 3

# Filter by zipcode
austinCrime2014_data_selected_zips <- filter(austinCrime_df, zipcode %in% zipCodesOfData)
glimpse(austinCrime2014_data_selected_zips)
nrow(austinCrime2014_data_selected_zips)

Here I used read_csv() from the readr package to import data, and the subset methods select() and filter() from the dplyr package to get the required columns and rows.
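
A likely reason the original attempts came back empty, sketched on made-up zip codes. It assumes that `austin2014_data[,1]` returned a one-column table rather than a vector, as recent data.table versions and tibbles do. The toy tables and their values are hypothetical, and base R data frames with drop = FALSE stand in for that behaviour. Extracting the column with $, as in the code above, gives %in% a plain vector to match against.

```r
# Toy stand-ins for the two tables; the values are made up for illustration.
zips <- data.frame(`Zip Code` = c(78701, 78702, 78704), check.names = FALSE)
crime <- data.frame(zipcode = c(78701, 78705, 78704, 78753),
                    offense = c("THEFT", "BURGLARY", "THEFT", "ASSAULT"))

# A one-column table is a list holding one vector. %in% treats that list as a
# single element (the whole column), so no individual zip code ever matches.
# drop = FALSE mimics how data.table and tibble keep a table for [, 1].
zips_as_table <- zips[, 1, drop = FALSE]
in_table <- crime$zipcode %in% zips_as_table

# Extracting the column with $ gives a plain vector, matched element by element.
zips_as_vector <- zips$`Zip Code`
in_vector <- crime$zipcode %in% zips_as_vector

crime_subset <- crime[in_vector, ]

print(in_table)
print(in_vector)
print(crime_subset)
```

Checking class(zipCodesOfData) or str(zipCodesOfData) before filtering is a quick way to see which of the two cases applies.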

You're welcome! I'm a big fan of `dplyr` and all other [tidyverse](http://tidyverse.org/) facilities. — David Pinto, Apr 10 '17 at 00:14

score 1 · Answer 2 · answered Apr 09 '17 at 23:38

I'm not sure why you're do.calling and transposing your data. You can just use something like dplyr's semi_join to get only the zipcodes you want:

library(data.table)
library(dplyr)
#> -------------------------------------------------------------------------
#> data.table + dplyr code now lives in dtplyr.
#> Please library(dtplyr)!
#> -------------------------------------------------------------------------
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
zipCodesOfData <- fread('https://data.austintexas.gov/resource/hcnj-rei3.csv') %>%
  mutate(`Zip Code` = ifelse(`Zip Code` == "", NA, `Zip Code`)) %>%
  na.omit() %>% 
  select(`Zip Code`)

austinCrime2014_data_raw <- fread('https://data.austintexas.gov/resource/7g8v-xxja.csv') %>%
  select(`GO Location Zip`, `GO Highest Offense Desc`, `Highest NIBRS/UCR Offense Description`) %>%
  semi_join(zipCodesOfData, by = c("GO Location Zip" = "Zip Code")) %>%
  rename(zipcode = `GO Location Zip`, 
         highestOffenseDesc = `GO Highest Offense Desc`, 
         NIBRS_OffenseDesc = `Highest NIBRS/UCR Offense Description`)
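
To make the behaviour of semi_join() concrete, here is a small sketch on made-up data (the tables and values are hypothetical; only the column names follow the code above). It assumes dplyr is installed. semi_join() keeps the rows of the first table that have a match in the second, without adding columns or duplicating rows, whereas inner_join() repeats a row once per match.

```r
library(dplyr)

# Toy zip code table with a deliberately duplicated entry.
zips <- data.frame(`Zip Code` = c("78701", "78704", "78704"),
                   check.names = FALSE)

# Toy crime table; two of the four zip codes appear in `zips`.
crime <- data.frame(
  `GO Location Zip` = c("78701", "78705", "78704", "78753"),
  `GO Highest Offense Desc` = c("THEFT", "BURGLARY", "THEFT", "ASSAULT"),
  check.names = FALSE
)

# Filtering join: each matching crime row is kept once, columns unchanged.
kept <- semi_join(crime, zips, by = c("GO Location Zip" = "Zip Code"))

# Mutating join for comparison: the 78704 row appears once per match in `zips`.
joined <- inner_join(crime, zips, by = c("GO Location Zip" = "Zip Code"))

print(kept)
print(nrow(joined))
```

Because the key columns are compared by value, both should have the same type (here character) before joining.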

Thanks! I didn't realize I could do this while making the file download call! I'll have to look into dplyr! Yeah, after posting I ended up removing the parts you reference. However, I added them back on so as not to confuse anyone about what you reference in my original post. — Patrick Meaney, Apr 09 '17 at 23:55
