How to narrow down data frame in R

Question

Pardon my less-than-perfect title, but I'm having some issues grasping this.

So here's the manually created data. There are three fields: state, codetype, and code. The reason for this is that I am trying to join a more expansive version of this to a data frame of 1.6 million rows, and I am running out of memory. My thinking is that I could greatly reduce the number of rows in this table, industry.

state <- c(32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32)
codetype <- c(10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10)
code <- c(522,523,524,532,533,534,544,545,546,551,552,552,561,562,563,571,572,573,574)



industry = data.frame(state,codetype,code)

The desired result would be a two-fold operation. First, I would shorten the six-digit codes to their first two digits (the sample codes above have only three digits). That is done via:

industry <- industry %>% mutate(twodigit = substr(code, 1, 2))

This would produce a fourth column, twodigit. At present there are 19 rows, but only 6 unique values of twodigit: 52, 53, 54, 55, 56, 57. How would I tell it to remove all rows with a repeated twodigit value?

Do you need `industry %>% distinct(twodigit, .keep_all = TRUE)` — akrun, Jun 11 '21 at 00:08
@akrun, write this as an answer. Yes, it worked and thanks for your assistance. — Tim Wilcox, Jun 11 '21 at 00:11

score 2 · Accepted Answer · answered Jun 11 '21 at 00:12

We can use distinct() and set .keep_all = TRUE so that all the other columns are kept; the first row for each twodigit value is retained.

library(dplyr)
industry %>%
   distinct(twodigit, .keep_all = TRUE)

Another option would be to use duplicated() inside filter():

industry %>%
    filter(!duplicated(twodigit))
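
As a quick check, here is a minimal self-contained sketch, assuming the sample data from the question, that runs both approaches and stores the results so they can be compared. The names by_distinct and by_duplicated are only illustrative.

```r
library(dplyr)

# Sample data from the question (three-digit codes stand in for the real ones)
industry <- data.frame(
  state    = rep(32, 19),
  codetype = rep(10, 19),
  code     = c(522, 523, 524, 532, 533, 534, 544, 545, 546, 551,
               552, 552, 561, 562, 563, 571, 572, 573, 574)
)

# substr() coerces the numeric code to character before slicing
industry <- industry %>% mutate(twodigit = substr(code, 1, 2))

# Approach 1: keep the first row for each distinct twodigit value
by_distinct <- industry %>% distinct(twodigit, .keep_all = TRUE)

# Approach 2: drop every row whose twodigit value has already appeared
by_duplicated <- industry %>% filter(!duplicated(twodigit))
```

Both results should contain one row per twodigit value, namely the first row in which that value appears.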

To make this more efficient on large data, a data.table approach may help:

library(data.table)
setDT(industry)[!duplicated(substr(code, 1, 2))]
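
Note that setDT() converts industry to a data.table by reference, so the original object itself changes class. If that is not wanted, a possible variation is to build a separate data.table and use unique() with its by argument. This is a sketch on the question's sample data; industry_dt and first_per_code are illustrative names, not part of the original answer.

```r
library(data.table)

# Separate data.table built from the question's sample data;
# the length-1 state and codetype values are recycled to 19 rows
industry_dt <- data.table(
  state    = 32,
  codetype = 10,
  code     = c(522, 523, 524, 532, 533, 534, 544, 545, 546, 551,
               552, 552, 561, 562, 563, 571, 572, 573, 574)
)

# Add the two-digit prefix as a new column, by reference
industry_dt[, twodigit := substr(code, 1, 2)]

# Keep the first row for each twodigit value
first_per_code <- unique(industry_dt, by = "twodigit")
```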

score 1 · Answer 2 · answered Jun 11 '21 at 01:13

Using a unique() approach:

library(tidyverse)

state <- c(32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32,32)
codetype <- c(10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10)
code <- c(522,523,524,532,533,534,544,545,546,551,552,552,561,562,563,571,572,573,574)
industry = data.frame(state,codetype,code)
industry<-industry %>% mutate(twodigit = substr(code,1,2))


unique(industry$twodigit) %>%
    map_dfr(~filter(industry, twodigit == .x)[1, ])
#>   state codetype code twodigit
#> 1    32       10  522       52
#> 2    32       10  532       53
#> 3    32       10  544       54
#> 4    32       10  551       55
#> 5    32       10  561       56
#> 6    32       10  571       57

Created on 2021-06-10 by the reprex package (v2.0.0)
