I have this dataframe:

structure(list(CATEGORY = c("Edible, Vape", "Concentrate, Flower", 
"Concentrate, Flower", "Concentrate, Flower", "Edible", "Concentrate, Flower", 
"Edible, Vape", "Edible", "Concentrate, Flower", "Concentrate, Flower", 
"Edible", "Edible", "Edible", "Concentrate, Flower", "Edible", 
"Edible", "Edible", "Edible, Vape", "Edible", "Edible", "Concentrate, Flower", 
"Edible", "Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower", 
"Edible", "Concentrate, Flower", "Concentrate, Edible, Flower", 
"Concentrate, Flower", "Edible", "Concentrate, Edible, Flower", 
"Edible", "Concentrate, Edible, Flower", "Concentrate, Edible, Flower, Vape", 
"Concentrate, Edible, Flower", "Concentrate, Flower", "Edible", 
"Edible", "Edible", "Concentrate, Edible, Flower, Vape", "Concentrate, Flower", 
"Concentrate, Flower", "Edible", "Concentrate, Flower", "Concentrate, Flower", 
"Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower", 
"Concentrate, Flower", "Edible, Vape", "Concentrate, Flower", 
"Edible, Vape", "Concentrate, Edible, Flower", "Edible, Vape", 
"Concentrate, Flower", "Edible", "Concentrate, Flower", "Concentrate, Flower", 
"Edible", "Concentrate, Flower", "Edible, Vape", "Edible", "Concentrate, Edible, Flower, Vape", 
"Edible", "Edible", "Concentrate, Flower", "Concentrate, Flower", 
"Edible, Vape", "Concentrate, Flower", "Edible", "Edible", "Edible, Vape", 
"Edible", "Edible", "Edible", "Concentrate, Flower", "Edible", 
"Edible", "Concentrate, Flower", "Edible, Vape", "Concentrate, Flower", 
"Edible", "Edible", "Edible", "Edible", "Concentrate, Flower", 
"Edible, Vape", "Edible", "Concentrate, Flower", "Edible, Vape", 
"Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower", 
"Concentrate, Flower", "Edible", "Edible", "Edible", "Edible, Vape", 
"Concentrate, Flower", "Edible")), row.names = c(NA, -100L), class = c("tbl_df", 
"tbl", "data.frame"))

Each entry in the CATEGORY column is a comma-separated list: some entries contain only one item, and others contain two, three or four. (Some are even longer; this is just a section of a bigger data frame.)

How can I filter the data frame to keep only the rows whose CATEGORY has at most a given number of items (for example, two or three)?

For example, if I type this:

unique(interesting_baskets_df$CATEGORY)

I see these categories.

[1] "Edible, Vape"                      "Concentrate, Flower"               "Edible"                            "Concentrate, Edible, Flower"      
[5] "Concentrate, Edible, Flower, Vape"

But I only want to include "Edible, Vape" or "Concentrate, Flower" or "Edible".

I know in this case I could input a specific filter in dplyr with a set of items, but my dataset is much larger and I would need a more flexible solution. I would appreciate something that would be flexible in choosing the number of items, two or three or four, since I don't exactly know what will be most useful in association rule learning.

hachiko

2 Answers

Another option is to count the commas in each entry, add 1 to get the number of items, and keep the rows with fewer than 3:

library(stringr)
library(dplyr)
filter(df, str_count(CATEGORY, ",") + 1 < 3)
#> # A tibble: 92 × 1
#>    CATEGORY           
#>    <chr>              
#>  1 Edible, Vape       
#>  2 Concentrate, Flower
#>  3 Concentrate, Flower
#>  4 Concentrate, Flower
#>  5 Edible             
#>  6 Concentrate, Flower
#>  7 Edible, Vape       
#>  8 Edible             
#>  9 Concentrate, Flower
#> 10 Concentrate, Flower
#> # … with 82 more rows

Created on 2023-01-05 with reprex v2.0.2
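
Since the question asks for a flexible number of items, the comma-counting idea could be wrapped in a small function. This is only a sketch: `filter_by_item_count`, `min_items`, `max_items` and the `demo` tibble are names made up here for illustration, and it assumes that items are always separated by commas and never contain a comma themselves.

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not part of the original answer): keep rows whose
# CATEGORY holds between min_items and max_items comma-separated entries.
filter_by_item_count <- function(data, min_items = 1, max_items = Inf) {
  data %>%
    # number of items = number of commas + 1
    filter(between(str_count(CATEGORY, ",") + 1, min_items, max_items))
}

# Small made-up data frame, just to try the helper
demo <- tibble(CATEGORY = c("Edible",
                            "Edible, Vape",
                            "Concentrate, Edible, Flower",
                            "Concentrate, Edible, Flower, Vape"))

# Keep only the entries with two or three items
two_or_three <- filter_by_item_count(demo, min_items = 2, max_items = 3)
```

With the defaults, `filter_by_item_count(df, max_items = 2)` should select the same rows as the `< 3` filter above.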

Quinten

With the regex \W+ (one or more non-word characters) and strsplit, you can filter your data by the number of words in each entry. The examples below keep the entries with fewer than three words.

With base R:

df[lengths(strsplit(df$CATEGORY, "\\W+")) < 3, ]

Or dplyr:

library(dplyr)
df %>% filter(lengths(strsplit(CATEGORY, "\\W+")) < 3)
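
One caveat, in case the larger dataset has category names made of several words: `\\W+` splits on any run of non-word characters, so spaces and hyphens inside a name are counted as separators too. The sketch below compares it with splitting on a comma followed by optional whitespace. The strings "Pre-Roll" and "Topical Cream" are invented for illustration and do not come from the question's data.

```r
# Made-up category strings; "Pre-Roll" and "Topical Cream" are hypothetical
x <- c("Edible", "Edible, Vape", "Pre-Roll, Topical Cream")

# Splitting on non-word characters also breaks names apart at "-" and " ",
# so the last entry is counted as four words
by_word <- lengths(strsplit(x, "\\W+"))

# Splitting on a comma plus optional whitespace counts list items only,
# so the last entry is counted as two items
by_comma <- lengths(strsplit(x, ",\\s*"))
```

For the categories shown in the question both patterns give the same counts, so the difference only matters if names like these turn up.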
juanbarq