Find rows where column values are in set (similar to pandas isin or R %in%)

Question

using CSV, DataFrames
iris = CSV.read(joinpath(dirname(pathof(DataFrames)),"..","test/data/iris.csv"))

head(iris)
6×5 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│     │ Float64⍰    │ Float64⍰   │ Float64⍰    │ Float64⍰   │ String⍰ │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa  │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa  │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa  │
│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa  │

I want to find all rows where Species is either setosa or virginica. Note that the answer must look the values up in an array, since I want it to work for arbitrarily many values.

There is a function called indexin. It gets me halfway there:

iris[indexin(iris.Species, ["setosa", "virginica"])]

But when I try to use it for indexing the result is:

ERROR: ArgumentError: Only Integer values allowed when indexing by vector of numbers
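
For context, here is a minimal sketch of why `indexin` alone does not work as a row index. It uses a plain vector (the hypothetical names `species` and `wanted` stand in for the DataFrame column and the lookup list). `indexin(a, b)` returns, for each element of `a`, its position in `b`, or `nothing` when there is no match, so the result is not a vector of row numbers.

```julia
# Stand-ins for iris.Species and the lookup list (illustrative data only)
species = ["setosa", "versicolor", "virginica"]
wanted = ["setosa", "virginica"]

# Position of each element of `species` inside `wanted`, or `nothing`
idx = indexin(species, wanted)

# The rows that matched are the positions where idx is not `nothing`
hits = findall(x -> x !== nothing, idx)
```

Here `idx` contains a `nothing` for "versicolor", which is why it cannot be used directly for indexing, while `hits` holds plain integer positions.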

https://stackoverflow.com/a/29661623 – jezrael Oct 02 '18 at 11:59

score 3 · Accepted Answer · answered Oct 02 '18 at 12:07

iris[in.(iris[:Species], (["virginica", "setosa"],)), :]

The additional tuple around ["virginica","setosa"] stops broadcasting from iterating over the search list, so each Species value is tested against the whole list.
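
A minimal sketch of the same idea on a plain vector, so it runs without any packages. The names `species` and `wanted` are illustrative stand-ins, and `Ref` is shown as an alternative wrapper that has the same effect as the one-element tuple.

```julia
# Stand-ins for iris[:Species] and the search list (illustrative data only)
species = ["setosa", "versicolor", "virginica", "setosa"]
wanted = ["virginica", "setosa"]

# Wrapping `wanted` in a one-element tuple makes broadcasting treat it
# as a single value, so every element of `species` is checked against
# the whole list.
mask = in.(species, (wanted,))

# Ref(wanted) protects the list from broadcasting in the same way.
mask_ref = in.(species, Ref(wanted))

# Without a wrapper, in.(species, wanted) would try to pair the two
# vectors element by element and fail because their lengths differ.
selected = species[mask]
```

The resulting Boolean mask can then be used as the row index of the DataFrame, as in the answer above.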

Przemyslaw Szufel


Ah, that is kinda ugly, but I guess those kinks will be ironed out eventually :) Thanks – The Unfun Cat Oct 02 '18 at 12:09
A more Julian approach would be to write `filter(x -> x[:Species] in ["virginica", "setosa"], iris)`. If you use DataFramesMeta, you can alternatively write `@where(iris, in.(:Species, [["virginica", "setosa"]]))`. – Bogumił Kamiński Oct 02 '18 at 12:27

score 1 · Answer 2 · answered Oct 04 '18 at 21:02

A way to achieve this is to use findall:

iris[findall(in(["setosa", "virginica"]), iris.Species), :]
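
A minimal sketch of how this works, again on a plain vector with illustrative names (`species`, `wanted`). `in(collection)` returns a function that tests membership in `collection`, and `findall` returns the indices where that function is true. The `Set` variant is an optional suggestion, not part of the answer above; it may help when the lookup list is long.

```julia
# Stand-ins for iris.Species and the lookup list (illustrative data only)
species = ["setosa", "versicolor", "virginica", "setosa"]
wanted = ["setosa", "virginica"]

# in(wanted) is a one-argument function: x -> x in wanted
rows = findall(in(wanted), species)

# Optional: a Set gives constant-time membership tests for long lists
rows_set = findall(in(Set(wanted)), species)

matched = species[rows]
```

Because `rows` is a vector of integer positions, it can be passed directly as the row index of a DataFrame.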

Milan Bouchet-Valat


score 0 · Answer 3 · answered Oct 03 '18 at 16:04

You can use the findin function.

iris[findin(iris[:Species],["setosa","virginica"]),:]

Note that findin always expects an array as its second argument, even when you are searching for a single value:

iris[findin(iris[:Species],["setosa"]),:]

tpdsantos


`findin` is deprecated in 0.7 and removed from 1.0. `findall(in(...), ...)` replaces it. – Milan Bouchet-Valat Oct 04 '18 at 21:00
Yeah, I realized that after posting this answer... Thanks for the heads up anyway :) – tpdsantos Oct 16 '18 at 10:32
