189

I want to filter rows from a data.frame based on a logical condition. Let's suppose that I have data frame like

   expr_value     cell_type
1    5.345618 bj fibroblast
2    5.195871 bj fibroblast
3    5.247274 bj fibroblast
4    5.929771          hesc
5    5.873096          hesc
6    5.665857          hesc
7    6.791656          hips
8    7.133673          hips
9    7.574058          hips
10   7.208041          hips
11   7.402100          hips
12   7.167792          hips
13   7.156971          hips
14   7.197543          hips
15   7.035404          hips
16   7.269474          hips
17   6.715059          hips
18   7.434339          hips
19   6.997586          hips
20   7.619770          hips
21   7.490749          hips

What I want to is to get a new data frame which looks the same but only has the data for one cell_type. E.g. subset / select rows which contains the cell type "hesc":

   expr_value     cell_type
1    5.929771          hesc
2    5.873096          hesc
3    5.665857          hesc

Or either cell type "bj fibroblast" or "hesc":

   expr_value     cell_type
1    5.345618 bj fibroblast
2    5.195871 bj fibroblast
3    5.247274 bj fibroblast
4    5.929771          hesc
5    5.873096          hesc
6    5.665857          hesc

Is there any easy way to do this?

I've tried:

expr[expr[2] == 'hesc']
# [1] "5.929771" "5.873096" "5.665857" "hesc"     "hesc"     "hesc"    

if the original data frame is called "expr", but it gives the results in wrong format as you can see.

Henrik
  • 65,555
  • 14
  • 143
  • 159
lhahne
  • 5,909
  • 9
  • 33
  • 40

9 Answers9

269

To select rows according to one 'cell_type' (e.g. 'hesc'), use ==:

expr[expr$cell_type == "hesc", ]

To select rows according to two or more different 'cell_type', (e.g. either 'hesc' or 'bj fibroblast'), use %in%:

expr[expr$cell_type %in% c("hesc", "bj fibroblast"), ]
Henrik
  • 65,555
  • 14
  • 143
  • 159
learnr
  • 6,479
  • 4
  • 27
  • 23
97

Use subset (for interactive use)

subset(expr, cell_type == "hesc")
subset(expr, cell_type %in% c("bj fibroblast", "hesc"))

or better dplyr::filter()

filter(expr, cell_type %in% c("bj fibroblast", "hesc"))
rcs
  • 67,191
  • 22
  • 172
  • 153
  • 42
    Careful! The documentation of `subset` has a big WARNING: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standar d evaluation of argument *subset* can have unanticipated consequen ces." – Aleksandar Dimitrov Mar 09 '13 at 13:52
39

The reason expr[expr[2] == 'hesc'] doesn't work is that for a data frame, x[y] selects columns, not rows. If you want to select rows, change to the syntax x[y,] instead:

> expr[expr[2] == 'hesc',]
  expr_value cell_type
4   5.929771      hesc
5   5.873096      hesc
6   5.665857      hesc
Ken Williams
  • 22,756
  • 10
  • 85
  • 147
  • This will pick up any `NA` records as well! Hence, not applicable. The reason it seemed to be true resulted from the fact that expr dataframe has no `NA` in the filtered column. If there is `NA` there, your way is not applicable as I said before. – Erdogan CEVHER Jul 30 '18 at 21:33
  • Thanks for the explanation of how to work with columns and rows. – Zhivar Sourati Feb 27 '21 at 12:32
31

You could use the dplyr package:

library(dplyr)
filter(expr, cell_type == "hesc")
filter(expr, cell_type == "hesc" | cell_type == "bj fibroblast")
nathaneastwood
  • 3,664
  • 24
  • 41
10

No one seems to have included the which function. It can also prove useful for filtering.

expr[which(expr$cell == 'hesc'),]

This will also handle NAs and drop them from the resulting dataframe.

Running this on a 9840 by 24 dataframe 50000 times, it seems like the which method has a 60% faster run time than the %in% method.

eigenfoo
  • 229
  • 2
  • 9
7

I was working on a dataframe and having no luck with the provided answers, it always returned 0 rows, so I found and used grepl:

df = df[grepl("downlink",df$Transmit.direction),]

Which basically trimmed my dataframe to only the rows that contained "downlink" in the Transmit direction column. P.S. If anyone can guess as to why I'm not seeing the expected behavior, please leave a comment.

Specifically to the original question:

expr[grepl("hesc",expr$cell_type),]

expr[grepl("bj fibroblast|hesc",expr$cell_type),]
2

Sometimes the column you want to filter may appear in a different position than column index 2 or have a variable name.

In this case, you can simply refer the column name you want to filter as:

columnNameToFilter = "cell_type"
expr[expr[[columnNameToFilter]] == "hesc", ]
Daniel Bonetti
  • 2,306
  • 2
  • 24
  • 33
1

This worked like magic for me.

celltype_hesc_bool = expr['cell_type'] == 'hesc'

expr_celltype_hesc = expr[celltype_hesc]

Check this blog post

DKMDebugin
  • 551
  • 7
  • 16
0

we can use data.table library

  library(data.table)
  expr <- data.table(expr)
  expr[cell_type == "hesc"]
  expr[cell_type %in% c("hesc","fibroblast")]

or filter using %like% operator for pattern matching

 expr[cell_type %like% "hesc"|cell_type %like% "fibroblast"]
Varn K
  • 400
  • 4
  • 8