45

I have the following table1 which is a data frame composed of 6 columns and 8083 rows. Below I am displaying the head of this table1:

|gene ID        |   prom_65|   prom_66|  amast_69|  amast_70|   p_value|
|:--------------|---------:|---------:|---------:|---------:|---------:|
|LdBPK_321470.1 |   24.7361|   25.2550|   31.2974|   45.4209| 0.2997430|
|LdBPK_251900.1 |  107.3580|  112.9870|   77.4182|   86.3211| 0.0367792|
|LdBPK_331430.1 |   72.0639|   86.1486|   68.5747|   77.8383| 0.2469355|
|LdBPK_100640.1 |   43.8766|   53.4004|   34.0255|   38.4038| 0.1299948|
|LdBPK_330360.1 | 2382.8700| 1871.9300| 2013.4200| 2482.0600| 0.8466225|
|LdBPK_090870.1 |   49.6488|   53.7134|   59.1175|   66.0931| 0.0843242|

I have another data frame, called accessions40 which is a list of 510 gene IDs. It is a subset of the first column of table1 i.e. all of its values (510) are contained in the first column of table1 (8083). The head of accessions40 is displayed below:

|V1             |
|:--------------|
|LdBPK_330360.1 |
|LdBPK_283000.1 |
|LdBPK_360210.1 |
|LdBPK_261550.1 |
|LdBPK_367320.1 |
|LdBPK_361420.1 |

What I want to do is the following: I want to produce a new table2 which contains under the first column (gene ID) only the values present in accessions40 and the corresponding values from the other five columns from table1. In other words, I want to subset the first column of my table1 based on the values of accessions40.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
BCArg
  • 2,094
  • 2
  • 19
  • 37

3 Answers3

80

We can use %in% to get a logical vector and subset the rows of the 'table1' based on that.

subset(table1, gene_ID %in% accessions40$V1)

A better option would be data.table

library(data.table)
setDT(table1)[gene_ID %chin% accessions40$V1]

Or use filter from dplyr

library(dplyr)
table1 %>%
      filter(gene_ID %in% accessions40$V1)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    How to filter the values in gene_ID that are not in accession40$V1? – RanonKahn May 28 '19 at 22:13
  • 4
    @RanonKahn Use the negate (`!` i.e. `setDT(table)[!gene_ID %chin% accessions40$V1]` – akrun May 28 '19 at 22:42
  • 1
    I was using the negate sign in the wrong place and running into error. Thanks! – RanonKahn May 30 '19 at 17:45
  • Although it's late, happy to hear from you. I have the same problem and I used this subset code, but it gave an empty output file that had just the name of the columns in the first (original) file, that called "table1" in this post. Could you please help me out to solve the issue? – Mary Dec 06 '21 at 12:21
  • 1
    @Mary please consider to post as a new question with a reproducible example – akrun Dec 06 '21 at 16:34
12

There are many ways to do this. Finding the gene_ID in table1 which are present in V1 column of accession40

table1[table1$gene_ID %in% accessions40$V1, ]

Or you can also use match

table1[match(accessions40$V1, table1$gene_ID), ]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

semi_join does not get used often, but works for this case:

dplyr::semi_join(table1, accessions40, by = c('gene_ID' = 'V1'))

The first data frame is subset by the second, only when there is a match. If by is not specified, then by default it will filter by all intersecting columns.

Data

table1 <- structure(list(gene_ID = c("LdBPK_321470.1", "LdBPK_251900.1", 
                                     "LdBPK_331430.1", "LdBPK_100640.1", "LdBPK_330360.1", "LdBPK_090870.1"
), prom_65 = c(24.7361, 107.358, 72.0639, 43.8766, 2382.87, 49.6488
), prom_66 = c(25.255, 112.987, 86.1486, 53.4004, 1871.93, 53.7134
), amast_69 = c(31.2974, 77.4182, 68.5747, 34.0255, 2013.42, 
                59.1175), amast_70 = c(45.4209, 86.3211, 77.8383, 38.4038, 2482.06, 
                                       66.0931), p_value = c(0.299743, 0.0367792, 0.2469355, 0.1299948, 
                                                             0.8466225, 0.0843242)), class = "data.frame", row.names = c(NA, 
                                                                                                                         -6L))

accessions40 <- structure(list(V1 = c("LdBPK_330360.1", "LdBPK_283000.1", "LdBPK_360210.1", 
                                      "LdBPK_261550.1", "LdBPK_367320.1", "LdBPK_361420.1")), class = "data.frame", row.names = c(NA, 
                                                                                                                                  -6L))
LMc
  • 12,577
  • 3
  • 31
  • 43