Extracting all rows from multiple columns with specific value from dataframe in R

Question

I have a dataframe called all_genes with 157 columns in total; the first column, genes, contains the gene names. The columns of interest are every second column from the 50th to the 157th (50, 52, 54, 56, etc.), which are named after the samples. These columns hold three possible values: 1, 2 or 3, and the same row (same gene) can have different values for different samples.

For example, the row of gene X has a value of 1 in the 50th column but a value of 2 in the 52nd column.

What I want is to extract all rows based on the values in these even-numbered columns. To give a better idea, here's what the dataframe looks like:

Original dataframe

Now, I have written this code to extract, for example, rows of value 1:

# extracting rows of value "1" from column 50 to 157, by taking into account only the even columns
df <- all_genes[which(all_genes[, seq(50, 157, 2)] == 1), ] 

# removing NAs if all the rows are NAs from columns 50 to 157
df <- df[rowSums(is.na(df[, 50:157])) != ncol(df[, 50:157]), ]

However, what I get is the following:

Output of the above code

As you can see, the first column contains only values of 1, but the other columns also contain values of 2 (and 3). I think my code only looks at the 50th column and ignores the others: for the same gene, we can have a value of 2 in the 50th column but 1 in the 52nd column, and such a row is not picked up. To confirm that, I checked this case (please copy-paste the following link since I don't have enough reputation):

i.stack.imgur.com/rZQ2E.png

Could you please tell me whether my code is working correctly, or whether I should change something?

The same thing happens if I change the value in my code from 1 to 2: I still get only values of 2 in the 50th column but all kinds of values in the other columns.
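A minimal sketch of what the `which()` call in the code above does, using a small hypothetical data frame `toy` (an assumption for illustration, not the real data). Comparing several columns to 1 gives a logical matrix, and `which()` returns column-major linear indices into that matrix, not row numbers. Indices up to `nrow(toy)` coincide with rows where the first selected column is 1; larger indices point past the last row and come back as all-NA rows.

```r
# Hypothetical toy data: one id column and three sample columns
toy <- data.frame(gene = c("g1", "g2", "g3"),
                  s1 = c(1, 2, NA),
                  s2 = c(2, 1, 1),
                  s3 = c(NA, NA, 1))

hits <- toy[, c("s1", "s2", "s3")] == 1  # 3 x 3 logical matrix
idx <- which(hits)                       # linear (column-major) indices
idx

# Used as row indices, only index 1 is a real row;
# the others exceed nrow(toy) and produce all-NA rows
bad <- toy[idx, ]
bad

# which(..., arr.ind = TRUE) reports the row of every hit instead
rows <- sort(unique(which(hits, arr.ind = TRUE)[, "row"]))
toy[rows, ]
```

This would explain why only the first selected column looks filtered, and why so many all-NA rows appear before the second step removes them.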

Thanks in advance.

EDIT As requested by @tobiasegli_te, here's a small reproducible dataframe:

structure(list(`#00e41e6a-9fe7-44f9-978b-7b05b179506a` = c(1, 
1, NA, NA, NA, NA, NA, NA, NA, 1, 2, 1, NA, NA, 2, NA, 3, 1, 
1, NA, NA, NA, 2, NA, 1, NA, NA, NA, NA, 1, 1, 1, NA, NA, 1, 
NA, NA), `#aca312ab-6dbd-4183-8b22-8f37834f3426` = c(NA, NA, 
NA, 1, NA, 1, NA, 2, 1, NA, 2, 1, 1, 1, NA, NA, NA, 1, 1, 1, 
NA, 1, 2, 1, NA, 1, NA, 1, NA, NA, 1, NA, 1, NA, 1, 1, 1), `#0730216b-c201-443c-9092-81e23fd13c6c` = c(NA, 
NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 1, NA, NA, 1, NA, 
NA, NA, NA, 1, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, 
2, 1, NA, NA), `#acd5ceef-c5cf-4e95-9394-c50fdbc70c8d` = c(NA, 
NA, 2, NA, 2, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, NA, 
NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, NA, 1, NA, NA, NA, 
1, NA, NA)), .Names = c("#00e41e6a-9fe7-44f9-978b-7b05b179506a", 
"#aca312ab-6dbd-4183-8b22-8f37834f3426", "#0730216b-c201-443c-9092-81e23fd13c6c", 
"#acd5ceef-c5cf-4e95-9394-c50fdbc70c8d"), row.names = c(1L, 2L, 
4L, 6L, 8L, 10L, 11L, 16L, 20L, 22L, 23L, 30L, 32L, 37L, 38L, 
43L, 45L, 46L, 47L, 49L, 50L, 53L, 62L, 64L, 65L, 67L, 68L, 69L, 
70L, 71L, 73L, 74L, 76L, 77L, 79L, 80L, 81L), class = "data.frame")

Please provide the example data using `dput()` https://stackoverflow.com/questions/5963269/how-to-make-a-great-… — tobiasegli_te, Oct 17 '17 at 14:45
@tobiasegli_te the output from `dput()` is very big, even with `droplevels` and `head()`. However, I have included the table that you can load directly: https://drive.google.com/open?id=0B6ng04WZzK7JTEVxZF9hUFN4MkE — Zen, Oct 17 '17 at 14:57
Obviously you don't have to dput everything, just enough rows and columns to understand your problem and figure out a solution. — tobiasegli_te, Oct 17 '17 at 14:58
I would suggest pulling your columns of interest out as a matrix. `focus = as.matrix(all_genes[, seq(50, 157, 2)])`. It is easier to operate row-wise on a matrix than a data frame. Then you can find rows with all 1s, ignoring `NA`s, `apply(focus == 1, MARGIN = 1, FUN = all, na.rm = TRUE)`. Or identify rows with all NA values, `apply(is.na(focus), MARGIN = 1, FUN = all)`, etc. — Gregor Thomas, Oct 17 '17 at 15:07
You can use `any` instead of `all` for other conditions. This will probably be more efficient than the approaches in your question because it does the relatively expensive extraction and matrix conversion just once, instead of every time you use `rowSums` or `apply`. — Gregor Thomas, Oct 17 '17 at 15:16
@tobiasegli_te I have edited my main post, please find the `dput()` output. @Gregor Thank you for your suggestion, I will check them out. — Zen, Oct 17 '17 at 15:21
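The matrix approach suggested in the comments could look roughly like the sketch below. The data frame `toy` and its column names are hypothetical stand-ins for `all_genes[, seq(50, 157, 2)]`. One caveat worth knowing: `all(..., na.rm = TRUE)` is TRUE for a row that is entirely NA, so it is combined with the all-NA check here.

```r
# Hypothetical stand-in for the real data
toy <- data.frame(gene = c("g1", "g2", "g3", "g4"),
                  s1 = c(1, 2, NA, NA),
                  s2 = c(1, 1, NA, 3))

# Extract and convert the columns of interest once
focus <- as.matrix(toy[, c("s1", "s2")])

# Row-wise conditions, ignoring NAs
all_one <- apply(focus == 1, MARGIN = 1, FUN = all, na.rm = TRUE)
any_one <- apply(focus == 1, MARGIN = 1, FUN = any, na.rm = TRUE)
all_na  <- apply(is.na(focus), MARGIN = 1, FUN = all)

toy[any_one, ]             # rows with at least one 1
toy[all_one & !all_na, ]   # rows whose non-NA values are all 1
```

Note that the function argument of `apply()` is spelled `FUN` in upper case.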

score 0 · Answer 1 · answered Oct 17 '17 at 15:14

For the first case try something like

mtcars[sapply(1:nrow(mtcars), function(i) any(mtcars[i, seq(2, ncol(mtcars), 2)] == 4)),]

#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
# Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
# Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
# Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
# Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

For your data

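# na.rm = TRUE: a row with NAs and no 1 gives FALSE instead of NA (which would return an all-NA row)
all_genes[sapply(1:nrow(all_genes), function(i) any(all_genes[i, seq(50, 157, 2)] == 1, na.rm = TRUE)),]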

For the second case try something like

mtcars[sapply(1:nrow(mtcars), function(i) all(is.na(mtcars[i, seq(2, ncol(mtcars), 2)]))),]

For your data

all_genes[sapply(1:nrow(all_genes), function(i) all(is.na(all_genes[i, seq(50, 157, 1)]))),]
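As an alternative to looping over rows with `sapply`, the same two checks can be sketched with `rowSums`, which works on the whole block of columns at once. This is only an illustration on a hypothetical data frame `toy`; `cols` stands in for `seq(50, 157, 2)`.

```r
# Hypothetical toy data with NAs, standing in for all_genes
toy <- data.frame(gene = c("g1", "g2", "g3", "g4"),
                  s1 = c(1, 2, NA, NA),
                  s2 = c(2, NA, NA, 1))
cols <- c("s1", "s2")

# First case: rows with at least one value equal to 1 (NAs ignored)
has_one <- rowSums(toy[, cols] == 1, na.rm = TRUE) > 0
res <- toy[has_one, ]
res

# Second case: rows where every selected column is NA
all_na <- rowSums(!is.na(toy[, cols])) == 0
toy[all_na, ]
```

A row selected by `has_one` has at least one non-NA value by construction, so no separate NA-removal step is needed after the first case.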

Thanks a lot for your answer! This is exactly what I was looking for. The code for the first case applied to my data works better. I think I understand now what the problem was. +1 — Zen, Oct 17 '17 at 15:24
