Extract rows from R data frame based on factors (strings)

Question

Sorry if this is a duplicate, but I can't seem to find the information anywhere else on SO, even though it seems like such a simple problem. I have a data frame with several columns as factors. Some of those are integers, and some are strings. I would like to extract the rows that correspond to a particular factor. For example,

my_data <- read.table(file = "my_data.txt", header = TRUE)
my_data[ my_data$age == 20, ]

This works, but if I then try

my_data[ my_data$gender == "male", ]

This produces no matches. I realized they are not the same thing, as checking the class of my_data$name[1] gives factor, while I'm checking it against a string.

Any ideas what I'm doing wrong here?

Cheers

Data sample:

Size Age Gender Value
1    20  male   0.5
4    22  female 0.7
3    14  female 0.3

Should we assume that you have tried to use the correct `[row, col]` extracting form, as in `my_data[my_data$gender == "male", ]`? – A5C1D2H2I1M1N2O1R2T1, Feb 12 '14 at 01:46
Could you give us a sample of your data (e.g. `dput(head(my_data))`)? – matt_k, Feb 12 '14 at 02:28
Yes, I used the `[row, col]` format. I realised my mistake now: I should have done `my_data[ my_data$gender == " male ", ]`. Do you see the difference? Quite embarrassing, really. – Samuel Tan, Feb 12 '14 at 04:41
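
For readers who hit the same symptom, here is a minimal sketch of the whitespace problem described in the comment above. The data frame is made up to resemble the sample (the padded labels and lowercase column names are assumptions, not the asker's actual file). It shows that comparing a factor with a string is fine in itself, and that trimming the level labels with `trimws()` is one possible fix.

```r
# Hypothetical data resembling the sample, with stray spaces around gender
my_data <- data.frame(
  age    = c(20, 22, 14),
  gender = factor(c(" male ", " female ", " female ")),
  value  = c(0.5, 0.7, 0.3)
)

# Inspect the exact level labels; the padding becomes visible here
print(levels(my_data$gender))

# The comparison itself is valid, but "male" does not equal " male "
no_match <- my_data[my_data$gender == "male", ]

# Strip leading and trailing whitespace from the labels, then subset again
levels(my_data$gender) <- trimws(levels(my_data$gender))
males <- my_data[my_data$gender == "male", ]
print(males)
```

Printing `levels()` (or using `dput()`, as suggested in the comments) is usually the quickest way to spot this kind of invisible difference.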

score 11 · Accepted Answer · edited Feb 12 '14 at 01:51

Try using the subset function.

This site provides a good reference: HowtoInR

my_data = subset(my_data, gender == "male")
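
As a side note, `subset()` and bracket indexing are not completely interchangeable when the column contains missing values. The sketch below uses a small made-up data frame (not the asker's data) to show the difference: `subset()` drops rows where the condition is `NA`, while `[` keeps them as all-`NA` rows.

```r
# Hypothetical data with a missing gender value
df <- data.frame(gender = c("male", NA, "female"), age = c(20, 22, 14))

# Bracket indexing: an NA in the condition produces an all-NA row
by_bracket <- df[df$gender == "male", ]

# subset(): an NA in the condition is treated as FALSE
by_subset <- subset(df, gender == "male")

print(nrow(by_bracket))
print(nrow(by_subset))
```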

edited Feb 12 '14 at 01:51

thelatemail

answered Feb 12 '14 at 01:31

LearnR

Thanks for your reply. However, it gives the same output. – Samuel Tan Feb 12 '14 at 03:40
Could you give a sample of your data? – LearnR Feb 12 '14 at 03:44
Thanks for that, found the problem, see my comment above. Sorry about the trouble. I knew I was doing something wrong. – Samuel Tan Feb 12 '14 at 04:42
This works even for boolean comparison, e.g. data$x >= data$y. – Mohammed Nov 10 '16 at 14:07

score 4 · Answer 2 · edited May 23 '17 at 12:08

This is an answer to an old question, but I'd like to share my current way of doing things where mistakes like this happen a lot less.

The answer is the data.table package. It has saved me hundreds of lines of code and will continue to do so. Subsetting becomes a piece of cake:

library(data.table)
my_data <- data.table(my_data)  # or setDT(my_data) to convert in place
my_data[gender == "male" & age <= 20]

I can string as many conditionals as I like, and also use .SD to pass columns as arguments to functions, like so:

my_data[gender == "male" & age <= 20, lapply(.SD, mean), by = c("nationality", "height")]

Column creation from existing columns is much simpler, even when creating multiple columns at once.
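
The original example for this point did not survive, so what follows is only a sketch of what column creation with data.table's `:=` operator can look like. The data and the new column names (`age_months`, `value_pct`, `is_adult`) are hypothetical and chosen purely for illustration.

```r
library(data.table)

# Hypothetical data; column names are illustrative only
dt <- data.table(
  gender = c("male", "female", "female"),
  age    = c(20, 22, 14),
  value  = c(0.5, 0.7, 0.3)
)

# Add a single column by reference, derived from an existing one
dt[, age_months := age * 12]

# Add several columns in one call
dt[, c("value_pct", "is_adult") := list(value * 100, age >= 18)]

print(dt)
```

Because `:=` modifies the table by reference, no reassignment such as `dt <- ...` is needed.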
