Sorry if this is a duplicate, but I can't seem to find the information anywhere else on SO, even though it seems like such a simple problem. I have a data frame in which several columns are factors. Some of those hold integers, and some hold strings. I would like to extract the rows that correspond to a particular factor level. For example,

my_data <- read.table(file = "my_data.txt", header = TRUE)
my_data[ my_data$age == 20, ]

This works, but if I then try

my_data[ my_data$gender == "male", ]

I get no matches. I realized the two are not the same thing: checking the class of my_data$gender[1] gives factor, while I'm comparing it against a string.

Any ideas what I'm doing wrong here?

Cheers

Data sample:

Size Age Gender Value
1    20  male   0.5
4    22  female 0.7
3    14  female 0.3

Samuel Tan
  • Should we assume that you have tried to use the correct `[row, col]` extracting form, as in `my_data[my_data$gender == "male", ]`? – A5C1D2H2I1M1N2O1R2T1 Feb 12 '14 at 01:46
  • Could you give us a sample of your data (e.g. dput(head(my_data)))? – matt_k Feb 12 '14 at 02:28
  • yes, I used the `[row, col]` format....I realised my mistake now....I should have done `my_data[ my_data$gender == " male ", ]` Do you see the difference? Quite embarrassing, really. – Samuel Tan Feb 12 '14 at 04:41
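For anyone landing here with the same symptom: comparing a factor against a string with `==` does work in R, because the comparison is made against the factor's labels. As the last comment suggests, the usual culprit is stray whitespace in the values. Here is a minimal sketch of how one might spot and fix that. The data frame is made up to resemble the sample (with padding added on purpose), and `trimws()` assumes R 3.2.0 or later.

```r
# Hypothetical data resembling the sample, with stray spaces around gender
my_data <- data.frame(
  size   = c(1, 4, 3),
  age    = c(20, 22, 14),
  gender = factor(c(" male ", " female ", " female ")),
  value  = c(0.5, 0.7, 0.3)
)

# Printing the levels makes the padding visible
levels(my_data$gender)

# The comparison itself is fine, but nothing equals "male" exactly
padded_matches <- nrow(my_data[my_data$gender == "male", ])

# Strip leading/trailing whitespace and rebuild the factor
my_data$gender <- factor(trimws(as.character(my_data$gender)))

# Now the same subset finds the row
males <- my_data[my_data$gender == "male", ]
```

Checking `levels()` (or `unique()` for character columns) before subsetting is a cheap way to catch this kind of mismatch early.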

2 Answers

Try using the subset function.

This site provides a good reference: HowtoInR

my_data = subset(my_data, gender == "male")
LearnR

This is an answer to an old question, but I'd like to share my current way of doing things where mistakes like this happen a lot less.

The answer is the data.table package. It has saved me hundreds of lines of code and will continue to do so. Subsetting becomes a piece of cake:

library(data.table)
my_data <- data.table(my_data)
my_data[gender == "male" & age <= 20]

I can string as many conditionals as I like, and also use .SD to pass columns as arguments to functions, like so:

my_data[gender == "male" & age <= 20, lapply(.SD, mean), by = c("nationality", "height")]
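One caveat with `lapply(.SD, mean)`: `.SD` holds every column not named in `by`, so a character or factor column such as gender would also be passed to `mean()`. A rough sketch of restricting it with `.SDcols` follows. The data is invented for illustration (the nationality and height columns are not in the sample from the question), so treat it as an assumption rather than the real table.

```r
library(data.table)

# Hypothetical data: nationality and height values are invented
my_data <- data.table(
  gender      = c("male", "male", "male", "female"),
  age         = c(20, 18, 19, 22),
  nationality = c("SG", "SG", "MY", "SG"),
  height      = c(170, 170, 175, 160),
  size        = c(1, 3, 2, 4),
  value       = c(0.5, 0.3, 0.4, 0.7)
)

# .SDcols limits .SD to the numeric columns we want averaged;
# without it, mean() would also be applied to the character column gender
result <- my_data[gender == "male" & age <= 20,
                  lapply(.SD, mean),
                  by = c("nationality", "height"),
                  .SDcols = c("size", "value")]
```

The result has one row per nationality/height group, with the group means of size and value.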

Creating columns from existing ones is also much simpler, and I can even create several columns at once.
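To illustrate the column creation, here is a small sketch using data.table's `:=` operator, which adds columns by reference. The data mirrors the sample from the question; the new column names (`value_per_size`, `is_adult`, `size_sq`) are made up for the example.

```r
library(data.table)

# Data modelled on the sample in the question
my_data <- data.table(
  size   = c(1, 4, 3),
  age    = c(20, 22, 14),
  gender = c("male", "female", "female"),
  value  = c(0.5, 0.7, 0.3)
)

# One new column, added by reference (no reassignment needed)
my_data[, value_per_size := value / size]

# Several new columns at once, using the functional form of :=
my_data[, `:=`(is_adult = age >= 18, size_sq = size^2)]
```

Because `:=` modifies the table in place, there is no need to write `my_data <- ...` for each new column.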

Samuel Tan