All right, so it looks like you're definitely brand new to programming as a whole, and not just R. So first and foremost, I'd highly recommend checking out some of the MOOCs on Coursera, such as this one. But, as for your question, let's look at each piece of it that seems to be causing confusion.
First, when asking for help on this site, it's always best to provide actual data, and not a picture of your dataset. Given that you already had a dataframe in R that you were working with, you could easily take advantage of the dput
function and then copy its output into your question. So, for example, you might have the following dataframe:
df = data.frame(name=c("John", "Jim", "Sally"), state=c("MI", "FL", "NY"), state_leader=c(TRUE, FALSE, TRUE))
df
name state state_leader
1 John MI TRUE
2 Jim FL FALSE
3 Sally NY TRUE
Then we can just use dput(df)
and get the following output:
dput(df)
structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Jim",
"John", "Sally"), class = "factor"), state = structure(c(2L,
1L, 3L), .Label = c("FL", "MI", "NY"), class = "factor"), state_leader = c(TRUE,
FALSE, TRUE)), .Names = c("name", "state", "state_leader"), row.names = c(NA,
-3L), class = "data.frame")
Those of us on Stack Overflow can now copy the output from dput and have a working copy of your dataset.
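To make the round trip concrete, here is a minimal sketch that writes the dput output to a temporary file and reads it back with dget, standing in for the copy and paste between your question and an answerer's R session. The names tmp and df_copy are just illustrative. Note that on R 4.0.0 or later, data.frame() keeps strings as character vectors by default, so your own dput output will show character vectors rather than the factors above.

```r
# Rebuild the example data frame
df <- data.frame(name = c("John", "Jim", "Sally"),
                 state = c("MI", "FL", "NY"),
                 state_leader = c(TRUE, FALSE, TRUE))

# Write the dput() text to a temporary file (a stand-in for copy/paste)
tmp <- tempfile(fileext = ".R")
dput(df, file = tmp)

# dget() evaluates that text and recreates the object
df_copy <- dget(tmp)

# Check that the copy matches the original
all.equal(df, df_copy)
```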
Next, let's look at your confusion around how to set new values in a dataset. In your updated text, you tried to set state_leader
equal to FALSE
with the following code df["John", "state_leader"] = FALSE
. This doesn't work because R treats a character string in the row position as a row name, not as a value to look up in the name column. Your rows are named 1, 2 and 3, so "John" matches none of them, and instead of updating John's row the assignment appends a new row named "John". The column part of the index, "state_leader", is fine. What is missing is a row index that tells R to compare against the name column.
The proper way to do what you wanted to do is with the following.
df[df$name == "John", "state_leader"] = FALSE
This way, R knows to update only the rows where the variable name
is equal to "John".
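To see why the row index matters, the sketch below rebuilds the example df and compares the two approaches. The copy named attempt exists only for illustration. A string in the row position is matched against row names, so the first assignment appends a new row called "John" rather than touching John's existing row, while the logical condition updates exactly the row you meant.

```r
df <- data.frame(name = c("John", "Jim", "Sally"),
                 state = c("MI", "FL", "NY"),
                 state_leader = c(TRUE, FALSE, TRUE))

# Indexing by the string "John": R looks for a row NAMED "John"
attempt <- df
attempt["John", "state_leader"] <- FALSE
nrow(attempt)          # one more row than df
rownames(attempt)

# Indexing by a condition on the name column
df$name == "John"      # logical vector, one value per row
df[df$name == "John", "state_leader"] <- FALSE
df$state_leader
```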
So now that we have that, it'd probably be a good time to look at the [
operator and understand how it works, because finding the values you want is much simpler than your algorithm suggests once you understand indexing.
If you have a one-dimensional object in R, such as a vector, [
takes one parameter. If you have a two-dimensional object, such as a dataframe or matrix, [
takes two parameters, either one of which is optional. Let's look at a few examples.
x = 1:10 # A one-dimensional vector
x[1:3] # Get the first three elements of x
x[c(1, 3, 5, 7, 9)] # Get the elements at positions 1, 3, 5, 7, 9 (the odd values of x)
x[x %% 2 != 0] # Get all odd elements of x
In the examples above, we're working with a one-dimensional vector. The three operations we perform highlight a couple key points about [
. The first key point is that [
expects a numeric input, or something that can be converted to a numeric input. Second, the numeric inputs do not have to be consecutive. Lastly, the index can be an expression that R evaluates first, such as x %% 2 != 0
. This last example is perfect for demonstrating what I mean by "something that can be converted to a numeric input". You can think of this in the following way: First, R computes x %% 2
. It then checks each element to see if it is equal to 0 or not, which returns a vector of Boolean values equal to TRUE
or FALSE
. It then checks which values are TRUE
and returns a vector of indices equal to c(1, 3, 5, 7, 9)
, which is identical to our second example.
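The mental model above can be checked directly. In this sketch, which() makes the "which values are TRUE" step explicit. You do not need to call it yourself, since [ accepts the logical vector as is. The names mask and positions are just illustrative.

```r
x <- 1:10

mask <- x %% 2 != 0        # logical vector: TRUE where the element is odd
positions <- which(mask)   # integer positions of the TRUE values

x[mask]                    # index with the logical vector directly
x[positions]               # index with the equivalent positions
identical(x[mask], x[positions])
```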
Now, let's look at df
to see how [
works on two-dimensional objects. When working with 2D objects, the first parameter to [
tells it which rows you want, and the second parameter tells it which columns you want.
df[df$name == "John", ] # Get all rows where name equals "John" and ALL columns
df[, c(1, 3)] # Get all rows and only the first and third column
df[grepl("^J", df$name), 3] # Get all rows with names that start with "J" and only the third column
As we see above in the first two examples, you do not need to provide a value for each parameter in [
. If you leave one of the values blank, the default is to return all available rows or columns from the object. You'll also notice that we specifically call the column name even when we're specifying rows, such as df[df$name == "John", ]
. This is because we need R to understand which column we want to check to determine if we keep the row. Lastly, you should also notice that all of our prior understandings about [
in one-dimensional objects hold here. It expects a numeric input, or one that can be converted to a numeric input. So, in the first example, df$name == "John"
will result in a Boolean vector with values c(TRUE, FALSE, FALSE)
and R will then check which values are TRUE
and return a value of 1
, indicating that only the first row matches that criterion.
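Here is a short sketch of those row-selection steps on the example df. The name rows_j is illustrative. It also shows one detail worth knowing: selecting a single column returns a plain vector unless you ask for drop = FALSE.

```r
df <- data.frame(name = c("John", "Jim", "Sally"),
                 state = c("MI", "FL", "NY"),
                 state_leader = c(TRUE, FALSE, TRUE))

df$name == "John"               # one TRUE/FALSE value per row
which(df$name == "John")        # position of the matching row

rows_j <- grepl("^J", df$name)  # TRUE for "John" and "Jim"
df[rows_j, 3]                   # a single column comes back as a plain vector
df[rows_j, 3, drop = FALSE]     # drop = FALSE keeps a one-column data frame
```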
So now that we understand how [
works, let's see how to use it to solve our question here. We know that we want all of the columns, so we can ignore the second parameter in [
. And we know that we want only the rows where state_leader
is TRUE
. So let's use that condition in our index.
df[df$state_leader == TRUE, ]
name state state_leader
1 John MI TRUE
3 Sally NY TRUE
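Since state_leader is already a logical column, the comparison with TRUE is optional, and the column can serve as the row index on its own. A small sketch, rebuilding the same df (the names explicit and direct are illustrative):

```r
df <- data.frame(name = c("John", "Jim", "Sally"),
                 state = c("MI", "FL", "NY"),
                 state_leader = c(TRUE, FALSE, TRUE))

explicit <- df[df$state_leader == TRUE, ]  # the form used above
direct   <- df[df$state_leader, ]          # the logical column used directly

identical(explicit, direct)
```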
As an exercise for you, how would you make this output better by only returning the name
and state
variables?