Complex self-referencing of a dataframe

Question

I could not find anything that answered this question, so I apologize if it is a duplicate. I'm also not sure exactly how to phrase it.

Here is my example I created for stackoverflow - my real dataset is much more complex:

Here is the example dataframe I am using

The idea behind this is that this is a dataset of workers. Each worker has info in columns named name, age, State (where they are located), and State_Lead, a boolean column that represents whether or not they are the worker in charge of that State.

My goal here is twofold - I want code that

1) references the State and State_Lead columns and requires exactly 1 (not zero, not >1) State_Lead = TRUE per State. If there is more than or less than 1, I want to randomize who in each State becomes the State Lead

2) Calls up the current State_Lead=TRUE for each State. Ideally I could reference a State and be able to call anything from the row of the State_Lead (where the rows are named the same as the Name column).

#I made Jack not the state lead so the goal should be to return James and Jill
Database["Jack", "State_Lead"]=FALSE

All_States <- unique(Database$State)
All_States


##Here I thought I could cycle through each state and return the rows that matched each State Leader
heads <- NULL
for(i in All_States){
  heads <- append( heads, Database[, "State"==i])
  }

heads

## heads just returns "list()"

###attempt 2

heads <- NULL

for(i in All_States){
  if (sum(Database[Database[,"State"==i], "State_Lead"]) = 1)
    heads <-append(heads, Database[,"State"==i], "State_Lead"])
  else Database$State==i <- NA
    all_in_state <- subset(Database[, State="i"])
    sample(all_in_state, 1)

}

if everyone is a state lead are you flipping a coin for who is returned? — manotheshark, Dec 20 '16 at 04:10
FYI, the `r` tag means you don't need `(R)` in your title. Images aren't useful as example data. It sounds like you need to go through a beginner course. I highly recommend the `swirl` package which will guide you through how to do what you're after here (and what you're trying for [in your other, unaccepted, question](http://stackoverflow.com/questions/41214198/r-im-trying-to-reference-a-column-in-a-dataframe-with-an-if-statement-to-co)). — Jonathan Carroll, Dec 20 '16 at 04:17
Jonathan - I did do the swirl packages - I didn't see a clear answer there. My other question I solved. If you could recommend another intro course that would be great - or any solution for the question. brittenb - I'm away from my code for a little bit, I'll post it when I can. — user5827247, Dec 20 '16 at 04:19
@user5827247 for your other question, [try this one](http://stackoverflow.com/help/someone-answers). This isn't a code-writing service. Post your code and resulting issues and we might be able to tell you where you're going wrong. — Jonathan Carroll, Dec 20 '16 at 04:39

tblznbits · Accepted Answer · 2016-12-20T14:37:06.177

All right, so it looks like you're definitely brand new to programming as a whole, and not just R. So first and foremost, I'd highly recommend checking out some of the MOOCs on Coursera, such as this one. But, as for your question, let's look at each piece of it that seems to be causing confusion.

First, when asking for help on this site, it's always best to provide actual data, and not a picture of your dataset. Given that you already had a dataframe in R that you were working with, you could easily take advantage of the dput function and then copy its output into your question. So, for example, you might have the following dataframe:

df = data.frame(name=c("John", "Jim", "Sally"), state=c("MI", "FL", "NY"), state_leader=c(TRUE, FALSE, TRUE))
df
   name state state_leader
1  John    MI         TRUE
2   Jim    FL        FALSE
3 Sally    NY         TRUE

Then we can just use dput(df) and get the following output:

dput(df)
structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Jim", 
"John", "Sally"), class = "factor"), state = structure(c(2L, 
1L, 3L), .Label = c("FL", "MI", "NY"), class = "factor"), state_leader = c(TRUE, 
FALSE, TRUE)), .Names = c("name", "state", "state_leader"), row.names = c(NA, 
-3L), class = "data.frame")

Those of us on Stack Overflow can now copy the output from dput and have a working copy of your dataset.

Next, let's look at your confusion around how to set new values in a dataset. In your updated text, you tried to set state_leader equal to FALSE with the following code df["John", "state_leader"] = FALSE. The problem is the first part of the index. A bare string in the row position is matched against the row names of the dataframe, not against the name column. The df above has the default row names 1, 2, and 3, so "John" does not match any existing row and John's row is left unchanged. (If your row names really are the workers' names, as you describe for your own data, this form does work.) The second part of the index, "state_leader", is fine: it just picks the column to assign into. A way to do what you wanted that does not depend on row names is the following.

df[df$name == "John", "state_leader"] = FALSE

This way, R checks the name column itself and changes only the rows where it equals "John".
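A small sketch contrasting the two ways of picking a row: a logical test on the name column, and row names. The row-name part rests on an assumption that is not true of df as built above, namely that the names are unique and have been copied into rownames(df), as the question describes for its own data. The dataframe is rebuilt here (with stringsAsFactors = FALSE) so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained
df <- data.frame(name = c("John", "Jim", "Sally"),
                 state = c("MI", "FL", "NY"),
                 state_leader = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

# Logical row index: only the row where name is "John" is changed
df[df$name == "John", "state_leader"] <- FALSE

# Assumption: names are unique, so they can be used as row names
rownames(df) <- df$name

# Now a bare string in the row position matches a row name
df["Jim", "state_leader"] <- TRUE

df$state_leader
# FALSE TRUE TRUE
```

Either form touches a single cell; the logical test is the safer habit because it keeps working when the row names are just 1, 2, 3.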

So now that we have that, it'd probably be a good time to look at the [ operator and understand how it works. Because your complex algorithm for trying to find your values is not nearly as complex as you think when you understand how indexing works.

If you have a one-dimensional object in R, such as a vector, [ takes one parameter. If you have a two-dimensional object, such as a dataframe or matrix, [ takes two parameters, either one of which is optional. Let's look at a few examples.

x = 1:10 # A one-dimensional vector
x[1:3] # Get the first three elements of x
x[c(1, 3, 5, 7, 9)] # Get all odd elements of x
x[x %% 2 != 0] # Get all odd elements of x

In the examples above, we're working with a one-dimensional vector. The three operations we perform highlight a few key points about [. The first is that [ most commonly takes numeric positions. Second, those positions do not have to be consecutive. Lastly, the index can also be an expression that evaluates to a logical vector, such as x %% 2 != 0. You can think of this last example in the following way: First, R computes x %% 2. It then checks each element to see if it is equal to 0 or not, which returns a vector of Boolean values equal to TRUE or FALSE. It then keeps the elements at the positions where the value is TRUE, here c(1, 3, 5, 7, 9), which gives the same result as our second example.
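To make that mental model concrete, here is a short sketch that spells out the intermediate steps. The variable name keep is only for illustration; which() is the base R function that turns a logical vector into the positions of its TRUE values.

```r
x <- 1:10

keep <- x %% 2 != 0   # logical vector: TRUE where the value is odd
keep

which(keep)           # positions of the TRUE values
# 1 3 5 7 9

# Indexing with the logical vector or with the positions gives the same elements
identical(x[keep], x[which(keep)])
# TRUE
```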

Now, let's look at df to see how [ works on two-dimensional objects. When working with 2D objects, the first parameter to [ tells it which rows you want, and the second parameter tells it which columns you want.

df[df$name == "John", ] # Get all rows where name equals "John" and ALL columns
df[, c(1, 3)] # Get all rows and only the first and third column
df[grepl("^J", df$name), 3] # Get all rows with names that start with "J" and only the third column

As we see above in the first two examples, you do not need to provide a value for each parameter in [. If you leave one of the values blank, the default is to return all available rows or columns from the object. You'll also notice that we write out the column explicitly even when we're specifying rows, such as df[df$name == "John", ]. This is because we need R to understand which column we want to check to determine if we keep the row. Lastly, you should also notice that everything we learned about [ on one-dimensional objects holds here too: an index can be a set of numeric positions or a logical vector. So, in the first example, df$name == "John" results in a Boolean vector with values c(TRUE, FALSE, FALSE), and R keeps only the rows where the value is TRUE, which here is just row 1.
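The same steps can be sketched on the dataframe. The data is rebuilt here in its original form (John still a state leader) so the block runs on its own, and rows is just an illustrative name. The last two lines show one detail worth knowing: asking for a single column returns a plain vector unless you pass drop = FALSE.

```r
# Rebuild the example data so this block is self-contained
df <- data.frame(name = c("John", "Jim", "Sally"),
                 state = c("MI", "FL", "NY"),
                 state_leader = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

rows <- df$name == "John"   # TRUE FALSE FALSE
which(rows)                 # 1

df[rows, ]                  # logical row index
df[which(rows), ]           # the same row, picked by position

# One column, two shapes
df[grepl("^J", df$name), 3]                 # plain vector: TRUE FALSE
df[grepl("^J", df$name), 3, drop = FALSE]   # one-column dataframe
```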

So now that we understand how [ works, let's see how to use it to solve our question here. We know that we want all of the columns, so we can ignore the second parameter in [. And we know that we want only the rows where state_leader is TRUE. So let's use that condition in our index.

df[df$state_leader == TRUE, ]
   name state state_leader
1  John    MI         TRUE
3 Sally    NY         TRUE

As an exercise to you, how would you make this output better by only returning the name and state variables?
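For the first goal in the question (exactly one lead per state, with a random pick otherwise), one possible approach builds on the same indexing ideas. This is only a sketch under assumptions: the workers data, the helper name fix_leads, and the seed are all made up for illustration, and the rule "re-draw the lead from everyone in the state" is one reading of the requirement.

```r
# Hypothetical data: MI has two leads, FL has none, NY has exactly one
workers <- data.frame(
  name = c("John", "Jack", "Jim", "Jill", "Sally"),
  state = c("MI", "MI", "FL", "FL", "NY"),
  state_leader = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# fix_leads is a hypothetical helper, not a base R function
fix_leads <- function(d) {
  for (s in unique(d$state)) {
    in_state <- which(d$state == s)   # row positions for this state
    if (sum(d$state_leader[in_state]) != 1) {
      # sample.int() on the length avoids the sample() surprise
      # that occurs when in_state holds a single number
      pick <- in_state[sample.int(length(in_state), 1)]
      d$state_leader[in_state] <- FALSE
      d$state_leader[pick] <- TRUE
    }
  }
  d
}

set.seed(1)   # only so the random draw is repeatable
workers <- fix_leads(workers)

# Second goal: look up the lead's row for a given state
leads <- workers[workers$state_leader, ]
leads[leads$state == "NY", ]
```

States that already have exactly one lead, such as NY here, are left untouched, so Sally stays the NY lead regardless of the seed.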

Thank you so much for your detailed comment. It was very helpful. I didn't realize the `dput` function existed, and will use that in the future. Thanks as well for pointing me towards that MOOC. I've spent the last day going through the DataCamp R lessons and have found them incredibly helpful for understanding this. To answer your question (and mine, which seems very simple now..): `df[df$state_leader == TRUE, 1:2]` Would return only the State Leader rows, and the name and state columns. Thank you again for your patience and time. It is very appreciated. — user5827247, Dec 22 '16 at 08:31
Alternatively, you could even do: `df[df$state_leader == TRUE, c("name", "state")]` to return the same result - this seems better because you don't risk specifying incorrect column numbers in the case of a large dataset. — user5827247, Dec 22 '16 at 08:37
@user5827247 I'm really glad that you found it helpful. There's a steep learning curve, but it gets easier as you go. As a side note, on Stack Overflow it is considered proper etiquette to mark a question as answered by clicking the check mark next to the answer once a satisfactory answer has been provided. Good luck with your future learnings! — tblznbits, Dec 22 '16 at 12:26
Thank you. I was trying to figure out how to mark comments as answered earlier but didn't realize that they were different from answers. I marked yours. One of the biggest problems I am encountering is just not knowing of these very useful functions like `dput`. Is there a list somewhere you would recommend? — user5827247, Dec 22 '16 at 20:42
@user5827247 I'm not sure if there's a list anywhere that would show all the helpful functions that you'll need all the way. The thing that will help you the most in programming, in my opinion, is learning how to ask the most basic question that solves your problem. So, for example, you might ask "How can I easily transfer my data frame to stack overflow?" and just plug that into Google and see what you get. The other thing you should become overly familiar with in R is `??`. `??` followed by a word will search all of the help files in R for that word. — tblznbits, Dec 22 '16 at 20:47
