
Novice to R here. I am looking to do the following:

I have a dataset, let's call it dataset1, and I am looking to make a new data frame (dataset2). dataset1 contains people's names and the states they are in (thus, there are duplicate states, but no duplicate combinations of name and state). There are no more than 3 people listed per state. In dataset2, I am looking to create new columns named person1, person2, person3 associated with each state (i.e., each state has only one row). So, if Alice, Bob, and Cathy are from Alabama, and Dave and Edwin are from Alaska, there should be two rows: one for Alabama and one for Alaska, with the names in person1 through person3 (and the last column, person3, will be empty for Alaska).

I was thinking of storing the state name in a dummy variable, then using an if statement to go through the rows of dataset1 and append to dataset2 as needed. Something tells me, though, that there is a more concise way to do this than using for/if statements.

Any help?

  • Sample data and sample code are *much* more powerful than a textual description. Please make this question *reproducible*. This includes sample code (including listing non-base R packages), sample data (e.g., `dput(head(x))` or randomly-generated), and expected output. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. – r2evans Oct 11 '18 at 00:04
  • @r2evans Pretty hard to give a code description here, seeing as I am asking where to start. – Natasha P Oct 11 '18 at 00:10
  • Sure, but you also provided zero data. Do you see how JonSpring provided some randomly-generated data? This should be done by you (the asker) so that it meets all of your needs, and so that we (the answerers) don't have to guess (or guess wrong) what your data actually look like. (The one thing I would add to @JonSpring's random data is a `set.seed(1)` so that you can generate his exact sequence of random data.) – r2evans Oct 11 '18 at 00:35

1 Answer

This sounds like you need to take the data from long format to wide format.

Here's some fake data:

set.seed(42)
df <- data.frame(stringsAsFactors = FALSE,
  states = sample(state.name, size = 100, replace = TRUE),
  people = sample(LETTERS, size = 100, replace = TRUE)
  )

Here's an approach that groups by state, labels each person within a state as Person1, Person2, etc., and then spreads those labels out into columns:

library(tidyr); library(dplyr)
df2 <- df %>%
  group_by(states) %>%
  mutate(person = paste0("Person", row_number())) %>%
  ungroup() %>%
  spread(person, people, fill = "")

Output:

> df2
# A tibble: 44 x 6
   states   Person1 Person2 Person3 Person4 Person5
   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>  
 1 Alabama  Q       R       P       P       K      
 2 Alaska   R       M       K       L       C      
 3 Arkansas O       ""      ""      ""      ""     
 4 Colorado X       U       F       ""      ""     
 5 Delaware O       ""      ""      ""      ""     
 6 Georgia  L       N       V       O       ""     
 7 Hawaii   G       ""      ""      ""      ""     
 8 Idaho    W       L       J       C       ""     
 9 Illinois V       ""      ""      ""      ""     
10 Indiana  Y       Y       U       ""      ""    
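
In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`, so the same idea can also be written as below. This is only a sketch: the `dataset1` data frame and its column names `name` and `state` are assumptions, made up from the example in the question rather than taken from real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical data built from the example in the question
# (the column names `name` and `state` are assumed)
dataset1 <- data.frame(
  name  = c("Alice", "Bob", "Cathy", "Dave", "Edwin"),
  state = c("Alabama", "Alabama", "Alabama", "Alaska", "Alaska"),
  stringsAsFactors = FALSE
)

dataset2 <- dataset1 %>%
  group_by(state) %>%
  # Number the people within each state: person1, person2, ...
  mutate(person = paste0("person", row_number())) %>%
  ungroup() %>%
  # One row per state, one column per person slot;
  # slots with no person are filled with an empty string
  pivot_wider(
    names_from  = person,
    values_from = name,
    values_fill = list(name = "")
  )
```

Here `dataset2` should have one row for Alabama and one for Alaska, with Alaska's `person3` left as an empty string. Dropping `values_fill` would leave `NA` there instead, which is often easier to test for later.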
Jon Spring
  • Jon, can you add `set.seed(1)` (or some number) before your `df <-` code, then rerun the output? It helps a lot with reproducibility. – r2evans Oct 11 '18 at 00:35