0

I'm working off a dataframe like the below. I've done my best to format it in SO. The important thing is that there are the same number of comma separated entries in person, personparty and sponsordate (I've truncated the cells, so they may not be the same in this example but they are the same in the dataset).

bill                                               status       person                       personparty       sponsordate
A bill to amend chapter 44 of title 18, ....        2ND Sen.   David Vitter [R-LA]           Republican             12/05/2015
A bill to authorize the appropriation of funds....  RESTRICT    Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Barbara Boxer [D-CA]  Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat,     21/05/2014, 02/06/2015, 05/04/2017, 22/05/2014, 21/07/2014, 09/06/2014, 02/06/2014, 12/06/2014, 21/05/2014, 02/06/2014, 21/05/2014

I want to create a new dataframe of five columns. I essentially want to unlist these (non list) values into one larger dataframe.

The final dataframe should have a row for the ith comma separated entry in , keeping the same column values for bill and status.

So for example from the second row in my example dataset, there would be a row with the bill name (A bill to authorize the appropriation of funds....), status (RESTRICT), Ed Markey, Democrat, 21/05/2014. The next row would be the 2nd entry from the comma separated values (same bill name, same status, Sen. Ed Markey [D-MA], Democrat, 02/06/2015), etc.

For rows with just one value in the three last columns, they'd stay the same.

How do I essentially unnest these list-like values?

Union find
  • 7,759
  • 13
  • 60
  • 111

2 Answers2

1

You seem to be looking for separate_rows.

Assumption: Comma separated values in these three columns are of same numbers. It is based on the excerpt from your post - "The important thing is that there are the same number of comma separated entries in person, personparty and sponsordate"

library(dplyr)
library(tidyr)

df %>%
  separate_rows(person, personparty, sponsordate, sep=",") 

Output is:

                                                bill   status                     person personparty
1       A bill to amend chapter 44 of title 18, .... 2ND Sen.        David Vitter [R-LA]  Republican
2 A bill to authorize the appropriation of funds.... RESTRICT      Sen. Ed Markey [D-MA]    Democrat
3 A bill to authorize the appropriation of funds.... RESTRICT      Sen. Ed Markey [D-MA]    Democrat
4 A bill to authorize the appropriation of funds.... RESTRICT      Sen. Ed Markey [D-MA]    Democrat
5 A bill to authorize the appropriation of funds.... RESTRICT  Sen. Barbara Boxer [D-CA]    Democrat
  sponsordate
1  12/05/2015
2  21/05/2014
3  02/06/2015
4  05/04/2017
5  22/05/2014

Sample data:

df <- structure(list(bill = structure(1:2, .Label = c("A bill to amend chapter 44 of title 18, ....", 
"A bill to authorize the appropriation of funds...."), class = "factor"), 
    status = structure(1:2, .Label = c("2ND Sen.", "RESTRICT"
    ), class = "factor"), person = structure(1:2, .Label = c("David Vitter [R-LA]", 
    "Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Barbara Boxer [D-CA]"
    ), class = "factor"), personparty = structure(c(2L, 1L), .Label = c("Democrat, Democrat, Democrat, Democrat", 
    "Republican"), class = "factor"), sponsordate = structure(1:2, .Label = c("12/05/2015", 
    "21/05/2014, 02/06/2015, 05/04/2017, 22/05/2014"), class = "factor")), .Names = c("bill", 
"status", "person", "personparty", "sponsordate"), class = "data.frame", row.names = c(NA, 
-2L))
Prem
  • 11,775
  • 1
  • 19
  • 33
0

Not sure I have understood what you want, so I begin with the data frame I assume you have:

df=structure(list(bill = c("A bill to amend chapter 44 of title 18, .<U+0085>", 
"A bill to authorize the appropriation of funds...."), status = c("2ND Sen.", 
"RESTRICT"), person = c("David Vitter [R-LA]", "Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Ed Markey [D-MA], Sen. Barbara Boxer [D-CA]"
), personparty = c("Republican", "Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat, Democrat,"
), sponsordate = c("12/05/15", "21/05/2014, 02/06/2015, 05/04/2017, 22/05/2014, 21/07/2014, 09/06/2014, 02/06/2014, 12/06/2014, 21/05/2014, 02/06/2014, 21/05/2014"
)), .Names = c("bill", "status", "person", "personparty", "sponsordate"
), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
), spec = structure(list(cols = structure(list(bill = structure(list(), class = c("collector_character", 
"collector")), status = structure(list(), class = c("collector_character", 
"collector")), person = structure(list(), class = c("collector_character", 
"collector")), personparty = structure(list(), class = c("collector_character", 
"collector")), sponsordate = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("bill", "status", "person", "personparty", 
"sponsordate")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Now I understand you want to expand the 2nd row to many rows. If 'many' means all combinations of vector elements of cols 3,4,5 of row 2 and append it to the data frame (overlapping row 2) you can do it as follows:

librart(stringr)
x01=str_split(df$person[2],",")[[1]]
x02=str_split(df$personparty[2],",")[[1]]
x03=str_split(df$sponsordate[2],",")[[1]]
x04=expand.grid(x01,x02,x03)
df0=do.call("rbind", replicate(nrow(x04), df[2,], simplify = FALSE))
df0[2:(nrow(x04)+1),3:5]=as.matrix(x04)

Hope this helps

Antonios
  • 1,919
  • 1
  • 11
  • 18