1

Problem

I have a data frame that looks like this:

  ID    User.Food matched.indexes
1  1         milk        2, 8, 15
2  2       apples                
3  3        bread            4, 6
4  4    ice cream               5
5  5 boxed fruits  

The matched.indexes column contains a vector of integers. I want to convert this into long format so each matched index is on one row:

  ID    User.Food matched.indexes
1  1         milk               2
2  1         milk               8
3  1         milk              15
4  2       apples              NA     
5  3        bread               4
6  3        bread               6
7  4    ice cream               5
8  5 boxed fruits              NA

All of the questions and tutorials I have seen focus on changing a wide data frame with multiple named columns into long format (melt, gather, etc) or on separating out a cell that holds a string "2, 8, 15", but with this I am unclear on how to unpack the vector within the matched.indexes column?

Data

This data frame comes from the results of using agrep to get possible matches from a food groups data frame. Code to reproduce it is below:

df1 <- structure(list(ID = 1:5, 
                 User.Food = c("milk", "apples", "bread", "ice cream",  
                               "boxed fruits"), 
                 matched.indexes = list(c(2, 8, 15), NA, c(4,6), c(5),
                                        NA)), 
                 .Names = c("ID", "User.Food", "matched.indexes"), 
                 class = "data.frame", 
                 row.names = c("1", "2", "3", "4", "5"))
Tom Cooper
  • 611
  • 1
  • 8
  • 16

2 Answers2

0

We can use separarte_rows with convert = TRUE to change the class from character to numeric, thereby replacing those blanks ("" to NA

library(tidyr)
separate_rows(df1, matched.indexes, convert = TRUE)
#   ID    User.Food matched.indexes
#1  1         milk               2
#2  1         milk               8
#3  1         milk              15
#4  2       apples              NA
#5  3        bread               4
#6  3        bread               6
#7  4    ice cream               5
#8  5 boxed fruits              NA

data

df1 <- structure(list(ID = 1:5, User.Food = c("milk", "apples", "bread", 
"ice cream", "boxed fruits"), matched.indexes = c("2, 8, 15", 
"", "4, 6", "5", "")), .Names = c("ID", "User.Food", "matched.indexes"
 ), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks, this almost gets me there, but the values in `matched.indexes` are not strings `"2, 8, 15"` they are vectors of integers `c(2, 8 ,15)`. So the `separate_rows` function includes a row for `c` and a blank row after every final value in the list? – Tom Cooper Mar 08 '18 at 13:12
  • @TomCooper Sorry, I thought it is a string. Did you meant that it is a `list` column i..e. `df1$matched.indexes <- list(c(2, 8, 15), NULL, c(4, 6), 5, NULL)`. If you can provide the correct structure i.e. what are those blanks.. Is it `NULL` or something else – akrun Mar 08 '18 at 13:14
  • Sorry I should have been clearer and provided a toy example (I have edited my question to include one). Yes, `matched.indexes` is as you describe however instead of `NULL` the empty cells are vectors containing a single `integer(0)`. I have written a function to replace them with `NA`s. – Tom Cooper Mar 08 '18 at 13:48
0

Using Tidyverse

library(tidyverse)

 df1 %>% 
  mutate(matched.indexes =str_split(matched.indexes, ",")) %>% 
  unnest() %>% 
  na_if("")

Output

  ID    User.Food matched.indexes
1  1         milk               2
2  1         milk               8
3  1         milk              15
4  2       apples            <NA>
5  3        bread               4
6  3        bread               6
7  4    ice cream               5
8  5 boxed fruits            <NA>
vsb
  • 428
  • 6
  • 15