Extending a dataset based on multiple IDs in a column

Question

This is a data-wrangling problem. I have a dataset in which each row does not represent a single sample; instead, one column contains a list of IDs. For example, with three columns (age, sex, and IDs), one row could be: 28, M, 'ID209,ID208'.

Is there an easy way to expand this dataset so that I have one row per ID? I'm working in R or Python.

Yes, you'll have to provide a sample of the data for us to help though. If you're using R, see [here for creating a reproducible example](https://stackoverflow.com/a/5963610/4421870) — Mako212, Oct 19 '17 at 21:44
R solution: `library(tidyverse); df %>% mutate(id = stringr::str_split(id, ",")) %>% unnest(id)` — tblznbits, Oct 19 '17 at 21:58

score 1 · Answer 1 · answered Oct 19 '17 at 21:50

This may not be the cleanest Python solution, but it should get you started.

This assumes that each row has already been parsed into a list of the form [age, sex, 'ids'], where the last element is the comma-separated ID string. The code should be easy to adapt to your actual row format.

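# Sample input taken from the question; replace with your own rows.
dataset = [[28, 'M', 'ID209,ID208']]

new_rows = []
for row in dataset:
    age, sex, ids = row
    # Emit one output row per comma-separated ID, however many there are.
    for single_id in ids.split(','):
        new_rows.append([age, sex, single_id.strip()])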

print(new_rows)
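
If the data is already in a pandas DataFrame, the same idea could be sketched as follows. This is only an illustration under assumptions: the column names `age`, `sex`, and `ids` and the second sample row are made up to mirror the question, and `DataFrame.explode` requires pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample data; the first row is the example from the question.
df = pd.DataFrame({
    "age": [28, 35],
    "sex": ["M", "F"],
    "ids": ["ID209,ID208", "ID101"],
})

# Split each comma-separated string into a list, then emit one row per element.
long_df = (
    df.assign(id=df["ids"].str.split(","))
      .explode("id")
      .drop(columns="ids")
      .reset_index(drop=True)
)

print(long_df)
```

The result has one row per ID (three rows for this sample), with age and sex repeated for each ID that came from the same original row.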

I hope that helps.

score 1 · Answer 2 · answered Oct 19 '17 at 21:59

An R solution using tidytext. Assuming that values in column ids are comma-separated:

library(tidytext)
library(stringr)

df1 <- data.frame(age = 28, 
                  sex = "M", 
                  ids = "ID209,ID208", 
                  stringsAsFactors = FALSE)

df1 %>% 
  unnest_tokens(id, ids, token = str_split, pattern = ",", to_lower = FALSE)

    age sex    id
1    28   M ID209
1.1  28   M ID208
