-1

It's a data-wrangling problem, issue with a query. I have a dataset, and each row does not represent 1 sample but contains one column which has a list of ID's. For example, You have 3 columns: age, sex & ID's. You could have one row: 28, M, 'ID209,ID208'.

Are there easy ways to extend this data-set such that I have one row per ID number? I'm working with R or Python.

Victor
  • 3,669
  • 3
  • 37
  • 42
blah_crusader
  • 424
  • 1
  • 5
  • 14
  • 2
    Yes, you'll have to provide a sample of the data for us to help though. If you're using R, see [here for creating a reproducible example](https://stackoverflow.com/a/5963610/4421870) – Mako212 Oct 19 '17 at 21:44
  • R solution: `library(tidyverse); df %>% mutate(id = stringr::str_split(id, “,”)) %>% unnest(id)` – tblznbits Oct 19 '17 at 21:58
  • thanks, unnest was really the function i was looking for! – blah_crusader Oct 21 '17 at 08:29

2 Answers2

1

This may not be the cleanest Python solution, but it should get you started.

This assumes that you have split rows down into a list of this form: [age, sex, 'ids']. This code should be easy to modify to fit your actual row format, but this should be sufficient to get you started.

new_rows = []
for row in dataset:
    id1, id2 = row[2].split(',')
    new_rows.append([row[0], row[1], id1])
    new_rows.append([row[0], row[1], id2])

print(new_rows)

I hope that helps.

L. MacKenzie
  • 493
  • 4
  • 14
1

An R solution using tidytext. Assuming that values in column ids are comma-separated:

library(tidytext)
library(stringr)

df1 <- data.frame(age = 28, 
                  sex = "M", 
                  ids = "ID209,ID208", 
                  stringsAsFactors = FALSE)

df1 %>% 
  unnest_tokens(id, ids, token = str_split, pattern = ",", to_lower = FALSE)

    age sex    id
1    28   M ID209
1.1  28   M ID208
neilfws
  • 32,751
  • 5
  • 50
  • 63