0

My data currently looks like

UserID Full Name  DOB     EncounterID QuestionID Name  Type  label                responses    
1      John Smith 1-1-90  13          505        Intro Check Were you given any info?  yes
1      John Smith 1-1-90  13          506        Care  Check By using this service..   yes
1      John Smith 1-1-90  13          507        Out   Check How satisfied are you?    vsat
2      Jane Doe   2-2-80  14          505        Intro Check Were you given any info?  no
2      Jane Doe   2-2-80  14          506        Care  Check By using this service..   no
2      Jane Doe   2-2-80  14          507        Out   Check How satisfied are you?    unsat

My code to transform it from long to wide looks like

gwlsubset <- read.csv("subset.csv", header = TRUE)
gwlsubset

install.packages("tidyr")
library("tidyr")

subset<- pivot_wider(gwlsubset, id_cols = c( ID, full_name, date_of_birth, encounterID,                             
                                  practice_name, practice_id  ), 
          names_from = c(label), 
          values_from = response)

The code works perfectly fine when I run it from a subset of my data (300 records). I get something below

UserID F_Name  Were you given any info?   By using this service..?   How satisfied are you?
1      John Smith  yes                        yes                        very satisfied
2      Jane Doe    no                         no                         unsatisfied

However when I run with 1,000+ records, I get the error below

Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates

I'm thinking that I have some duplicate rows that's prompting the error or it could be something else too.

My output also changes into random numbers.

UserID   F_Name   Were you given any info?   By using this service..?   How satisfied are you?
1        John Smith  1                          824                          38
2        Jane Doe    7                          176                          445

How can I edit my code to get rid of duplicates? What else do you think may be causing the error and output with numbers?

I've tried the codes in my error message but wasn't able to get anywhere for example

values_fn = {summary_fun}
Ash S
  • 119
  • 5
  • This means you have more than 1 row for some combination of your ID variables. It'd be like, for example, if John Smith have two rows where the Name is "Intro". If this is expected then I think the answer showing how to make a sequence per group will help. If it is unexpected than I'd try to track down the problem by taking your dataset, grouping by all of your ID variables, and then filtering to where n() > 1. Like, simple pseudo-code, `dataset %>% group_by(variable1, variable2, etc) %>% filter(n() > 1) to print out problem rows. – aosmith Oct 07 '21 at 22:11
  • Can you provide a sample using `dput` which gives an error? (`dput(gwlsubset)`) Also share what is your expected output for this data? – Ronak Shah Oct 08 '21 at 01:51
  • @aosmith yes it's expected for there to be exactly two rows but the only difference is the time they were sent out. How would you write the code to the sequence? – Ash S Oct 08 '21 at 03:56
  • If you need to make unique identifiers for your different times so you keep them both (i.e., make a "time" column), then [this answer](https://stackoverflow.com/a/58837832/2461552) is likely relevant. – aosmith Oct 08 '21 at 14:33

1 Answers1

0

We may need to create a sequence column

library(data.table)
library(dplyr)
library(tidyr)
gwlsubset %>%
    mutate(rn = rowid(ID, full_name, date_of_birth, encounterID,                             
                                  practice_name, practice_id)) %>%
    pivot_wider(names_from = label, values_from = response)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I tried that and i also tried installing the data.table and dplyr package but it didn't work. I now get the error: Error: Can't subset columns that don't exist. x Column `label` doesn't exist. but I know the column 'label' exists. This is odd. – Ash S Oct 08 '21 at 04:33
  • @AshS what is the error message – akrun Oct 08 '21 at 04:34
  • @AshS your comment is broken i.e. without the error message – akrun Oct 08 '21 at 04:34
  • @AshS your input data shows the column name as `label` though? – akrun Oct 08 '21 at 04:35
  • My input does have column name as label so it's really odd that the error now says that it doesn't exist. – Ash S Oct 08 '21 at 04:40
  • @AshS can you show the column names after `gwlsubset %>% mutate(rn = rowid(ID, full_name, date_of_birth, encounterID, practice_name, practice_id)) %>% names` – akrun Oct 08 '21 at 04:42
  • I get this error ```Error in mutate(., rn = rowid(mrn, full_name, date_of_birth, encounter_date, : could not find function "mutate"``` I installed the data.table and dplyr packages too. – Ash S Oct 08 '21 at 04:47
  • @AsH That means you didn't load `library(dplyr)` ? – akrun Oct 08 '21 at 04:47
  • I loaded it again. It worked. Now I get the error ```Error: Problem with `mutate()` column `rn`. i `rn = rowid(...)`. x could not find function "rowid"``` – Ash S Oct 08 '21 at 04:48
  • @AshS you said you loaded `library(data.table)`. It is from data.table – akrun Oct 08 '21 at 04:49
  • Probably you only installed the package, but didn't load it i.e. load by `library(dplyr);library(data.table)` – akrun Oct 08 '21 at 04:50
  • it works now. I don't get that error but I do get ```Error: Can't subset columns that don't exist. x Column `label` doesn't exist.``` – Ash S Oct 08 '21 at 04:53
  • @AshS can you show the output of `gwlsubset %>% mutate(rn = rowid(ID, full_name, date_of_birth, encounterID, practice_name, practice_id)) %>% names` – akrun Oct 08 '21 at 04:53
  • @AshS I am suspecting that your `label` column may be having some leading/lagging spaces? – akrun Oct 08 '21 at 04:56
  • I get this ```> "pivot_wider(names_from = label, + values_from = response) %>% names" [1] "pivot_wider(names_from = label, \n values_from = response) %>% names"``` I don't have any spaces in the label column, I just checked. @akrun – Ash S Oct 08 '21 at 04:58
  • @AshS you are quoting the expression with `"`. I don't understand what is going on – akrun Oct 08 '21 at 05:01
  • @AshS Please copy/paste this line of code `gwlsubset %>% mutate(rn = rowid(ID, full_name, date_of_birth, encounterID, practice_name, practice_id)) %>% names` and see what it returns – akrun Oct 08 '21 at 05:01
  • 1
    @AshS Then, I am not sure why `pivot_wider` is not able to find the `label` column – akrun Oct 08 '21 at 05:10
  • @AshS can you update your post with a small `dput` of your data that shows the issue – akrun Oct 08 '21 at 05:14
  • thanks for your time. you're awesome. I'll keep trying and changing stuff around. It didn't happen before. – Ash S Oct 08 '21 at 05:14
  • @AshS try this on a fresh R session with only packages loaded `library(dplyr);library(data.table);library(tidyr)` – akrun Oct 08 '21 at 05:15
  • what else do I enter in the fresh session? I would update my post but my output still says ```UserID F_Name Were you given any info? By using this service..? How satisfied are you? 1 John Smith 1 824 38 2 Jane Doe 7 176 445 ``` – Ash S Oct 08 '21 at 05:18
  • @AshS you load your data `gwlsubset` i.e. `gwlsubset <- read.csv("subset.csv", header = TRUE)` – akrun Oct 08 '21 at 05:19
  • @AshS can you update your post with `dput(droplevels(head(gwlsubset, 20)))` – akrun Oct 08 '21 at 05:20
  • thanks I ran it, no error but it still shows random numbers in my output – Ash S Oct 08 '21 at 05:21
  • @AshS If you can update with `dput` as in my previous comment, I can test it. `dput` gives the exact same structure of your data and thus is reproducible – akrun Oct 08 '21 at 05:21
  • I can't copy paste it because it's sensitive information. structure It starts off with ```(list(ID = structure(c(18L, 13L, 15L, 2L, 4L, 9L, 12L, 14L, 7L, 20L, 3L, 10L, 11L, 17L, 8L, 1L, 19L, 6L, 5L, 16L)``` – Ash S Oct 08 '21 at 05:27
  • @AshS the partial dput is not helpful for reproduction of data i..e if I copy/paste that on my R console, it restults in error – akrun Oct 08 '21 at 05:28