0

My dataset is organized as shown below (just a small extract) : for a given subject (here, subject=5), I have 3 tests performed at times D-1, D1-8h and D2-24h:

    SUBJECT   TIME                    TEST RESULT UNITS              RANGES
591       5    D-1    Leukoyte count urine      1   /?L            |-< 15|-
592       5    D-1 Erythrocyte count urine      0   /?L            |-< 19|-
593       5    D-1  Glucose dipstick urine Normal  None |+ from 50 mg/dL-|-
684       5  D1 8h    Leukoyte count urine      0   /?L            |-< 15|-
687       5  D1 8h Erythrocyte count urine      0   /?L            |-< 19|-
683       5  D1 8h  Glucose dipstick urine Normal  None |+ from 50 mg/dL-|-
694       5 D2 24h    Leukoyte count urine      1   /?L            |-< 15|-
695       5 D2 24h Erythrocyte count urine      0   /?L            |-< 19|-
696       5 D2 24h  Glucose dipstick urine Normal  None |+ from 50 mg/dL-|-

I would like to reorganize these data in a table set by columns, in the following form:

TEST D-1 D1-8h D2-24h UNITS RANGES

so that I get one line by test.

I am confused with "table" and "aggregate", and I don't find a proper way to do that, although I am sure it's not so complicated...

Could you gve me some help?

Thanks

Here is dput:

> dput(dataset)
structure(list(SUBJECT = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L
), TIME = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("D-1", 
"D1 8h", "D2 24h", "D4 72h"), class = "factor"), TEST = structure(c(35L, 
24L, 28L, 35L, 24L, 28L, 35L, 24L, 28L), .Label = c("", "Alkaline phosphatase", 
"APTT", "Basophils", "Basophils (%)", "Calcium", "CD19", "CD19 abs.", 
"CD3", "CD3 abs.", "CD4/CD8 ratio", "CD4+", "CD4+ abs.", "CD56", 
"CD56 absolute", "CD8+", "CD8+ abs.", "Chloride", "CK (creatine kinase)", 
"Creatinine", "Direct bilirubin (conjug)", "Eosinophils", "Eosinophils (%)", 
"Erythrocyte count urine", "Erythrocyte dipstick urine", "Gamma GT", 
"Glucose", "Glucose dipstick urine", "GOT (AST)", "GPT (ALT)", 
"Hematocrit", "Hemoglobin", "Ketone bodies urine", "Leukocyte esterase urine", 
"Leukoyte count urine", "Lymphocytes", "Lymphocytes (%)", "Monocytes", 
"Monocytes (%)", "Neutrophils", "Neutrophils (%)", "pH urine", 
"Platelet count", "Potassium", "Protein urine", "PT INR", "Red blood cell count", 
"Reticulocytes", "Reticulocytes %", "Serum  Albumine", "Sodium", 
"Total bilirubin", "Total cholesterol", "Total protein", "Triglycerides", 
"Urea", "Urine glucose quantitative", "Urine protein quantitative", 
"White blood cell count"), class = "factor"), RESULT = c("1", 
"0", "Normal", "0", "0", "Normal", "1", "0", "Normal"), UNITS = c("/?L", 
"/?L", "None", "/?L", "/?L", "None", "/?L", "/?L", "None"), RANGES = c("|-< 15|-", 
"|-< 19|-", "|+ from 50 mg/dL-|-", "|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-", 
"|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-")), .Names = c("SUBJECT", 
"TIME", "TEST", "RESULT", "UNITS", "RANGES"), row.names = c(591L, 
592L, 593L, 684L, 687L, 683L, 694L, 695L, 696L), class = "data.frame")
Andrew
  • 926
  • 2
  • 17
  • 24
  • Check out [Reshaping data.frame from wide to long format](https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format) – Rui Barradas Feb 23 '19 at 16:50
  • Well, it seems that the problem addressed is exactly the opposite of mine... So I suppose I just have to "reverse" the instructions; but, My God, I really dont understand how to do this... and apply the given solution to may own problem. I feel completely stupid (maybe I am). – Andrew Feb 23 '19 at 17:15
  • Sampling your data and providing a working data set via `dput` might be helpful to some. – Peter_Evan Feb 23 '19 at 17:17
  • Ok, sorry, I messed up, here is the accurate link: [reshape data from long to wide in R](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format). – Rui Barradas Feb 23 '19 at 18:14

2 Answers2

1

Is this it? If so I believe it should be marked as a duplicate of reshape data from long to wide in R.

library(tidyverse)

spread(dataset, key = TIME, value = UNITS)
#  SUBJECT                    TEST RESULT              RANGES  D-1 D1 8h D2 24h
#1       5 Erythrocyte count urine      0            |-< 19|-  /?L   /?L    /?L
#2       5  Glucose dipstick urine Normal |+ from 50 mg/dL-|- None  None   None
#3       5    Leukoyte count urine      0            |-< 15|- <NA>   /?L   <NA>
#4       5    Leukoyte count urine      1            |-< 15|-  /?L  <NA>    /?L

Edit.

In his comment, Peter_Evan corrects the above. Th right solution is

spread(dataset, key = TIME, value = RESULT)
#  SUBJECT                    TEST UNITS              RANGES    D-1  D1 8h D2 24h
#1       5 Erythrocyte count urine   /?L            |-< 19|-      0      0      0
#2       5  Glucose dipstick urine  None |+ from 50 mg/dL-|- Normal Normal Normal
#3       5    Leukoyte count urine   /?L            |-< 15|-      1      0      1

or, if the OP wants to reorder the columns, the following.

dataset %>%
  spread(key = TIME, value = RESULT) %>%
  select(SUBJECT,TEST, `D-1`:`D2 24h`, UNITS, RANGES)
#  SUBJECT                    TEST    D-1  D1 8h D2 24h UNITS              RANGES
#1       5 Erythrocyte count urine      0      0      0   /?L            |-< 19|-
#2       5  Glucose dipstick urine Normal Normal Normal  None |+ from 50 mg/dL-|-
#3       5    Leukoyte count urine      1      0      1   /?L            |-< 15|-
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • 1
    I think you have the value wrong. `x %>% spread(TIME,RESULT)` is perhaps the intended output? `x %>% spread(TIME,RESULT) %>% select(SUBJECT,TEST, `D-1`:`D2 24h`, UNITS, RANGES)` to for completeness. – Peter_Evan Feb 23 '19 at 18:31
0

I believe you are asking for a pretty straight forward implementation of dcast() which takes data from long to wide. Here's one implementation using data.table package.

library(data.table)
#> Warning: package 'data.table' was built under R version 3.4.4

x <- structure(list(SUBJECT = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L
), TIME = c("D-1", "D-1", "D-1", "D1 8h", "D1 8h", "D1 8h", "D2 24h", 
            "D2 24h", "D2 24h"), TEST = c("Leukoyte count urine", "Erythrocyte count urine", 
                                          "Glucose dipstick urine", "Leukoyte count urine", "Erythrocyte count urine", 
                                          "Glucose dipstick urine", "Leukoyte count urine", "Erythrocyte count urine", 
                                          "Glucose dipstick urine"), RESULT = c("1", "0", "Normal", "0", 
                                                                                "0", "Normal", "1", "0", "Normal"), UNITS = c("/?L", "/?L", "None", 
                                                                                                                              "/?L", "/?L", "None", "/?L", "/?L", "None"), RANGES = c("|-< 15|-", 
                                                                                                                                                                                      "|-< 19|-", "|+ from 50 mg/dL-|-", "|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-", 
                                                                                                                                                                                      "|-< 15|-", "|-< 19|-", "|+ from 50 mg/dL-|-")), .Names = c("SUBJECT", 
                                                                                                                                                                                                                                                  "TIME", "TEST", "RESULT", "UNITS", "RANGES"), row.names = c(NA, 
                                                                                                                                                                                                                                                                                                              -9L), class = c("data.table", "data.frame"))



  dcast(SUBJECT + TEST ~ TIME, data = x, value.var = c("UNITS", "RANGES"))
#>    SUBJECT                    TEST UNITS_D-1 UNITS_D1 8h UNITS_D2 24h
#> 1:       5 Erythrocyte count urine       /?L         /?L          /?L
#> 2:       5  Glucose dipstick urine      None        None         None
#> 3:       5    Leukoyte count urine       /?L         /?L          /?L
#>             RANGES_D-1        RANGES_D1 8h       RANGES_D2 24h
#> 1:            |-< 19|-            |-< 19|-            |-< 19|-
#> 2: |+ from 50 mg/dL-|- |+ from 50 mg/dL-|- |+ from 50 mg/dL-|-
#> 3:            |-< 15|-            |-< 15|-            |-< 15|-

Created on 2019-02-23 by the reprex package (v0.2.1)

Maybe this is what you want (if not, put the expected output in your question to avoid everyone guessing):

dcast(SUBJECT + TEST + UNITS + RANGES ~ TIME, data = df, value.var = "RESULT")

  SUBJECT                    TEST UNITS              RANGES    D-1  D1 8h D2 24h
1       5 Erythrocyte count urine   /?L            |-< 19|-      0      0      0
2       5  Glucose dipstick urine  None |+ from 50 mg/dL-|- Normal Normal Normal
3       5    Leukoyte count urine   /?L            |-< 15|-      1      0      1
Chase
  • 67,710
  • 18
  • 144
  • 161
  • 1
    I am sorry, but it's not what I am looking for; the result should be one line by test, and then, on each line, the results observed at times D-1 etc., then one column indicating the units (corresponding to the test on the line) and then one column indicating the ranges (corresponding to the test on the line). – Andrew Feb 23 '19 at 17:54
  • @Andrew - you're getting closer (I see you added data), but if you can make your example reproducible, as outlined in [this link](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), you'll get better answers. I'm 99% confident your question will still be answered by `dcast()`... – Chase Feb 23 '19 at 19:20
  • @Andrew - maybe this is what you want: `dcast(SUBJECT + TEST + UNITS + RANGES ~ TIME, data = df, value.var = "RESULT")` which gives you the following columns: `SUBJECT TEST UNITS RANGES D-1 D1 8h D2 24h` with the values in `RESULT` populating the columns D1, D1 8h, and D2 24h – Chase Feb 23 '19 at 19:20
  • Thanks Chase ! The result is perfect. After browsing a lot here and there, I came independently to the same (almost) solution by using the following code : 'melting <- melt(x, id.vars = c("TEST", "TIME", "UNITS", "RANGES"), measured.vars = c("RESULT")) casting <- dcast(melting, TEST+UNITS+RANGES ~ TIME)' . And then, re-arranging a bit : 'casting.sort <- aqw[order(casting$TEST),] casting.final <- casting.sort[ , c(1,4,5,6,2,3)]' I just cancelled SUBJECT, as it was always the same in my dataset, so not a variable... Your remarks helped me a lot in finding a suitable solution. Tanks all ! – Andrew Feb 23 '19 at 20:29
  • Didn't use backtiks, sorry; then: `melting <- melt(x, id.vars = c("TEST", "TIME", "UNITS", "RANGES"), measured.vars = c("RESULT")) casting <- dcast(melting, TEST+UNITS+RANGES ~ TIME)` . And then, re-arranging a bit : `casting.sort <- aqw[order(casting$TEST),] casting.final <- casting.sort[ , c(1,4,5,6,2,3)]` – Andrew Feb 23 '19 at 21:39