3

I need to manipulate the raw data (csv) to a wide format so that I can analyze in R or SPSS.

It looks something like this:

1,age,30 
1,race,black 
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99

Ideally it would look like:

ID,age,race,scale_total, etc
1, 30, black, 35 
2, 20, white, 99

I added values to the top row of the raw data (ID, Question, Response) and tried the cast function but I believe this aggregated data instead of just transforming it:

data_mod <- cast(raw.data2, ID~Question, value="Response")
Aggregation requires fun.aggregate: length used as default
Andrew Gustar
  • 17,295
  • 1
  • 22
  • 32
blacksheep
  • 31
  • 1
  • 1
    Possible duplicate of [How to reshape data from long to wide format?](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) – camille Dec 19 '18 at 22:48

3 Answers3

1

We need a sequence column to be created to take care of the duplicate rows which by default results in aggregation to length

library(data.table)
dcast(setDT(df1), ID + rowid(Question) ~ Question, value.var = 'Response')

NOTE: The example data clearly works (giving expected output) without using the sequence column.

dcast(setDT(df1), ID ~ Question)
#   ID age   race scale_total
#1:  1 30  black           35
#2:  2  20  white          99

So, this is a case when applied on the full dataset with duplicate rows

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Question = c("age", 
"race", "scale_total", "age", "race", "scale_total"), Response = c("30", 
 "black ", "35", "20", "white", "99")), class = "data.frame", 
 row.names = c(NA, -6L))
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Each ID may not necessarily have every Question variable associated with it, e.g., ID 1 may not have a race variable present but ID 2 and 3 do, because participants only viewed/answered applicable questions. How would you handle this if I wanted "NA" to fill in that "missing" data? When I run the script above, it creates multiple rows for some ID's. @andrew – blacksheep Apr 01 '19 at 16:32
  • @blacksheep I think your message is meant for andrew – akrun Apr 02 '19 at 01:57
  • Apologies, I meant to reply to you. I used the code you provided but it created multiple rows for each ID and I wasn't sure a way around that. – blacksheep Apr 03 '19 at 16:02
  • @blacksheep Is it based on the same example (as I couldn't reproduce) – akrun Apr 03 '19 at 16:04
1

You could use tidyr...

library(tidyr)
df<-read.csv(text="1,age,30 
    1,race,black 
    1,scale_total,35
    2,age,20
    2,race,white
    2,scale_total,99", header=FALSE, stringsAsFactors=FALSE)

df %>% spread(key=V2,value=V3)

  V1 age   race scale_total
1  1 30  black           35
2  2  20  white          99
Andrew Gustar
  • 17,295
  • 1
  • 22
  • 32
0

For SPSS:

data list list/ID (f5) Question Response (2a20).
begin data
1 "age" "30" 
1 "race" "black" 
1 "scale_total" "35"
2 "age" "20"
2 "race" "white"
2 "scale_total" "99"
end data.

casestovars /id=id /index=question.

Note that the resulting variables age and scale_total will be string variables - you'll have to turn them into numbers before further transformations:

alter type age scale_total (f8).
eli-k
  • 10,898
  • 11
  • 40
  • 44