dcast changes content of dataframe

Question

I tried using the reshape package to reshape a data frame, but the reshaped result contains numbers that differ from the original values, which should not happen.

The data frame contains several variables that were measured multiple times. Each person has 6 rows, one for each time that person was measured. I want to reshape the data frame so that there is only one row per person instead of 6, which means every variable should appear 6 times (once per measurement). This should be easy to do with the following code:

melteddata <- melt(daten, id=(c("IDParticipant", "looporder")))

datenrestrukturiert <- dcast(melteddata, IDParticipant~looporder+variable)

Here "daten" is the original data frame and "looporder" is the variable that records the time of measurement (1-6). Here is an example (unfortunately I could not figure out how to post tables):

https://www.dropbox.com/s/8c9dm4rttedbzw1/daten.jpg?dl=0

or maybe this is fine:

structure(list(IDParticipant = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L), looporder = c(1L, 2L, 3L, 5L, 6L, 2L, 3L, 
5L, 6L, 1L, 2L, 3L), pc_mean_1 = c(NA, 3.22222222222222, NA, 
3.22222222222222, 3.22222222222222, 3.66666666666667, 3.66666666666667, 
3.66666666666667, 3.66666666666667, 3.25, NA, 3.25), bd_mean_1 = c(NA, 
2.88888888888889, NA, 2.88888888888889, 2.88888888888889, 2.75, 
2.75, 2.75, 2.75, 4.08333333333333, NA, 4.08333333333333), sm = c(999, 
4, 999, 3.66666666666667, 1, 4, 4, 5, 5, 5, 999, 5), cm = c(999, 
1.33333333333333, 999, 2.33333333333333, 1, 2, 2, 2.33333333333333, 
1, 3, 999, 1.66666666666667)), .Names = c("IDParticipant", "looporder", 
"pc_mean_1", "bd_mean_1", "sm", "cm"), row.names = c(NA, 12L), class = "data.frame")

datenrestrukturiert looks like this:

https://www.dropbox.com/s/al93lnj76y1j266/datenrestrukturiert.jpg?dl=0

I do not want to aggregate anything, which is why I tried adding fun.aggregate = NULL, but that changed nothing. I also always get the following message:

"Aggregation function missing: defaulting to length"

So far everything works, but there is one problem: when I use dcast (or cast), some values are changed, mostly to "0" or "1". The original values are numbers such as "3.44" or "4.77", but most of them turn into "0" once the cast is computed.

Anybody got any hints why this could be?

Some more information that might help: when I import the dataset via read.csv2, the first variable always gets a strange name, with extra symbols in front of the name shown in Excel: "ï..IDParticipant". I rename it to "IDParticipant". Could that have anything to do with it?

Another side note: when I run the code on the sample data frame I provided, everything is fine. The original data frame has 1404 rows and 353 variables. Could it be too big for R?

Do you ever have more than one value per variable combination? Can you share some example input and output? — A5C1D2H2I1M1N2O1R2T1, Aug 27 '15 at 09:03
Hi, welcome to SO. We cannot answer your question based on speculation; we need to know what your data looks like. Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Heroka, Aug 27 '15 at 09:03
What do you get as the result of `any(duplicated(daten[c("IDParticipant", "looporder")]))`? — A5C1D2H2I1M1N2O1R2T1, Aug 27 '15 at 09:31
@psytar, then you're going to have to add a secondary ID before you can proceed. — A5C1D2H2I1M1N2O1R2T1, Aug 27 '15 at 09:34
I tried with "IDTeam" as another ID, still the same result. Here's the code: `melteddata <- melt(daten, id=(c("IDParticipant", "IDTeam", "looporder")))` `datenrestrukturiert <- dcast(melteddata, IDParticipant + IDTeam ~ looporder+variable)` — psytar, Aug 27 '15 at 09:35
Related: [*dcast error: ‘Aggregation function missing: defaulting to length’*](https://stackoverflow.com/q/33051386/2204410) — Jaap, Aug 31 '18 at 05:42

score 0 · Answer 1 · answered Aug 27 '15 at 09:47

If you have duplicated combinations of your LHS and RHS variables, then you either need to (1) create a secondary level of IDs, or (2) perform some form of aggregation.
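To make the effect of duplicated combinations concrete, here is a minimal sketch on a small made-up data frame ("toy" and its values are hypothetical, not taken from the question). It shows that when dcast falls back to length, each cell holds a count of values rather than the values themselves, and that passing an explicit function such as mean aggregates instead.

```r
library(reshape2)

# Hypothetical toy data: participant 1 has two rows for looporder 1,
# and participant 2 has no row for looporder 2
toy <- data.frame(
  IDParticipant = c(1L, 1L, 1L, 2L),
  looporder     = c(1L, 1L, 2L, 1L),
  sm            = c(3.44, 4.77, 4.00, 5.00)
)
molten <- melt(toy, id.vars = c("IDParticipant", "looporder"))

# No aggregation function given: dcast defaults to length(),
# so each cell holds the number of values found for that combination
counts <- dcast(molten, IDParticipant ~ looporder + variable)
counts

# Explicit aggregation: the two values of participant 1 at looporder 1 are averaged
means <- dcast(molten, IDParticipant ~ looporder + variable,
               fun.aggregate = mean)
means
```

In "counts", a combination that occurs once shows 1, a missing combination shows 0, and a duplicated one shows 2, which would match the many "0" and "1" values described in the question.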

You can test for duplicates by using any(duplicated(...)).

Here's an example, using your existing sample of "daten" (which does not contain duplicates):

library(reshape2)

idvars <- c("IDParticipant", "looporder")
any(duplicated(daten[idvars]))
# [1] FALSE

melteddata <- melt(daten, id=idvars)
datenrestrukturiert <- dcast(melteddata, IDParticipant ~ looporder + variable)
datenrestrukturiert
#   IDParticipant 1_pc_mean_1 1_bd_mean_1 1_sm 1_cm 2_pc_mean_1 2_bd_mean_1 2_sm       2_cm 3_pc_mean_1
# 1             1          NA          NA  999  999    3.222222    2.888889    4   1.333333          NA
# 2             2          NA          NA   NA   NA    3.666667    2.750000    4   2.000000    3.666667
# 3             3        3.25    4.083333    5    3          NA          NA  999 999.000000    3.250000
#   3_bd_mean_1 3_sm       3_cm 5_pc_mean_1 5_bd_mean_1     5_sm     5_cm 6_pc_mean_1 6_bd_mean_1 6_sm
# 1          NA  999 999.000000    3.222222    2.888889 3.666667 2.333333    3.222222    2.888889    1
# 2    2.750000    4   2.000000    3.666667    2.750000 5.000000 2.333333    3.666667    2.750000    5
# 3    4.083333    5   1.666667          NA          NA       NA       NA          NA          NA   NA
#   6_cm
# 1    1
# 2    1
# 3   NA

However, since any(duplicated(...)) is giving you TRUE, you are likely to have something more similar to:

daten2 <- rbind(daten, daten[c(1, 4, 6), ])
any(duplicated(daten2[idvars]))
# [1] TRUE
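Once the check returns TRUE, it can help to look at the offending rows before deciding what to do with them. The following is a rough base R sketch on a made-up data frame ("toy" is hypothetical); it lists every row whose ID combination occurs more than once, and separately the rows that are identical in every column.

```r
# Hypothetical toy data: (1, 1) occurs twice with different values,
# (2, 1) occurs twice with identical values
toy <- data.frame(
  IDParticipant = c(1L, 1L, 1L, 2L, 2L),
  looporder     = c(1L, 1L, 2L, 1L, 1L),
  sm            = c(3.44, 4.77, 4.00, 5.00, 5.00)
)
idvars <- c("IDParticipant", "looporder")

# Flag every row whose ID combination occurs more than once
# (fromLast = TRUE also catches the first occurrence of each duplicate)
is_dup <- duplicated(toy[idvars]) | duplicated(toy[idvars], fromLast = TRUE)
dup_rows <- toy[is_dup, ]
dup_rows

# Rows that repeat an earlier row in every column, not only in the ID variables
exact_copies <- toy[duplicated(toy), ]
exact_copies
```

If all duplicates turn out to be exact copies, dropping them with unique() may already be enough; otherwise a secondary ID or an aggregation is needed.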

In this case, you can consider using getanID from my "splitstackshape" package to conveniently add a secondary "id" to your dataset.

library(splitstackshape)

melteddata2 <- melt(getanID(daten2, idvars), c(".id", idvars))

datenrestrukturiert2 <- dcast.data.table(
  melteddata2, .id + IDParticipant ~ looporder + variable)

datenrestrukturiert2
#    .id IDParticipant 1_pc_mean_1 1_bd_mean_1 1_sm 1_cm 2_pc_mean_1 2_bd_mean_1 2_sm
# 1:   1             1          NA          NA  999  999    3.222222    2.888889    4
# 2:   1             2          NA          NA   NA   NA    3.666667    2.750000    4
# 3:   1             3        3.25    4.083333    5    3          NA          NA  999
# 4:   2             1          NA          NA  999  999          NA          NA   NA
# 5:   2             2          NA          NA   NA   NA    3.666667    2.750000    4
#          2_cm 3_pc_mean_1 3_bd_mean_1 3_sm       3_cm 5_pc_mean_1 5_bd_mean_1     5_sm
# 1:   1.333333          NA          NA  999 999.000000    3.222222    2.888889 3.666667
# 2:   2.000000    3.666667    2.750000    4   2.000000    3.666667    2.750000 5.000000
# 3: 999.000000    3.250000    4.083333    5   1.666667          NA          NA       NA
# 4:         NA          NA          NA   NA         NA    3.222222    2.888889 3.666667
# 5:   2.000000          NA          NA   NA         NA          NA          NA       NA
#        5_cm 6_pc_mean_1 6_bd_mean_1 6_sm 6_cm
# 1: 2.333333    3.222222    2.888889    1    1
# 2: 2.333333    3.666667    2.750000    5    1
# 3:       NA          NA          NA   NA   NA
# 4: 2.333333          NA          NA   NA   NA
# 5:       NA          NA          NA   NA   NA
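If installing an extra package is not an option, a secondary ID can also be sketched in base R. The example below is an illustration under the assumption that a running number within each IDParticipant/looporder combination is all that is needed; "toy" is a hypothetical data frame, and the code is not claimed to reproduce getanID exactly.

```r
# Hypothetical toy data: participant 1 has two rows for looporder 1
toy <- data.frame(
  IDParticipant = c(1L, 1L, 1L, 2L),
  looporder     = c(1L, 1L, 2L, 1L),
  sm            = c(3.44, 4.77, 4.00, 5.00)
)

# Number the rows within each IDParticipant/looporder combination: 1, 2, ...
toy$.id <- ave(seq_len(nrow(toy)), toy$IDParticipant, toy$looporder,
               FUN = seq_along)
toy

# With .id included, every combination of ID variables is unique again
any(duplicated(toy[c(".id", "IDParticipant", "looporder")]))
```

The resulting ".id" column can then be used on the left-hand side of the dcast formula, in the same way as in the example above.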

Thank you very much for your answer. Unfortunately I do not fully understand what you mean. There should not be any duplicates, but as you said, the function to test this says there are. What do you mean by LHS and RHS variables? I am not familiar with these abbreviations. Thanks very much for your help! — psytar, Aug 27 '15 at 10:05
@psytar, left hand side and right hand side, in reference to the formulas. — A5C1D2H2I1M1N2O1R2T1, Aug 27 '15 at 10:07
Edit: it worked now somehow. I'll figure out how and then reply. — psytar, Aug 27 '15 at 12:30

score 0 · Answer 2 · answered Aug 27 '15 at 12:57

Here is my solution, based on Ananda's suggestions (thank you very much for that).

The data frame is "daten", which contains many variables, e.g. "IDParticipant", "looporder" and "sm".

First we need to create an object containing the ID variables for later use with the melt and cast functions:

idvars <- c("IDParticipant", "looporder")

As it turns out, the data frame contained duplicates, that is, rows with the same values in both "IDParticipant" and "looporder". We therefore need to add another ID variable to the data frame when melting it, which can be done with "getanID" from the splitstackshape package:

melteddata <- melt(getanID(daten, idvars), c(".id", idvars))

After adding the extra ID variable, we can finally cast the data frame, using the extra ID variable together with the other variables:

datenrestrukturiert <- dcast(melteddata, .id + IDParticipant ~ variable + looporder)
