Transforming a dataframe from long to wide with Duplicate identifiers in R

Question

I have had this topic already 3 or 4 times and I thought I found a solution, but I did not. I have big problems with converting dataframes like that (one example of many more):

https://megastore.uni-augsburg.de/get/TXLoameX7G/ (I hope its ok to provide the dataframe via my university website)

The original dataframe looks like the one on the left side and I want to have it look like on the right side:

I have this code, that works fine for most of my dataframes (the real dataframes have 31 days, not only 3).

library(tidyverse)
trans_df= df %>% gather(Day, value, Day01:Day31) %>% spread(Station, value)

But for some reason it doesnt work for all of my dataframes. Some are showing this error (like the one I uploded in the link):

Error: Duplicate identifiers for rows (2893, 2905), (19333, 19345), (35773, 35785), (52213, 52225), (68653, 68665), (85093, 85105), (101533, 101545), (117973, 117985), (134413, 134425), (150853, 150865), (167293, 167305), (183733, 183745), (200173, 200185), (216613, 216625), (233053, 233065), (249493, 249505), (265933, 265945), (282373, 282385), (298813, 298825), (315253, 315265), (331693, 331705), (348133, 348145), (364573, 364585), (381013, 381025), (397453, 397465), (413893, 413905), (430333, 430345), (446773, 446785), (463213, 463225), (479653, 479665), (496093, 496105), (2894, 2906), (19334, 19346), (35774, 35786), (52214, 52226), (68654, 68666), (85094, 85106), (101534, 101546), (117974, 117986), (134414, 134426), (150854, 150866), (167294, 167306), (183734, 183746), (200174, 200186), (216614, 216626), (233054, 233066), (249494, 249506), (265934, 265946), (282374, 282386), (298814, 298826), (315254, 315266), (331694, 331706), (348134, 348146), (364574, 364586), (381014, 381026),

I already asked here how to solve this problem: R - Wrong error message - Error: Duplicate identifiers for rows

I got an answer to do this:

data2 <- data %>%
gather(Day, value, Day01:Day31) %>%
tibble::rowid_to_column() %>%
spread(Station, value)

First I thought it is working, because I dont get the Duplicate identifiers Error anymore, but the file sizes are getting huge and it seems to duplicate each line 4 times!

Any idea how to finally solve this problem?

score 2 · Accepted Answer · answered Jan 16 '18 at 20:31

If you check the original data frame, there are duplicates with respect to Station and Day:

df.summary <- group_by(df, Station, Date) %>% count()
df.summary[which(df.summary$n > 1), ]

# A tibble: 396 x 3
# Groups:   Station, Date [396]
   Station       Date     n
    <fctr>     <fctr> <int>
 1 DEBW001 2001-01-01     2
 2 DEBW001 2001-02-01     2
 3 DEBW001 2001-03-01     2
 4 DEBW001 2001-04-01     2
 5 DEBW001 2001-05-01     2
 6 DEBW001 2001-06-01     2
 7 DEBW001 2001-07-01     2
 8 DEBW001 2001-08-01     2
 9 DEBW001 2001-09-01     2
10 DEBW001 2001-10-01     2
# ... with 386 more rows

It depends on how you want to treat these duplicates. Suppose you want to take the mean of the duplicated values:

df2 <- reshape2::melt(df, id.vars=c("Station", "Date"), variable.name="Day")
df3 <- reshape2::dcast(df2, Date+Day~Station, value.var="value", fun.aggregate=mean)

The resulting data frame looks like this:

df3[1:10, 1:10]
         Date   Day AT0ACH1 AT0ENK1 AT0ILL1 AT0PIL1 AT0SIG1 AT0SON1 AT0STO1 AT0VOR1
1  2001-01-01 Day01  53.696  44.727  47.826  40.955  85.500  94.455  64.739  62.455
2  2001-01-01 Day02  42.048  28.609  39.435  42.435  78.000  89.261      NA  71.348
3  2001-01-01 Day03  38.565  28.957  19.522  28.304  72.500  88.625      NA  47.130
4  2001-01-01 Day04  39.304  23.739  16.522  20.870  85.625  95.870      NA  52.913
5  2001-01-01 Day05  67.375  29.864  22.421  21.174  82.875  93.087      NA  61.652
6  2001-01-01 Day06  58.478  32.478  28.708  26.870  67.043  79.391      NA  55.957
7  2001-01-01 Day07  49.652  21.217  29.042  48.609  55.870  76.174      NA  52.435
8  2001-01-01 Day08  48.217  16.739  27.591  41.217  59.522  79.435      NA  55.696
9  2001-01-01 Day09  52.000  30.391  44.542  46.783  67.609  82.583      NA  54.455
10 2001-01-01 Day10  37.087  33.174  28.522  30.182  80.750  94.478      NA  52.818

But I dont get why those are duplicates? Its the same station, yes, but different days. That happens in all of my dataframes and it doesnt make problems there. I can send you a working one, to check is there are any differences. — Essi, Jan 16 '18 at 20:53
https://megastore.uni-augsburg.de/get/Y0H4vIHar2/ Here is a working one and I tested it by myself, you are right. In the working dataframe I am getting this result: > df.summary <- group_by(ozon, Station, Date) %>% count() > df.summary[which(df.summary$n > 1), ] # A tibble: 0 x 3 # Groups: Station, Date [0] # ... with 3 variables: Station , Date , n — Essi, Jan 16 '18 at 20:57
I still dont get in which way they are duplicates. You mean the n-column? It seems like they are just the same line two times right? — Essi, Jan 16 '18 at 21:03
They are duplicates in the sense that there were two lines in `df` that had the same Station AND Date. If you look at rows 3097 and 3109, both the Station name and Date are the same: df[c(3097, 3109), ] — LucyMLi, Jan 17 '18 at 00:34

Transforming a dataframe from long to wide with Duplicate identifiers in R

1 Answers1