I have a problem with a dataframe that I need to reshape.

I have this command:

library(tidyverse)
df1 = df1 %>% gather(Day, value, Day01:Day31) %>% spread(Station, value)

And I get this error:

Error: Duplicate identifiers for rows (130933, 131029), (389113, 389209), (647293, 647389), (905473, 905569), (1163653, 1163749), (1421833, 1421929), (1680013, 1680109), (1938193, 1938289), (2196373, 2196469), (2454553, 2454649), (2712733, 2712829), (2970913, 2971009), (3229093, 3229189), (3487273, 3487369), (3745453, 3745549), (4003633, 4003729), (4261813, 4261909), (4519993, 4520089), (4778173, 4778269), (5036353, 5036449), (5294533, 5294629), (5552713, 5552809), (5810893, 5810989), (6069073, 6069169), (6327253, 6327349), (6585433, 6585529), (6843613, 6843709), (7101793, 7101889), (7359973, 7360069), (7618153, 7618249), (7876333, 7876429), (130934, 131030), (389114, 389210), (647294, 647390), (905474, 905570), (1163654, 1163750), (1421834, 1421930), (1680014, 1680110), (1938194, 1938290), (2196374, 2196470), (2454554, 2454650), (2712734, 2712830), (2970914, 2971010), (3229094, 3229190), (3487274, 3487370), (3745454, 3745550), (4003634, 4003730), (4261814, 4261910), (4519994, 4520090

The strange thing is that I also get these results:

library(tibble)  # rownames_to_column() comes from tibble, not dplyr
test = rownames_to_column(df1, "VALUE")
length(unique(test$VALUE))               # 258180, the same as the number of rows
length(unique(test$VALUE)) == nrow(test) # TRUE

As you can see, the error message also refers to row numbers that do not even exist in my dataframe.

The command works fine on all my other dataframes, which have exactly the same structure. They only have fewer rows.

I don't know how to provide the dataframe since it's so large, so I uploaded it to my university's file server, where you can download it.

Here is the link (I hope it's allowed to post it like this):

https://megastore.uni-augsburg.de/get/pmAS15z6TN/

Essi
  • You should edit to post a subset of your data that reproduces the issue, preferably by posting the results of calling `dput` on it. – alistaire Dec 19 '17 at 01:35

1 Answer

This ought to work. As a comment noted, the error occurs because spread tries to combine rows that are no longer uniquely identified after the gather. rowid_to_column adds a column of sequential row ids, which makes every row unique again. The row numbers in the error message are larger than the number of rows in your original dataset because they refer to the gathered data frame, which has 258180 * 31 = 8003580 rows.

data2 <- data %>%
    gather(Day, value, Day01:Day31) %>%
    tibble::rowid_to_column() %>%
    spread(Station, value)

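To make the mechanism concrete, here is a minimal sketch on made-up data. The toy data frame and its Year and Month columns are assumptions for illustration only; I don't know which identifier columns your real data has. Station "A" appears twice for the same Year/Month, so after the gather two rows share the same identifiers and spread refuses to combine them:

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: station "A" occurs twice for the same Year/Month.
toy <- tibble(
  Station = c("A", "A", "B"),
  Year    = c(2017, 2017, 2017),
  Month   = c(1, 1, 1),
  Day01   = c(1.0, 1.5, 2.0),
  Day02   = c(3.0, 3.5, 4.0)
)

# Long format: one row per Station/Year/Month/Day.
long <- gather(toy, Day, value, Day01:Day02)

# spread() errors because two rows share Year/Month/Day for Station "A".
failed <- inherits(try(spread(long, Station, value), silent = TRUE), "try-error")

# With a row id every row is unique, so spread() succeeds,
# but the result keeps one row per gathered row (mostly NA).
wide <- spread(rowid_to_column(long), Station, value)
```

Note that the result has as many rows as the gathered data, which is probably why this approach is heavy on memory for a frame of your size.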
I ran into memory issues trying to actually do this on my laptop though.
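If memory stays a problem, it may be worth finding out which identifier combinations are duplicated in the original data before reshaping at all. The following is only a sketch on made-up data: Year and Month are hypothetical identifier columns, so substitute whatever columns your data frame has besides Station and the Day columns.

```r
library(dplyr)

# Hypothetical toy data with one duplicated Station/Year/Month combination.
toy <- tibble::tibble(
  Station = c("A", "A", "B"),
  Year    = c(2017, 2017, 2017),
  Month   = c(1, 1, 1),
  Day01   = c(1.0, 1.5, 2.0)
)

# Count rows per identifier combination and keep those that occur more than once.
dupes <- toy %>%
  count(Station, Year, Month) %>%
  filter(n > 1)
```

Here dupes contains a single row for station "A". On the real data, rows listed this way are the ones that spread cannot tell apart.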

Calum You
  • Thanks a lot! I tested it and I also have memory issues. I only have a laptop too, and until now it was still able to handle it. I will try it on a desktop now. But at least I am not getting the same error as before. – Essi Dec 19 '17 at 16:51