1

My data looks like this:

df <- read.table(header = T, text =
        "GeneID    Gene_Name   Species    Paralogues    Domains   Functional_Diversity
         1234      DDR1        hsapiens   14            2         8.597482
         5678      CSNK1E      celegans   70            4         8.154788
         9104      FGF1        Chicken    3             0         5.455874
         4575      FGF1        hsapiens   4             6         6.745845")

I need it to look like:

   Gene_Name    hsapiens    celegans    ggalus
   DDR1         8.597482    NA          NA
   CSNK1E       NA          8.154788    NA
   FGF1         6.745845    NA          5.455874

I've tried using:

library(tidyverse)
df %>% 
    select(Gene_Name, Species, Functional_Diversity) %>% 
    spread(Species, Functional_Diversity)

My actual data consists of 130,000 rows (many Gene Names approx 14,000 unique), consisting of 9 species.

When I apply this method to my actual data I get:

Error: Duplicate identifiers for rows (16691, 19988), (20938, 21033), (1232, 21150), (2763, 21465), (1911, 20844), (17274, 17657, 18293, 18652, 18726, 19006, 19025), (496, 22555), (17227, 17608, 18211, 18605, 18676, 18967, 19002), (13569, 21807), (10261, 21014, 21607), (20816, 21553), (2244, 22025), (6194, 21910), (12217, 21555), (2936, 21078), (16484, 20911), (12216, 21851), (9289, 21791), (10340, 21752), (1714, 22077), (13216, 22618), (6076, 22371), (14731, 21717), (160, 22472), (11553, 22635), (17183, 17583, 18510, 18608, 18661, 18896, 19108), (138, 20028), (17185, 17584, 18330, 18415, 18500, 18981, 19063), (9726, 22440), (17238, 17617, 18905, 18960, 18996, 19134), (1638, 21645), (4631, 20821), (9162, 22463), (319, 20900), (13600, 22227), (9312, 20011), (14825, 21711, 21764), (3381, 21134), (505, 21133), (5954, 20013), (5948, 21313), (17233, 17612, 18187, 18311, 18411, 18708, 18980), (16953, 20902, 21845), (20710, 22477), (20519, 20973), (10204, 21197, 21213), (2933, 20707), (4302,
CJ Yetman
  • 8,373
  • 2
  • 24
  • 56
Jack Dean
  • 163
  • 1
  • 7
  • 2
    This is one of the complications of `spread`: there has to be 1 or 0 numbers for each spot in the result. The solution is to add an index column, but what that should look like requires thought. [Here are some examples](https://stackoverflow.com/questions/45898614/how-to-spread-columns-with-duplicate-identifiers/45898919#45898919). – alistaire May 09 '18 at 18:33
  • Possible duplicate of [Spread with duplicate identifiers for rows](https://stackoverflow.com/questions/48383605/spread-with-duplicate-identifiers-for-rows) – camille May 09 '18 at 18:33
  • You used the wrong `df` and there was a typo in `Funtional_Diversity`. Should have been `ddf` and `Functional_Diversity`. It works after fixing those – Tung May 09 '18 at 18:35
  • Related / possible duplicate: [*Transpose / reshape dataframe without “timevar” from long to wide format*](https://stackoverflow.com/q/11322801/2204410) – Jaap Jul 04 '19 at 16:29

1 Answers1

1

To see just the rows that have "Duplicate identifiers", you could use...

df %>% 
  group_by(Gene_Name, Species) %>% 
  mutate(n = n()) %>% 
  filter(n > 1)

To ensure the spread works, even if you have rows with duplicate identifiers, you can add a row number column which will guarantee that each row is unique...

df %>% 
  select(Gene_Name, Species, Functional_Diversity) %>% 
  mutate(row = row_number()) %>% 
  spread(Species, Functional_Diversity)
CJ Yetman
  • 8,373
  • 2
  • 24
  • 56