0

I have habitat data with a column named habitat_size, this ranges from small, medium, large and missing, denoted as S,M,L,X, I want these values as columns so something like Small, Medium, Large, Missing, however, for its values it must take the values from the other habitats, and the column names for those habitats will be arranged into a separate column with each name matching its specific value.

I have tried:

 dcast(habitat_case, month + season + year + LOC_ID + GBW_NUM + OBS_DT ~ GARDEN_SIZE)

Using LARGE_PONDS as value column: use value.var to override.
  month season year    LOC_ID GBW_NUM     OBS_DT    L    M    S    X
1     8 Summer 2003 LOC569491   76550 2003-08-06 <NA>    X <NA> <NA>
2    11 Autumn 2003 LOC575706   76863 2003-11-06 <NA> <NA>    X <NA>
3    10 Autumn 2008 LOC575706   76863 2008-10-09 <NA> <NA>    X <NA>
4    11 Autumn 2003 LOC575712   76869 2003-11-06 <NA>    X <NA> <NA>
5     8 Summer 2008 LOC575712   76869 2008-08-08 <NA> <NA>    X <NA>
6    11 Autumn 2003 LOC591752   76998 2003-11-06 <NA>    X <NA> <NA>

However, it only uses one of the other columns as values, I expect something like this:

month season   year LOC_ID      GBW_NUM OBS_DT   Habitat_type    Large    Medium    Small    Missing
8       Summer 2003 LOC569491   76550 2003-08-06  lawn             A         A         A        X
11      Autumn 2003 LOC575706   76863 2003-11-06  garden_age       A         A         X        A
10      Autumn 2008 LOC575706   76863 2008-10-09  small_pond       1         0         0        1

Here is a reproducible code:

structure(list(month = c(8, 11, 10, 11, 8, 11, 3, 4, 3, 1, 10, 
12, 6, 6, 12, 10, 10, 11, 1, 1, 1, 1, 1, 1, 1), season = c("Summer", 
"Autumn", "Autumn", "Autumn", "Summer", "Autumn", "Spring", "Spring", 
"Spring", "Winter", "Autumn", "Winter", "Summer", "Summer", "Winter", 
"Autumn", "Autumn", "Autumn", "Winter", "Winter", "Winter", "Winter", 
"Winter", "Winter", "Winter"), year = c(2003, 2003, 2008, 2003, 
2008, 2003, 2005, 2005, 2020, 2008, 2010, 2013, 2012, 2018, 2012, 
2016, 2017, 2003, 1995, 1995, 1995, 2003, 1995, 1995, 1995), 
    LOC_ID = c("LOC569491", "LOC575706", "LOC575706", "LOC575712", 
    "LOC575712", "LOC591752", "LOC610239", "LOC614296", "LOC674315", 
    "LOC675786", "LOC794326", "LOC794326", "LOC973627", "LOC973627", 
    "LOC1354719486425", "LOC1314108", "LOC2174194", "LOC1300619774", 
    "LOC1300613011", "LOC1300612920", "LOC1300615617", "LOC1300609972", 
    "LOC1300611058", "LOC1300612277", "LOC1300605354"), GBW_NUM = c("76550", 
    "76863", "76863", "76869", "76869", "76998", "79154", "79846", 
    "85416", "85665", "89078", "89078", "90835", "90835", "97926", 
    "93406", "26517", "72370", "65158", "65056", "68017", "61805", 
    "63019", "64369", "54273"), OBS_DT = structure(c(12270, 12362, 
    14161, 12362, 14099, 12362, 12856, 12903, 18352, 13890, 14889, 
    16040, 15502, 17693, 15679, 17100, 17443, 12362, 9131, 9131, 
    9131, 12078, 9131, 9131, 9131), class = "Date"), GARDEN_SIZE = c("M", 
    "S", "S", "M", "S", "M", "L", "S", "M", "S", "M", "M", "L", 
    "L", "S", "M", "M", "M", "M", "S", "M", "L", "M", "L", "L"
    ), GARDEN_AGE = c("A", "B", "C", "D", "D", "D", "B", "C", 
    "D", "C", "D", "C", "E", "E", "B", "B", "D", "A", "E", "B", 
    "D", "D", "D", "E", "E"), SMALL_WATER_BODY = c("X", "X", 
    "X", "X", "X", "X", "X", "Q", "X", "X", "X", "X", "Q", "Q", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X"), LARGE_WATER_BODY = c("X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X"
    ), STREAM = c("X", "X", "X", "X", "X", "X", "X", "S", "S", 
    "X", "X", "X", "S", "S", "X", "X", "X", "S", "X", "X", "X", 
    "X", "X", "X", "X"), RIVER = c("X", "X", "X", "X", "X", "X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", 
    "X", "X", "X", "X", "X", "X", "X"), CANAL = c("X", "X", "X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X"), SEASHORE = c("X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X"
    ), SMALL_PONDS = c("X", "X", "X", "X", "X", "X", "X", "X", 
    "0", "X", "X", "X", "X", "2", "X", "X", "X", "X", "X", "X", 
    "X", "X", "X", "X", "X"), MEDIUM_PONDS = c("X", "X", "X", 
    "X", "X", "X", "X", "X", "0", "X", "X", "X", "X", "X", "X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X"), LARGE_PONDS = c("X", 
    "X", "X", "X", "X", "X", "X", "X", "0", "X", "X", "X", "X", 
    "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", "X"
    )), row.names = c(NA, 25L), class = "data.frame")
Lime
  • 738
  • 5
  • 17

1 Answers1

1

You can try to get data in long format, recode it and get it back in wide format. Using dplyr and tidyr you can do this as :

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = GARDEN_AGE:LARGE_PONDS) %>%
  mutate(GARDEN_SIZE = recode(GARDEN_SIZE, `L` = 'Large', `M` = 'Medium', 
                              `S` = 'Small', `X` = 'Missing')) %>%
  pivot_wider(names_from = GARDEN_SIZE, values_from = value)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • That works brilliantly, however, may you explain your reasoning as to why you decided to convert the data into long format then back into wide, as opposed to skipping the long format process and going straight to wide? – Lime Oct 10 '20 at 12:05
  • 1
    If we go straight to wide format it creates new columns for each combination of column value and column names creating too many new columns. Try this code : `df %>%mutate(GARDEN_SIZE = recode(GARDEN_SIZE, \`L\` = 'Large',\`M\` = 'Medium', \`S\` = 'Small', \`X\` = 'Missing')) %>% pivot_wider(names_from = GARDEN_SIZE, values_from = c(GARDEN_AGE:LARGE_PONDS))` – Ronak Shah Oct 10 '20 at 12:09
  • I see now, I will keep this in mind for future coding; I find this logic very helpful! – Lime Oct 10 '20 at 12:21
  • Having used this code, when I try to fit a GAM model using `Medium` as a covariate, I get `invalid type (list) for variable 'Medium'`, do you know a work-around this? Or should I ask this as a separate question? – Lime Oct 10 '20 at 15:43
  • Hmmm...I am not sure what could be wrong. However, Did you assign the above dataframe to a new variable? `df1 <- df %>% pivot_longer.....` – Ronak Shah Oct 10 '20 at 15:47
  • Yes I did, when I do this however, It comes with this error `Warning message: Values are not uniquely identified; output will contain list-cols. * Use `values_fn = list` to suppress this warning. * Use `values_fn = length` to identify where the duplicates arise * Use `values_fn = {summary_fun}` to summarise duplicates ` Which I presume is affecting the code when using GAM? – Lime Oct 10 '20 at 15:49
  • Not sure...yeah better to ask that as new question. The error indicates that you don't have a unique identifier values to get data in wide format see this post for possible ways to handle it https://stackoverflow.com/questions/58837773/pivot-wider-issue-values-in-values-from-are-not-uniquely-identified-output-w – Ronak Shah Oct 10 '20 at 15:56
  • I have noticed that instead of filling missing values as `NA`, which is what happened when it first worked, it filled them with `NULL`, maybe changing these Null values may work. Having used `str()` it shows that the new columns are lists? as opposed to being character, so using `as.character()` does the trick. – Lime Oct 10 '20 at 17:00