1

I am an R newbie. Trying to tidy a dataset, a portion of which is below (the actual dataset has 10000 rows and columns).I am trying to tidy it by gathering them in pairs.

id     start      event1    event2  event2  date1       date2       date2
 1     06/07/2011   A       B       C       06/07/2011  06/07/2011  06/07/2011           
 1     06/07/2011                           NA          NA          NA
 1     06/07/2011   -                       NA          NA          NA
 2     15/07/2011   D       E       A       18/07/2011  18/07/2011  16/07/2011
 3     15/07/2011   D       C       H       19/07/2011  19/07/2011  14/08/2011
 4     22/08/2011   G                       04/09/2011  NA          NA
 4     22/08/2011   -                       NA          NA          NA

What I want to achieve is this:

start        event_date   event   
06/07/2011   06/07/2011   A
06/07/2011   06/07/2011   B
06/07/2011   06/07/2011   C
15/07/2011   18/07/2011   D

And so on, converting to the long format with preserving the temporal link between dates and events and removing all 'non-events'.

Rizwan
  • 27
  • 5
  • This will be much easier for people to answer if you include a reproducible form of your data (i.e., via `dput`). See here for details [How to make a great R reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Conor Neilson Feb 16 '18 at 00:46
  • I don't understand your expected outcome. What happened to all the other entries? For example, why is there no entry for `start=22/08/2011`? Why only one entry for `start=15/07/2011` instead of 6? Is there expected outcome incomplete? – Maurits Evers Feb 16 '18 at 00:59
  • I don't understand why you have two columns named `event2`, and two more columns named `date2`. Should the names be `id start event1 event2 event3 date1 date2 date3`? If not, what do the duplicated column names signify? – Ista Feb 16 '18 at 01:16
  • I also don't understand why `id` is duplicated. – Ista Feb 16 '18 at 01:40

4 Answers4

0

I don't understand your expected outcome. Assuming it is incomplete, is this what you're after?

require(tidyverse);
df %>%
    filter(event1 != "-" & event1 != "") %>%
    group_by(start) %>%
    unite(tmp1, date1, event1) %>%
    unite(tmp2, date2, event2) %>%
    unite(tmp3, date3, event3) %>%
    gather(id, tmp, 3:5) %>%
    separate(tmp, c("event_date", "event"), sep = "_") %>%
    select(start, event_date, event) %>%
    filter(event_date != "NA") %>%
    ungroup() %>%
    arrange(start, event_date);
    ## A tibble: 10 x 3
    #   start      event_date event
    #   <chr>      <chr>      <chr>
    # 1 06/07/2011 06/07/2011 A
    # 2 06/07/2011 06/07/2011 B
    # 3 06/07/2011 06/07/2011 C
    # 4 15/07/2011 14/08/2011 H
    # 5 15/07/2011 16/07/2011 A
    # 6 15/07/2011 18/07/2011 D
    # 7 15/07/2011 18/07/2011 E
    # 8 15/07/2011 19/07/2011 D
    # 9 15/07/2011 19/07/2011 C
    #10 22/08/2011 04/09/2011 G

Explanation: Remove rows where event1 is empty or "-". Group by start and concatenate columns date1,event1 and so on. Convert to long table and separate concatenated entries into event_date and event. Clean to make consistent with expected outcome.


Sample data

require(tidyverse);
df <- read_table(
    "id     start      event1    event2  event3  date1       date2       date3
 1     06/07/2011   A       B       C       06/07/2011  06/07/2011  06/07/2011
 1     06/07/2011                           NA          NA          NA
 1     06/07/2011   -                       NA          NA          NA
 2     15/07/2011   D       E       A       18/07/2011  18/07/2011  16/07/2011
 3     15/07/2011   D       C       H       19/07/2011  19/07/2011  14/08/2011
 4     22/08/2011   G                       04/09/2011  NA          NA
 4     22/08/2011   -                       NA          NA          NA")
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
0

We can use two gather calls, manipulate the columns and filter out columns with NA or empty string, order the columns by id and finally select the right columns.

library(dplyr)
library(tidyr)

dat2 <- dat %>%
  gather(label, event, starts_with("event")) %>%
  gather(date, event_date, starts_with("date")) %>%
  mutate_at(vars(label, date), funs(sub("[A-Za-z]*", "", .))) %>%
  filter(label == date, !is.na(event_date), !event %in% "") %>%
  arrange(id) %>%
  select(start, event_date, event)
dat2
#         start event_date event
# 1  06/07/2011 06/07/2011     A
# 2  06/07/2011 06/07/2011     B
# 3  06/07/2011 06/07/2011     C
# 4  15/07/2011 18/07/2011     D
# 5  15/07/2011 18/07/2011     E
# 6  15/07/2011 16/07/2011     A
# 7  15/07/2011 19/07/2011     D
# 8  15/07/2011 19/07/2011     C
# 9  15/07/2011 14/08/2011     H
# 10 22/08/2011 04/09/2011     G

DATA

dat <- read.table(text = "id     start      event1    event2  event2  date1       date2       date2
 1     '06/07/2011'   A       B       C       '06/07/2011'  '06/07/2011'  '06/07/2011'           
                  1     '06/07/2011'   ''      ''     ''       NA          NA          NA
                  1     '06/07/2011'   -       ''     ''          NA          NA          NA
                  2     '15/07/2011'   D       E       A       '18/07/2011'  '18/07/2011'  '16/07/2011'
                  3     '15/07/2011'   D       C       H       '19/07/2011'  '19/07/2011'  '14/08/2011'
                  4     '22/08/2011'   G       ''     ''           '04/09/2011'  NA          NA
                  4     '22/08/2011'   -       ''     ''        NA          NA          NA",
                  header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84
0

A common pattern when working with data in (roughly) this form is 1) gather all the repeating columns 2) separate the variable names from the time indicators, and 3) spread the data putting the variables back in the columns.

There are some non-standard aspects to your example, so I start by assuming that you actually have three repeated measurements, and by creating a unique id variable.

dat <- read.table(text = "id     start      event1    event2  event3  date1       date2       date3
1     '06/07/2011'   A       B       C       '06/07/2011'  '06/07/2011'  '06/07/2011'           
1     '06/07/2011'   ''      ''     ''       NA          NA          NA
1     '06/07/2011'   -       ''     ''          NA          NA          NA
2     '15/07/2011'   D       E       A       '18/07/2011'  '18/07/2011'  '16/07/2011'
3     '15/07/2011'   D       C       H       '19/07/2011'  '19/07/2011'  '14/08/2011'
4     '22/08/2011'   G       ''     ''           '04/09/2011'  NA          NA
4     '22/08/2011'   -       ''     ''        NA          NA          NA",
 header = TRUE, stringsAsFactors = FALSE, na = c("", "NA"))

dat$rowid <- 1:nrow(dat)
names(dat) <- gsub("([a-z])([0-9])", "\\1_\\2", names(dat))
dat
##   id      start event_1 event_2 event_3     date_1     date_2     date_3 rowid
## 1  1 06/07/2011       A       B       C 06/07/2011 06/07/2011 06/07/2011     1
## 2  1 06/07/2011    <NA>    <NA>    <NA>       <NA>       <NA>       <NA>     2
## 3  1 06/07/2011       -    <NA>    <NA>       <NA>       <NA>       <NA>     3
## 4  2 15/07/2011       D       E       A 18/07/2011 18/07/2011 16/07/2011     4
## 5  3 15/07/2011       D       C       H 19/07/2011 19/07/2011 14/08/2011     5
## 6  4 22/08/2011       G    <NA>    <NA> 04/09/2011       <NA>       <NA>     6
## 7  4 22/08/2011       -    <NA>    <NA>       <NA>       <NA>       <NA>     7

From here the process can proceed as usual, following the steps listed above:

library(tidyr)

dat <- gather(dat,
              key = "var",
              value = "value",
              -id, -rowid, -start)
dat <- separate(dat, var, into = c("var", "which"), sep = "_")
dat <- spread(dat, key = var, value = value)

Some final cleanup and we're done:

dat <- na.omit(dat)[ , setdiff(names(dat), c("rowid", "which"))]
dat
##    id      start       date event
## 1   1 06/07/2011 06/07/2011     A
## 2   1 06/07/2011 06/07/2011     B
## 3   1 06/07/2011 06/07/2011     C
## 10  2 15/07/2011 18/07/2011     D
## 11  2 15/07/2011 18/07/2011     E
## 12  2 15/07/2011 16/07/2011     A
## 13  3 15/07/2011 19/07/2011     D
## 14  3 15/07/2011 19/07/2011     C
## 15  3 15/07/2011 14/08/2011     H
## 16  4 22/08/2011 04/09/2011     G
Ista
  • 10,139
  • 2
  • 37
  • 38
0

The melt() function from data.table is able to reshape multiple measure columns at once. So, the exercise becomes a "three-liner":

library(data.table)
melt(setDT(DF), id.vars = c("id", "start"), na.rm = TRUE,
     measure.vars = patterns(event_date = "^date", event = "^event"))[order(id)]
    id      start variable event_date event
 1:  1 06/07/2011        1 06/07/2011     A
 2:  1 06/07/2011        2 06/07/2011     B
 3:  1 06/07/2011        3 06/07/2011     C
 4:  2 15/07/2011        1 18/07/2011     D
 5:  2 15/07/2011        2 18/07/2011     E
 6:  2 15/07/2011        3 16/07/2011     A
 7:  3 15/07/2011        1 19/07/2011     D
 8:  3 15/07/2011        2 19/07/2011     C
 9:  3 15/07/2011        3 14/08/2011     H
10:  4 22/08/2011        1 04/09/2011     G

To meet OP's expected result, we can remove the unwanted columns:

melt(setDT(DF), id.vars = c("id", "start"), na.rm = TRUE,
     measure.vars = patterns(event_date = "^date", event = "^event"))[
       order(id)][, c("id", "variable") := NULL][]
         start event_date event
 1: 06/07/2011 06/07/2011     A
 2: 06/07/2011 06/07/2011     B
 3: 06/07/2011 06/07/2011     C
 4: 15/07/2011 18/07/2011     D
 5: 15/07/2011 18/07/2011     E
 6: 15/07/2011 16/07/2011     A
 7: 15/07/2011 19/07/2011     D
 8: 15/07/2011 19/07/2011     C
 9: 15/07/2011 14/08/2011     H
10: 22/08/2011 04/09/2011     G

Data

DF <- readr::read_table(
"id     start      event1    event2  event2  date1       date2       date2
 1     06/07/2011   A       B       C       06/07/2011  06/07/2011  06/07/2011           
 1     06/07/2011                           NA          NA          NA
 1     06/07/2011   -                       NA          NA          NA
 2     15/07/2011   D       E       A       18/07/2011  18/07/2011  16/07/2011
 3     15/07/2011   D       C       H       19/07/2011  19/07/2011  14/08/2011
 4     22/08/2011   G                       04/09/2011  NA          NA
 4     22/08/2011   -                       NA          NA          NA"
)
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134