-4

Basically, my data is grouped by days with an inconsistent number of rows in between:

16-Oct-16
Name1
Name2
Name3
17-Oct-16
Name1
Name2
Name3
Name4
Name5
19-Oct-16

etc.

I need to be able to grab the group data and apply it to the child records. The expected result should look as follows:

Name1   16-Oct-16
Name2   16-Oct-16
Name3   16-Oct-16
Name1   17-Oct-16
Name2   17-Oct-16
Name3   17-Oct-16
Name4   17-Oct-16
Name5   17-Oct-16

I'm using data.table but currently I can't think of any way other than a loop.

The following script generates the kind of dataset I'm looking at:

data.table(c('October 16, 2016', paste0('Name',1:4),
             'October 17, 2016', paste0('Name',1:12),
             'October 20, 2016', paste0('Name',1:2),
             'October 25, 2016', paste0('Name',1:6)))

I just want to copy the appropriate date field to each name row and end up with a tidy dataset where each row has name and date.

Uwe
  • 41,420
  • 11
  • 90
  • 134
Trent Baur
  • 95
  • 1
  • 9
  • How do you get this data at first in real worlds ? This sounds like àwk`could be the perfect tool to arrange your dataset before loading it into R – Tensibai Oct 31 '16 at 14:40
  • Could you please be more precise in your example data and what your expected outcome is? For guielines on how to improve your question, see [ask] and [how to give a reproducible example](http://stackoverflow.com/questions/5963269) – Jaap Oct 31 '16 at 15:17
  • a [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) example would help. – Henk Oct 31 '16 at 15:18
  • The sample data you are showing and the script to generate them should be consistent. Now, they are using two different date formats. – Uwe Oct 31 '16 at 16:10

2 Answers2

2

There is a data.table solution I have used in similar situations. (I've tested with data.table version 1.9.7. but it should work also with the CRAN version 1.9.6)

Read Data

library(data.table)

dt <- fread("16-Oct-16
            Name1
            Name2
            Name3
            17-Oct-16
            Name1
            Name2
            Name3
            Name4
            Name5
            19-Oct-16",
            header = FALSE)
print(dt)
           V1
 1: 16-Oct-16
 2:     Name1
 3:     Name2
 4:     Name3
 5: 17-Oct-16
 6:     Name1
 7:     Name2
 8:     Name3
 9:     Name4
10:     Name5
11: 19-Oct-16

Last observation carried forward

The group headers are dates. So I used the dmy function of the lubridate package to coerce them to date format where possible. For lines which aren't formatted as dates dmy creates NA. Using na.locf from package zoo the last non-NA observation is carried forward. Both operations can be combined in a one-liner:

dt[, day := zoo::na.locf(lubridate::dmy(V1))]
print(dt)
           V1        day
 1: 16-Oct-16 2016-10-16
 2:     Name1 2016-10-16
 3:     Name2 2016-10-16
 4:     Name3 2016-10-16
 5: 17-Oct-16 2016-10-17
 6:     Name1 2016-10-17
 7:     Name2 2016-10-17
 8:     Name3 2016-10-17
 9:     Name4 2016-10-17
10:     Name5 2016-10-17
11: 19-Oct-16 2016-10-19

(I guess there is a faster version using rolling joins instead of na.locf.)

Remove group header lines

To remove the group headers we need to keep a temporary column.

dt[, tmp := lubridate::dmy(V1)][, day := zoo::na.locf(tmp)]
print(dt)
           V1        tmp        day
 1: 16-Oct-16 2016-10-16 2016-10-16
 2:     Name1       <NA> 2016-10-16
 3:     Name2       <NA> 2016-10-16
 4:     Name3       <NA> 2016-10-16
 5: 17-Oct-16 2016-10-17 2016-10-17
 6:     Name1       <NA> 2016-10-17
 7:     Name2       <NA> 2016-10-17
 8:     Name3       <NA> 2016-10-17
 9:     Name4       <NA> 2016-10-17
10:     Name5       <NA> 2016-10-17
11: 19-Oct-16 2016-10-19 2016-10-19

dt <- dt[is.na(tmp)]
print(dt)
      V1  tmp        day
1: Name1 <NA> 2016-10-16
2: Name2 <NA> 2016-10-16
3: Name3 <NA> 2016-10-16
4: Name1 <NA> 2016-10-17
5: Name2 <NA> 2016-10-17
6: Name3 <NA> 2016-10-17
7: Name4 <NA> 2016-10-17
8: Name5 <NA> 2016-10-17

dt[, tmp := NULL]
print(dt)
      V1        day
1: Name1 2016-10-16
2: Name2 2016-10-16
3: Name3 2016-10-16
4: Name1 2016-10-17
5: Name2 2016-10-17
6: Name3 2016-10-17
7: Name4 2016-10-17
8: Name5 2016-10-17
Uwe
  • 41,420
  • 11
  • 90
  • 134
1

Another option would be to use a regex pattern. For the first example dataset:

library(data.table)
library(zoo)
dt1[grep('([0-9]{1,2})-([A-Za-z]+)-(\\d{2})', V1), V2 := V1
    ][, V2 := na.locf(V2)][V1!=V2]

which gives:

      V1        V2
1: Name1 16-Oct-16
2: Name2 16-Oct-16
3: Name3 16-Oct-16
4: Name1 17-Oct-16
5: Name2 17-Oct-16
6: Name3 17-Oct-16
7: Name4 17-Oct-16
8: Name5 17-Oct-16

For the second dataset, you can use:

dt2[grep('([A-Za-z]+ )([0-9]{1,2}[,] )(\\d{4})', V1), V2 := V1
    ][, V2 := na.locf(V2)][V1!=V2]

which gives:

        V1               V2
 1:  Name1 October 16, 2016
 2:  Name2 October 16, 2016
 3:  Name3 October 16, 2016
 4:  Name4 October 16, 2016
 5:  Name1 October 17, 2016
 6:  Name2 October 17, 2016
 7:  Name3 October 17, 2016
 8:  Name4 October 17, 2016
 9:  Name5 October 17, 2016
10:  Name6 October 17, 2016
11:  Name7 October 17, 2016
12:  Name8 October 17, 2016
13:  Name9 October 17, 2016
14: Name10 October 17, 2016
15: Name11 October 17, 2016
16: Name12 October 17, 2016
17:  Name1 October 20, 2016
18:  Name2 October 20, 2016
19:  Name1 October 25, 2016
20:  Name2 October 25, 2016
21:  Name3 October 25, 2016
22:  Name4 October 25, 2016
23:  Name5 October 25, 2016
24:  Name6 October 25, 2016

Used data:

dt1 <- fread("16-Oct-16
             Name1
             Name2
             Name3
             17-Oct-16
             Name1
             Name2
             Name3
             Name4
             Name5
             19-Oct-16", header = FALSE)

and:

dt2 <- data.table(c('October 16, 2016', paste0('Name',1:4),
                    'October 17, 2016', paste0('Name',1:12),
                    'October 20, 2016', paste0('Name',1:2),
                    'October 25, 2016', paste0('Name',1:6)))
Jaap
  • 81,064
  • 34
  • 182
  • 193