1

Each day I run summary report to get a list of the files that were imported and used in a certain project output. Each day's log is stamped with current day's date (Log.Date) and they are read into a running file log.

I want to create a check for this log that will let me know that the same number of historical files are being read in from one day to the next. (i.e. today's imports should equal yesterday's imports, plus a few new files from today.)

Example log:

fileLog <- data.frame('Log.Date'=c('2020-08-01','2020-08-01','2020-08-02','2020-08-02','2020-08-02','2020-08-03','2020-08-03','2020-08-03','2020-08-03'),
                     'System' = c('A','B','A','B','C','A','B','C','D'),
                     'File'=c('file1','file2','file1','file2','file3', 'file1', 'file2','file3','file4')) 

#     Log.Date     System   File
# 1   2020-08-01      A     file1
# 2   2020-08-01      B     file2
# 3   2020-08-02      A     file1
# 4   2020-08-02      B     file2
# 5   2020-08-02      C     file3
# 6   2020-08-03      A     file1
# 7   2020-08-03      B     file2
# 8   2020-08-03      C     file3
# 9   2020-08-03      D     file4

I group the file log by the Log.Date and System to get a file count for each day, then pivot the data so that I can see and calculate any change from the previous day. I want to create the calculation by position: Change = 'last column' minus 'next to last column', since the column names will always be changing.

fileLog <- fileLog %>% 
  arrange(Log.Date) %>% 
  group_by(Log.Date, System) %>% 
  summarise(File.Count = length(unique(File))) %>% 
  ungroup() %>% 
  pivot_wider(names_from = Log.Date, values_from = File.Count) %>% 
  replace(is.na(.), 0)
fileLog <- as.data.frame(fileLog) %>% 
  mutate(Change = rev(fileLog)[1] - rev(fileLog)[2]) 

This does let me view the change I'm looking for, however the resulting 'Change' column is of class 'data.frame'. This causes me problems for exporting it as part of a report.

#        System  2020-08-01  2020-08-02  2020-08-03  Change.2020-08-03
# 1        A          1          1          1          0
# 2        B          1          1          1          0
# 3        C          0          1          1          0
# 4        D          0          0          1          1

> class(fileLog$Change)
[1] "data.frame"

Using the absolute column names works fine:

mutate(Change = fileLog$'2020-08-03' - fileLog$'2020-08-02')

I've tried other iterations of calling the position, which nets the same problem

mutate(Change = fileLog[,ncol(fileLog)] - fileLog[,ncol(fileLog)-1])

I've also tried coercing the column to numeric, gives the error: 'list' object cannot be coerced to type 'double'

  mutate(Change = as.numeric(check_start[,-1]) - as.numeric(check_start[,-2]))
  mutate(Change = as.numeric(rev(check_start)[1]) - as.numeric(rev(check_start)[2])) 

Question: Can someone suggest the correct way to make this positional calculation to output a 'Change' column that is not of class=data.frame?

gkoo
  • 13
  • 2
  • 1
    There might also be value in considering whether this can be done in the long format. Using `lag` or `diff` or similar functions can easily deal with day-to-day changes, and you can do this within groups using `group_by` as well. – thelatemail Aug 27 '20 at 00:43
  • Also, this ancient question from the decade before last may be helpful for understanding all of the indexing differences with `[`, `[[`, `$` etc - https://stackoverflow.com/questions/1169456/the-difference-between-bracket-and-double-bracket-for-accessing-the-el – thelatemail Aug 27 '20 at 00:49
  • Thank you! I'll look into the lag and diff, as I can see where this could be useful. Otherwise, the post about [, [[, $ was SUPER helpful. – gkoo Aug 27 '20 at 13:25

1 Answers1

1

Few changes to your previous step :

  1. Instead of length(unique(File)), used n_distinct

  2. Instead of adding another replace step after pivot_wider use values_fill.

fileLog <- fileLog %>% 
  arrange(Log.Date) %>% 
  group_by(Log.Date, System) %>% 
  summarise(File.Count = n_distinct(File)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = Log.Date, values_from = File.Count, values_fill = 0) 

Now to subtract value from last and second last column you can use ncol which will give you column number of last column.

fileLog <- fileLog %>% mutate(Change = .[[ncol(.)]] - .[[ncol(.) - 1]]) 
fileLog
# A tibble: 4 x 5
  System `2020-08-01` `2020-08-02` `2020-08-03` Change
  <chr>         <int>        <int>        <int>  <int>
1 A                 1            1            1      0
2 B                 1            1            1      0
3 C                 0            1            1      0
4 D                 0            0            1      1
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213