0

I am new to stackoverflow, but not new to googling the heck out of my R script issues. This one has me stumped.

I am trying to find out why my current script doesn't work:

SCRIPT: My example is a tibble (called "Squishframe") of 13 columns (11 of them to be treated as characters and unique variables; 2 of them to be numeric). I group by the character columns and sum all the other numeric columns using the summarize (and Sum) functions. This new tibble is called "Groupframe".

The main issue appears to be that:

  • when there is a row with unique variables AND numeric data,
  • shared with another row with the same unique variables AND BLANK numeric data,
  • the row with numeric data is ignored!

    #Create pipeline that first turns the Squishframe number columns into numeric type, and then groups the data by the Text-based columns
    Groupedframe <- Squishframe %>% 
      mutate_at(vars(AF, Billable), funs(gsub("\\$", "", .))) %>%
      mutate_at(vars(AF, Billable), funs(as.numeric)) %>%
      group_by_at(vars(`File Name`, Source, Date, Colour, Size, MediumSource, Campaign, Version, Signed, CountryCode, State)) %>%
    
    #Summarize all non-text-based columns by groups (sum function used)
     summarize_all(funs(sum))
    

If you would like a zip of my live files as an example, I have them, but I'm unsure how to provide them right now.

It is a little different from my above, as it imports from an Raw Data folder, and deposits the final dataset into the output folder.


Input: A tibble called "Squishframe" in the script above. I am also aware some of these columns are empty. This is intentional:

structure(list(`File Name` = c("Bricks.csv", "Bricks.csv", "Bricks.csv", 
"Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv", 
"Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv", "Bricks.csv"
), Source = c("Installations", "Installations", "Installations", 
"Installations", "Installations", "Installations", "Installations", 
"Installations", "Installations", "Installations", "Installations", 
"Installations", "Installations"), Date = structure(c(17934, 
17934, 17934, 17934, 17945, 17953, 17956, 17957, 17945, 17945, 
17945, 17945, 17945), class = "Date"), Colour = c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Size = c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), MediumSource = c("Baliest", 
"Baliest", "Baliest", "Baliest", "Baliest", "Baliest", "Baliest", 
"Baliest", "Klimt", "Klimt", "Klimt", "Klimt", "Klimt"), Campaign = c("Big Campaign Baliest", 
"Big Campaign Baliest", "Big Campaign Baliest", "Big Campaign Baliest", 
"Big Campaign Baliest", "Big Campaign Baliest", "Big Campaign Baliest", 
"Big Campaign Baliest", "Big Campaign Klimt", "Big Campaign Klimt", 
"Big Campaign Klimt", "Big Campaign Klimt", "Big Campaign Klimt"
), Version = c("Creative_V2", "Creative_V1", "Creative_V3", "Creative_V2", 
"Creative_V1", "Creative_V3", "Creative_V2", "Creative_V1", "Creative_V1", 
"Creative_V1", "Creative_V1", "Creative_V1", "Creative_V1"), 
    Signed = c("Tigle iOS", "Tigle iOS", "Tigle iOS", "Tigle iOS", 
    "Tigle iOS", "Tigle iOS", "Tigle iOS", "Tigle iOS", "200289", 
    "200289", "200289", "200289", "200289"), CountryCode = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), State = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), AF = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Billable = c(NA, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA)), row.names = c(NA, 
-13L), class = c("tbl_df", "tbl", "data.frame"))

EXPECTED RESULT:

My script tries to compress data. It should consolidate all numeric data into rows defined by ONLY unique variables. In other words, if a row has unique variables, that row should store all the numeric data, instead of its original state where there are multiple rows with the same variables. I should get the same numeric column totals, just with less rows.

I should get a resulting dataframe with a sum of 8 in the "Billable" column. And no duplicates.

File Name   Source  Date    Colour  Size    MediumSource    Campaign    Version Signed  CountryCode State   AF  Billable
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V2 Tigle iOS               1
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V3 Tigle iOS               1
Bricks.csv  Installations   2/18/2019           Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1
Bricks.csv  Installations   2/18/2019           Klimt   Big Campaign Klimt  Creative_V1 200289              1
Bricks.csv  Installations   2/26/2019           Baliest Big Campaign Baliest    Creative_V3 Tigle iOS               1
Bricks.csv  Installations   3/1/2019            Baliest Big Campaign Baliest    Creative_V2 Tigle iOS               1
Bricks.csv  Installations   3/2/2019            Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1

CURRENT RESULT: For the most part, it is successful. But I have a case where the script that happily sums all other rows, ignores some. This means that some numeric data is overlooked completely, despite sharing variables with another row. This results in data loss. I get less rows, but the column totals are less. Blanks appear in the "Billable" column where there should be data for those unique combination of variables. I am not trying to remove these blanks, these blanks should be 1s instead:

File Name   Source  Date    Colour  Size    MediumSource    Campaign    Version Signed  CountryCode State   AF  Billable
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V2 Tigle iOS               
Bricks.csv  Installations   2/7/2019            Baliest Big Campaign Baliest    Creative_V3 Tigle iOS               1
Bricks.csv  Installations   2/18/2019           Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1
Bricks.csv  Installations   2/18/2019           Klimt   Big Campaign Klimt  Creative_V1 200289              
Bricks.csv  Installations   2/26/2019           Baliest Big Campaign Baliest    Creative_V3 Tigle iOS               1
Bricks.csv  Installations   3/1/2019            Baliest Big Campaign Baliest    Creative_V2 Tigle iOS               1
Bricks.csv  Installations   3/2/2019            Baliest Big Campaign Baliest    Creative_V1 Tigle iOS               1

To me, it seems that summing blank+1 in these cases = blank. It should be 1. The current billable column sums to 6 instead of 8 as it should be as afmorementioned.


Thank you all for your time! I am grateful on any feedback on my script and my methods of writing this post. If it can be bettered in any way, I am open to it!

micoo
  • 11
  • 4
  • 3
    Welcome to SO. Thank you for trying to create an example, however, this is not easily reproducible. Perhaps create some small dataset with: `df <- data.frame(...)` with the bare minimum to reproduce your starting point and expected results. See [How to make a great R reproducible example](https://stackoverflow.com/q/5963269/2572423) for more pointers. – JasonAizkalns Mar 21 '19 at 13:56
  • Hi JasonAizkalns! Thanks for the greeting. I have taken note on the reproducibility element, and have edited my input data to be useable in this fashion. I did not see the point of adjusting the expected outcome or current outcome which are aesthetically understandable and easier compared in their current form. Hope that's not too stubborn an idea. :) thanks – micoo Mar 22 '19 at 01:37
  • Thank you for using `dput`, but this is still very hard to discern what you are going after. That being said, the following produces your desired results from the example (assuming your data is stored in `df <- data.frame()`)... `library(dplyr); distinct(df) %>% filter(! is.na(Billable))` – JasonAizkalns Mar 22 '19 at 13:31
  • Thanks for answering, but even though that will produce the expected result for the limited set I've provided, I would like to understand why my current code (at the end of my post) isn't doing the job. I'd like to focus on that. I am not simply trying to remove NAs. I am intending to group them if distinct (some rows with 2 or 3 or more! IF the non-numeric variables are the same across rows). Why doesn't grouping and summarizing with Sum working to this effect? I have thousands of rows that it does work for but not with these!. I've edited my post to delineate my Expected and Current results. – micoo Mar 22 '19 at 14:25
  • I've also edited it for clarity. Brought my script up to the top, and included horizontal breaks. – micoo Mar 22 '19 at 14:32
  • What if you just add `na.rm = TRUE` to your summarize_all -- `summarize_all(~ sum(., na.rm = TRUE))` – JasonAizkalns Mar 22 '19 at 14:53
  • I think I placed too much emphasis on the inclusion of NAs. NAs should be removed, in which case we are back to my original problem. Where rows are dropped completely. And the sum of the Billable is 6 and not 8 as it should be. I've edited my original post to include this detail with greater focus. Also, I hope it's clear how thankful I am for your help thus far!!! Id like to move this to chat, but my reputation is too low :( – micoo Mar 22 '19 at 15:11

1 Answers1

0

The sum of Billable is 8 with either of these implementations. Since you have edited the question several times, you may want to consider deleting and asking a new one.

library(tidyverse)

res1 <- df %>% 
  mutate_at(vars(AF, Billable), ~ gsub("\\$", "", .)) %>% 
  mutate_at(vars(AF, Billable), ~ as.numeric(.)) %>%
  group_by_at(vars(`File Name`:State)) %>%
  summarize_all(~ sum(., na.rm = TRUE))
res1
#> # A tibble: 8 x 13
#> # Groups:   File Name, Source, Date, Colour, Size, MediumSource, Campaign,
#> #   Version, Signed, CountryCode [8]
#>   `File Name` Source Date       Colour Size  MediumSource Campaign Version
#>   <chr>       <chr>  <date>     <lgl>  <lgl> <chr>        <chr>    <chr>  
#> 1 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
#> 2 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
#> 3 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
#> 4 Bricks.csv  Insta~ 2019-02-18 NA     NA    Baliest      Big Cam~ Creati~
#> 5 Bricks.csv  Insta~ 2019-02-18 NA     NA    Klimt        Big Cam~ Creati~
#> 6 Bricks.csv  Insta~ 2019-02-26 NA     NA    Baliest      Big Cam~ Creati~
#> 7 Bricks.csv  Insta~ 2019-03-01 NA     NA    Baliest      Big Cam~ Creati~
#> 8 Bricks.csv  Insta~ 2019-03-02 NA     NA    Baliest      Big Cam~ Creati~
#> # ... with 5 more variables: Signed <chr>, CountryCode <lgl>, State <lgl>,
#> #   AF <dbl>, Billable <dbl>
sum(res1$Billable) == 8
#> [1] TRUE

res2 <- df %>%
  distinct() %>%
  filter(! is.na(Billable))
res2
#> # A tibble: 8 x 13
#>   `File Name` Source Date       Colour Size  MediumSource Campaign Version
#>   <chr>       <chr>  <date>     <lgl>  <lgl> <chr>        <chr>    <chr>  
#> 1 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
#> 2 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
#> 3 Bricks.csv  Insta~ 2019-02-07 NA     NA    Baliest      Big Cam~ Creati~
#> 4 Bricks.csv  Insta~ 2019-02-18 NA     NA    Baliest      Big Cam~ Creati~
#> 5 Bricks.csv  Insta~ 2019-02-26 NA     NA    Baliest      Big Cam~ Creati~
#> 6 Bricks.csv  Insta~ 2019-03-01 NA     NA    Baliest      Big Cam~ Creati~
#> 7 Bricks.csv  Insta~ 2019-03-02 NA     NA    Baliest      Big Cam~ Creati~
#> 8 Bricks.csv  Insta~ 2019-02-18 NA     NA    Klimt        Big Cam~ Creati~
#> # ... with 5 more variables: Signed <chr>, CountryCode <lgl>, State <lgl>,
#> #   AF <lgl>, Billable <int>
sum(res2$Billable) == 8
#> [1] TRUE
JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
  • Hi Jason, Thanks so much! Implementation 1 really does the trick. Implementation 2 is certainly a good solution for my ask which could have been clearer. – micoo Mar 25 '19 at 13:55
  • I understand most of the changes you made. I know my understanding of formulae might be limited, but could you help explain what the inclusion of tildes (~) do to fix the code? – micoo Mar 25 '19 at 14:06
  • @micoo The tilde shouldn't really change your implementation. It's just a shortcut for an anonymous function and the fact that `funs(.)` is soft-deprecated, so it's better to get in the habit of using `~ fx()` rather than `funs(.)` because eventually `funs(.)` will go away. Specifically, `~ as.numeric(.)` should be equivalent to `funs(as.numeric(.))`. – JasonAizkalns Mar 25 '19 at 14:09
  • noted. Very thankful for the thoughtful and thorough solution provided. – micoo Mar 27 '19 at 02:38