0

I am having some difficulties on summarizing data from my database in R. I am looking to pull the data and have it summarized by Quarter.

Below is the code i am using to get a txt output but I am getting errors.

What do I need to do to manipulate the code to run this so that I can have the data be summarized by quarter?

library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "C:/Users/name/Documents/Raw Data/"

# List names of sub-folders (currently grouped by first two characters of 
CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ", "IA-IL", "IM-KZ", "LA-MI", "MJ-MS",
                  "MT-NV", "NW-OH", "OI-PZ", "QA-TN", "TO-UZ",
                  "VA-WA", "WB-ZZ")

# Set location for output
out_directory <- "C:/Users/name/Documents/YTD Master/"
out_filename <- "NEW.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2018-01-01"), as.Date("2018-05-31"))

# Enable or disable filtering of raw files to only grab items bought within 
certain months to save space.
# If false, all files will be scanned for unique items, which will take 
longer and be a larger file.
date_filter <- TRUE


##########
## CODE ##
##########

starttime <- Sys.time()
mastertable <- NULL

for (j in 1:length(in_subfolders)) {
  subfolder <- in_subfolders[j]
  sub_directory <- paste0(in_directory, subfolder, "/")

  ## IMPORT DATA
  in_filenames <- dir(sub_directory, pattern =".txt")

  for (i in 1:length(in_filenames)) {

    # Default value provided for when fast filtering is disabled.
    read_this_file <- TRUE

    # To fast filter the data, we choose to include or exclude an entire file 
based on the date of its first line.
    # WARNING: This is only a valid method if filtering by entire months, 
since that is the amount of data housed in each file.
    if (date_filter) {
      temptable <- fread(paste0(sub_directory, in_filenames[i]), 
colClasses=c(CUSTOMER_TIER = "character"),
                     na.strings = "", nrows = 1)
      temptable[, INVOICE_DT := as.Date(INVOICE_DT)]

      # If date matches, set read flag to TRUE.  If date does not match, set 
read flag to FALSE.
  read_this_file <- temptable[, INVOICE_DT] %within% date_range
}


if (read_this_file) {
  print(Sys.time()-starttime)
  print(paste0("Reading in ", in_filenames[i]))
  temptable <- fread(paste0(sub_directory, in_filenames[i]), colClasses=c(CUSTOMER_TIER = "character"),
                     na.strings = "")
  temptable <- temptable[, lapply(.SD, sum), by = quarter(INVOICE_DT),
                         .SDcols = c("INV_ITEM_ID","Ext Sale", "Ext Total Cost", "CE100", "CE110","CE120","QTY_SOLD","PACKSLIP_WHSL")]

  # Combine into full list
  mastertable <- rbindlist(list(mastertable, temptable), use.names = TRUE)
  # Release unneeded memory
  rm(temptable)

}

 }

}

# Save Final table
print("Saving master table")
fwrite(mastertable, paste0(out_directory, out_filename))
rm(mastertable)

print(Sys.time()-starttime)

After running this scrip the below is the error message i receive.

Error in gsum(INV_ITEM_ID) : Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)

J fast
  • 53
  • 8
  • 3
    Please copy paste your error message and your code in your post... – Powkachu Jun 22 '18 at 15:06
  • the error message and code are linked. Just click on Error Message and that will show you the error message. Click on "Code I am working with" to see that as well. – J fast Jun 22 '18 at 15:15
  • I can also post more of the code if you will need to see the entire format. – J fast Jun 22 '18 at 15:16
  • 1
    being able to copy your data and paste it into R is very helpful when trying to create a solution. you can use dput() to create a version of your data that are easily used here. – Puddlebunk Jun 22 '18 at 15:20
  • what exactly do you mean. How would I use that? – J fast Jun 22 '18 at 15:24
  • 1
    When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Pictures of data and code are not particularly helpful. But it sounds like your data is probably not being read in correctly (some "numeric" columns may be read in as "character" values). – MrFlick Jun 22 '18 at 15:26
  • I get what you are saying now. i need to use r to paste my data so you can see how it is being read in correct? – J fast Jun 22 '18 at 15:33
  • **Never** post a screen shot of your code nor data in place of actually copy and pasting it. No one want to retype your code into their session just to get started. – acylam Jun 22 '18 at 15:33
  • See new links Page 1 and 2 of my code. What will I need to do to use the dput() to read out this data? – J fast Jun 22 '18 at 15:34
  • 2
    Just copy and paste the code and data into your question. Why would you want to take a screen shot of it anyway? It takes like 2 seconds to ctrl + c > ctrl + v. For your data, just type `dput(your_data)` in your R console, and **copy and paste** the output of that into your question. – acylam Jun 22 '18 at 15:38
  • My bad on the screenshots. I have it now posted into the question. i am looking to run the ouput to summarize by quarter – J fast Jun 22 '18 at 15:46
  • Does my question make sense? – J fast Jun 22 '18 at 16:27
  • Now you have copied the code but we still can't do anything with it as it pulls from a directory that we do not have access too. In order to get help you need to supply the group with both the code you are using as well as data that they can run it on. Without both of these things the best you can hope for is the general type framework that i lied out below. You should also consider cutting out just the part of the code that isn't working and the data that would be needed at that point. – Puddlebunk Jun 25 '18 at 15:22
  • Should i not be using gsum if I am just looking to summarize by date/quarter? Some of the data type I want includedis character type. Will I need to change this? – J fast Jun 26 '18 at 14:29

1 Answers1

1

Here is the general approach with some generic data.

library(tidyverse)
library(lubridate)
data.frame(date = seq(as.Date('2010-01-12'), as.Date('2018-02-03'), by = 100),
                 var = runif(30)) %>%
  group_by(quarter(date, with_year = T)) %>%
  summarize(average_var = mean(var))

you can leave out the "with_year = T" if you don't care about the differences between years.

Puddlebunk
  • 493
  • 3
  • 10