
I am trying to optimize my R code (it relies heavily on data.table) under Ubuntu.

The code below shows some data transformation I make on my data.table object, including:

  • date variable calculation
  • time variable calculation
  • filtration
  • string variable calculation

What concerns me is that between loading the data and finishing these calculations, RAM consumption increases quite drastically.

Bytes of RAM vs seconds of runtime:

[plot of RAM consumption rising over the course of the run]

At the start my data.table object dat occupies around 1.2 Gb according to object.size(), with total RAM usage at around 3 Gb.

After these calculations dat occupies around 2.1 Gb according to object.size(), while total RAM usage has climbed to around 5.5 Gb.

So there is an overhead RAM consumption of about 5.5 - 3 - (2.1 - 1.2) = 1.6 Gb: the object grew by a much smaller amount than RAM usage did over the same period.

Question: Could you give me some guidance on how to mitigate this while making the same transforms with data.table?
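
For reference, this is how I snapshot R-level allocations around a single step (a minimal sketch using only base R; note that gc() tracks R's own heap, not what the OS reports for the process):

before <- gc(reset = TRUE) # reset the peak-usage counters

## ... run one transformation step on dat here ...

after <- gc()

after[, "max used"] - before[, "max used"] # peak extra Ncells/Vcells used during the step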

## date var

dat[, Date := as.Date(
                         as.POSIXct(as.numeric(When) / 1000, origin = "1970-01-01", tz = "UTC")
                    )
    ] # tz must be "UTC" (case-sensitive); the format argument to as.Date() was unused and is dropped
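
A lighter variant I could try (a sketch, assuming When holds epoch milliseconds, as above): deriving the UTC date with plain day arithmetic skips the intermediate full-length POSIXct vector.

dat[, Date := as.Date(floor(as.numeric(When) / 86400000), origin = "1970-01-01")] # 86400000 ms in a day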

## limit report dates to minimal needed range

date_tbl <- dat[, .N, by = Date]

if(
     nrow(date_tbl) < 3 * minimum_train_sample / 7 * 5
     ) # assume weekends are not busy at all across the whole organization
{

     stop('not enough historical data to run any detection: sparse dates with data')

}


report_min_date <- min(dat[, Date])


## limit report wheres to minimal available data

where_date_actions <- dat[,
                          list(
                               min_date = min(Date)
                               , max_date = max(Date)
                               , unique_dates = uniqueN(Date) # uniqueN() counts distinct values without materializing unique(Date)
                          )
                          , by = Where
                          ]

dat <- dat[Where %in% where_date_actions[min_date <= (last_reported_date - last_days_predict - minimum_train_sample)
                                         & unique_dates >= minimum_train_sample / 7 * 5
                                         , Where]
           , ]

if(nrow(dat) == 0)
{

     stop('not enough data by Where to run any detection: none of the monitored subsystems accumulated enough training data')

}
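
As far as I know, data.table cannot delete rows by reference, so the dat <- dat[...] subset above unavoidably holds the old and the new copy of the table in RAM at the same time. The best I see (a sketch; keep_wheres is a hypothetical name) is to isolate the filter vector and trigger collection right after the reassignment:

keep_wheres <- where_date_actions[min_date <= (last_reported_date - last_days_predict - minimum_train_sample)
                                  & unique_dates >= minimum_train_sample / 7 * 5
                                  , Where] # small vector of qualifying Where values

dat <- dat[Where %in% keep_wheres]

gc() # the superseded copy of dat becomes collectable right after the reassignment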


## time vars

dat[, datetime_when := as.POSIXct(
                                        as.numeric(When) / 1000
                                        , origin = "1970-01-01"
                                        , tz = "UTC"
                                   )
    ]

dat[, Hour := format(
     as.POSIXct(as.numeric(When) / 1000, origin = "1970-01-01", tz = "UTC")
     , "%Y-%m-%d %H"
                    )
    ]

gc()
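
Since datetime_when already holds the converted timestamps, Hour could presumably be formatted from it directly, skipping the second full-length POSIXct construction (a sketch):

dat[, Hour := format(datetime_when, "%Y-%m-%d %H")] # datetime_when already carries tz = "UTC"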


## convert strings and names

var_names <- 'Who'

dat[, (var_names) := lapply(.SD, function(x) tolower(as.character(x))), .SDcols = var_names] # := modifies by reference, so the dat <- reassignment is unnecessary

gc()
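
An equivalent by-reference form in case var_names grows to several columns (a sketch; set() overwrites each column in place, one at a time):

for (v in var_names) set(dat, j = v, value = tolower(as.character(dat[[v]])))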


## replace blanks

dat[, ObjectPath:= ifelse(is.na(ObjectPath), 'n/a', ObjectPath)] # get rid of NAs

dat[, ObjectPath:= ifelse(ObjectPath == '', 'n/a', ObjectPath)] # get rid of ''


dat[, Workstation:= ifelse(is.na(Workstation), 'n/a', Workstation)] # get rid of NAs

dat[, Workstation:= ifelse(Workstation == '', 'n/a', Workstation)] # get rid of ''


dat[, Who:= ifelse(Who == '', 'n/a', Who)] # get rid of '' in Who


gc()
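
The same replacements can also be done by reference, assigning only to the matching rows; unlike ifelse(), this avoids allocating several full-length temporary vectors (a sketch):

dat[is.na(ObjectPath) | ObjectPath == '', ObjectPath := 'n/a']

dat[is.na(Workstation) | Workstation == '', Workstation := 'n/a']

dat[Who == '', Who := 'n/a']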


## time between successive events

setorder(dat, Who, datetime_when)

dat[
     , next_time_diff_secs := as.numeric(
          datetime_when - shift(datetime_when, 1)
          , units = 'secs'
                                       )
    , by = Who
    ]
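
Because When is already an epoch-millisecond value, the same gaps could be computed on the raw numbers, bypassing the POSIXct/difftime machinery inside each group (a sketch; when_num is a hypothetical helper column):

dat[, when_num := as.numeric(When)] # temporary helper column

dat[, next_time_diff_secs := (when_num - shift(when_num, 1)) / 1000, by = Who]

dat[, when_num := NULL] # drop the helper again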

dat[, diff_object_action_after:= What_Action]

dat[, diff_object_action_before:= shift(What_Action, n = 1)]


## cumulative count

setorder(dat, Who, datetime_when)

dat[, counter := 1:.N, by = Who]
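
data.table's rowid() builds the same per-group counter in a single vectorized pass (a sketch):

dat[, counter := rowid(Who)]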
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 12 '18 at 16:12
  • @MrFlick Is it OK to prepare a reproducible example with several million rows to reproduce my scale of task? – Alexey Burnakov Jul 12 '18 at 16:22
  • It would probably be better if you use commands to simulate data of a given size rather than requiring the download of a large set of data. With performance-type questions, a lot of stuff is very specific to the data itself. Without being able to actually run and profile the code, it's not easy to know exactly where the bottlenecks are. Have you tried [profiling the code yourself](http://adv-r.had.co.nz/Profiling.html)? – MrFlick Jul 12 '18 at 16:25
  • @MrFlick, yes, I can simulate a big table. I do profiling in terms of memory, not in terms of runtime, so a usual profiler routine is not priority. I am puzzled by RAM behaviour under Linux. May I just ask a specific question, without going into too much detail about my case? Does it help to initialize new fields in a data.table object before actually filling them (in terms of burden on RAM)? For example, I do dat[, Date := as.Date('2000-01-01')], and then I do the actual calculation for that field. – Alexey Burnakov Jul 12 '18 at 16:37
