
I am working with data.tables and have a large function, do.everything(), which calls other functions and returns a data.table. I run it with:

DT.processed <- do.everything(datatable = DT.unprocessed)

No error is reported, but there seems to be a hidden problem somewhere, because after running first.manipulation() some NAs appear in some variables:

library(data.table)
first.manipulation <- function(datatable) {
  datatable <- datatable[order(Year)]
  datatable <- datatable[Year %in% c(2001, 2002, 2003)] # for some reason, some NA appear in some variables!
  ...
  return(datatable)
}

second.manipulation <- function(datatable) {
  ...
  return(datatable)
}

do.everything <- function(datatable) {
  datatable <- first.manipulation(datatable = datatable)
  datatable <- second.manipulation(datatable = datatable)
  return(datatable)
}

So there seems to be a problem in first.manipulation().

To track it down, I rewrote the function so that it exports the state of datatable (with <<-) before and after each command, in order to determine what causes this "NA insertion" problem:

first.manipulation <- function(datatable) {
  datatable.step.1 <<- datatable
  datatable <- datatable[order(Year)]
  datatable.step.2 <<- datatable
  datatable <- datatable[Year %in% c(2001, 2002, 2003)]
  datatable.step.3 <<- datatable
  return(datatable)
}

I found that neither datatable.step.1 nor datatable.step.2 has the "NA problem", but datatable.step.3 does. I could not understand why this is happening, so I rewrote the function as follows:

first.manipulation <- function(datatable) {
  datatable.step.1 <<- datatable
  datatable <- datatable[order(Year)]
  datatable.step.2 <<- datatable
  print(colSums(is.na(datatable)))

  datatable.step.2.debug.1 <<- datatable.step.2
  datatable.step.2.debug.2 <<- datatable.step.2.debug.1[Year %in% c(2001, 2002, 2003)]

  datatable <- datatable[Year %in% c(2001, 2002, 2003)]
  datatable.step.3 <<- datatable
  print(colSums(is.na(datatable)))

  return(datatable)
}

The problem is that even after this, datatable.step.3 differs from datatable.step.2.debug.2, although both come from the same filter applied to the same data: datatable.step.3 shows the "NA problem", while datatable.step.2.debug.2 does not.
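
For reference, a minimal sketch of how two such snapshots could be compared. The real DT.unprocessed is not shown here, so the toy table and its Value column are assumptions, and the names step.2, step.3, debug.1 and debug.2 merely stand in for the exported objects above.

```r
library(data.table)

# Hypothetical toy data; the real DT.unprocessed is not shown in the question.
DT <- data.table(Year = c(2003, 2001, 2005, 2002), Value = c(10, NA, 30, 40))

step.2  <- DT[order(Year)]
step.3  <- step.2[Year %in% c(2001, 2002, 2003)]
debug.1 <- step.2
debug.2 <- debug.1[Year %in% c(2001, 2002, 2003)]

# TRUE when both tables hold the same data, otherwise a description of the
# differences; check.attributes = FALSE ignores keys and indices
same.data <- all.equal(step.3, debug.2, check.attributes = FALSE)
print(same.data)

# Per-column NA counts for each snapshot, side by side
na.table <- rbind(step.3  = colSums(is.na(step.3)),
                  debug.2 = colSums(is.na(debug.2)))
print(na.table)
```

On real data, the text returned by all.equal() and the rows of na.table would point to the columns in which the two snapshots differ.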

Why is this happening?

Second, both print(colSums(is.na(datatable))) calls return 0 everywhere, although datatable.step.3 clearly contains NAs at the time of the second print(). Is this expected?
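
As a point of comparison, a small sketch of what colSums(is.na(...)) counts on a made-up table (the table toy and its columns are assumptions, not the data from this question). It counts only genuine missing values, so the string "NA" or an empty string in a character column is not counted.

```r
library(data.table)

# Made-up table for illustration; not the data from the question.
toy <- data.table(
  Year  = c(2001, 2002, NA),
  Label = c("a", "NA", "")   # the string "NA" and "" are not missing values
)

# Number of genuine NA values per column
na.counts <- colSums(is.na(toy))
print(na.counts)

# anyNA() gives a quick yes/no answer for a single column
print(anyNA(toy$Year))
print(anyNA(toy$Label))
```

Whether something like this applies to the real data cannot be told from the question alone; the sketch only shows how the count behaves.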

Last, is it good practice to pass the argument by name, as in datatable = datatable, in the definition of do.everything()? I am open to any other recommendations.

Konstantinos
  • I don't have an answer for your problem, sorry, but if you're programming a whole bunch of functions and putting them inside another function, you might want to test your individual functions with unit tests, and then do an integration test to check that everything works nicely together. – brodrigues Jul 19 '16 at 19:16
  • It's great that you've narrowed your problem down to `first.manipulation` - even to a particular line of the function! (1) Why do you share `second.manipulation` and `do.everything`? They seem irrelevant to your problem. (2) Why don't you share any data that reproduces the problem? The entirety of this question could be *"I have this data [`dput()` of data or simulated data right before the problematic line], when I run [problematic line] it creates missing values. Why?"* – Gregor Thomas Jul 19 '16 at 19:39
  • As it is, your code looks fine at a glance, but we can't do any testing because you haven't shared any data. [Have a look at how to make a good reproducible example in R](http://stackoverflow.com/q/5963269/903061). – Gregor Thomas Jul 19 '16 at 19:41
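
Following the suggestions in the comments, a minimal sketch of a reproducible example with a small check attached. The simulated table DT.example and its Value column are hypothetical stand-ins for DT.unprocessed, which is not shown in the question.

```r
library(data.table)

# Simulated stand-in for DT.unprocessed; the Value column is hypothetical.
set.seed(1)
DT.example <- data.table(
  Year  = sample(2000:2005, 20, replace = TRUE),
  Value = rnorm(20)
)

# The two lines of first.manipulation() that are under suspicion
before <- DT.example[order(Year)]
after  <- before[Year %in% c(2001, 2002, 2003)]

# A tiny test: filtering rows should not create NAs that were not there before
stopifnot(sum(is.na(before)) == 0, sum(is.na(after)) == 0)
stopifnot(all(after$Year %in% c(2001, 2002, 2003)))

# dput() prints a copy-pasteable definition of the data for sharing
dput(head(as.data.frame(before)))
```

If the checks pass on simulated data but fail on the real table, the dput() output of a few offending rows would be the natural thing to include in the question.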
