Working with data.tables, I have a large function do.everything() which calls other functions and returns a data.table. I run it with:
DT.processed <- do.everything(datatable = DT.unprocessed)
No error is reported, but there seems to be a hidden problem somewhere: after running first.manipulation(), NAs appear in some variables.
library(data.table)
first.manipulation <- function(datatable) {
  datatable <- datatable[order(Year)]
  datatable <- datatable[Year %in% c(2001, 2002, 2003)] # for some reason, some NA appear in some variables!
  ...
  return(datatable)
}
second.manipulation <- function(datatable) {
  ...
  return(datatable)
}
do.everything <- function(datatable) {
  datatable <- first.manipulation(datatable = datatable)
  datatable <- second.manipulation(datatable = datatable)
  return(datatable)
}
So the problem seems to be in first.manipulation(). To track it down, I rewrote the function to export the state of datatable before and after each command, to determine what causes this "NA insertion" problem:
first.manipulation <- function(datatable) {
  datatable.step.1 <<- datatable
  datatable <- datatable[order(Year)]
  datatable.step.2 <<- datatable
  datatable <- datatable[Year %in% c(2001, 2002, 2003)]
  datatable.step.3 <<- datatable
  return(datatable)
}
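The same step-by-step inspection can also be done without writing to the global environment. The following is only a minimal sketch under stated assumptions: report.na() and first.manipulation.checked() are hypothetical names, and DT.toy is made-up data used purely for illustration.

```r
library(data.table)

# Hypothetical helper: count NAs per column and label the step.
report.na <- function(datatable, step) {
  na.counts <- colSums(is.na(datatable))
  message(step, ": ", sum(na.counts), " NA value(s) in total")
  invisible(na.counts)
}

# Same two commands as first.manipulation(), with a report after each one.
first.manipulation.checked <- function(datatable) {
  report.na(datatable, "input")
  datatable <- datatable[order(Year)]
  report.na(datatable, "after ordering")
  datatable <- datatable[Year %in% c(2001, 2002, 2003)]
  report.na(datatable, "after filtering")
  datatable
}

# Toy data (made up for illustration only).
DT.toy <- data.table(Year = c(2003, 2001, 2005, 2002),
                     Value = c(1.5, NA, 3.2, 4.1))
DT.toy.out <- first.manipulation.checked(DT.toy)
```

Because each report is computed at the moment the step runs, it does not depend on what happens to the object later.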
I found that neither datatable.step.1 nor datatable.step.2 has the "NA problem", but datatable.step.3 does. I could not understand why this happens, so I rewrote the function as follows:
first.manipulation <- function(datatable) {
  datatable.step.1 <<- datatable
  datatable <- datatable[order(Year)]
  datatable.step.2 <<- datatable
  print(colSums(is.na(datatable)))
  datatable.step.2.debug.1 <<- datatable.step.2
  datatable.step.2.debug.2 <<- datatable.step.2.debug.1[Year %in% c(2001, 2002, 2003)]
  datatable <- datatable[Year %in% c(2001, 2002, 2003)]
  datatable.step.3 <<- datatable
  print(colSums(is.na(datatable)))
  return(datatable)
}
The problem is that even after this, datatable.step.3 is different from datatable.step.2.debug.2: datatable.step.3 shows the "NA problem", while datatable.step.2.debug.2 does not.
Why is this happening?
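To pin down exactly where two such snapshots differ, a comparison along the following lines may help. This is a sketch only: A and B are made-up stand-ins for datatable.step.3 and datatable.step.2.debug.2, not the real data.

```r
library(data.table)

# Toy stand-ins for the two snapshots (made-up values).
A <- data.table(Year = c(2001, 2002), Value = c(10, NA))
B <- data.table(Year = c(2001, 2002), Value = c(10, 20))

# all.equal() returns TRUE, or a character description of the differences.
print(all.equal(A, B))

# Positions where one table has NA and the other does not.
mismatch <- is.na(A) != is.na(B)
print(which(mismatch, arr.ind = TRUE))
```

The row and column indices from which() show which cells differ, which narrows down the variables affected.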
Second, both print(colSums(is.na(datatable))) calls return 0 for every column, even though there are clearly NAs in the data at the second print(). Is this expected?
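One thing that may be worth ruling out here is data.table's reference semantics. The elided `...` steps are not shown, so this is an assumption rather than a diagnosis. Assigning a data.table with `<-` or `<<-` does not copy it, so a snapshot taken that way will reflect any later by-reference update made with `:=`. The sketch below uses made-up data and hypothetical names (snapshot.ref, snapshot.copy) to show the effect, with copy() as the way to take an independent snapshot.

```r
library(data.table)

DT <- data.table(Year = c(2001, 2002), Value = c(10, 20))

snapshot.ref  <- DT        # no copy: both names refer to the same data.table
snapshot.copy <- copy(DT)  # independent deep copy

print(colSums(is.na(DT)))  # counts taken before the later modification

# A later by-reference update, standing in for whatever the elided steps might do.
DT[Year == 2002, Value := NA_real_]

print(colSums(is.na(snapshot.ref)))   # reflects the later update
print(colSums(is.na(snapshot.copy)))  # still the state at snapshot time
```

If something similar happens after datatable.step.3 is assigned, the print() would report zero NAs at that moment while the exported object shows NAs when inspected afterwards.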
Last, is it good practice to use datatable = datatable in the definition of do.everything()?
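For reference, in a call such as f(datatable = datatable) the name on the left is the formal argument of the function being called, and the name on the right is the variable in the calling scope. A small sketch with a hypothetical toy function (count.rows) and made-up data shows that passing by name and by position give the same result here:

```r
# Hypothetical toy function; the argument name mirrors the one in the question.
count.rows <- function(datatable) nrow(datatable)

df <- data.frame(Year = c(2001, 2002, 2003))  # made-up data

by.name     <- count.rows(datatable = df)  # argument matched by name
by.position <- count.rows(df)              # argument matched by position

print(identical(by.name, by.position))
```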
Obviously, I am open to any other recommendations.