I think I understand what you are asking for, though admittedly I am not familiar with the function that reads data in by chunks and uses stringsAsFactors = TRUE
while making assumptions a priori on the makeup of the data (and does not offer a way to superimpose other characteristics of the factors). I offer in advance the suggestion that either you are misinterpreting the function or you are mis-applying it to your specific data problem.
I'm easily wrong in problems like this, so I'll try to address the inferred problem regardless.
You claim that the function will be reading in the first 8 elements, on which it do its processing. It must know that there are (in this case) four factors to be considered; the easiest way, as you are asking, is to have each of these factors present in each chunk. Once it has processed these first 8 rows, it will then read the second 8 elements. In the case of your sample data, this does not work since the second 8 elements does not include a 5.
I'll define slightly augmented data later to remedy this.
Assumptions / Rules
the number of unique values overall in the data must be no larger than the size of each chunk;
each factor must have at least as many occurrences as the number of chunks to be read; and
all chunks have precisely chunksize
elements in them (i.e., full) except for the last chunk will have between 1 and chunksize
elements in it; ergo,
the last chunk has at least as many elements as there are unique values.
Function Definition
Given those rules, here's some code. This is most certainly not the only solution, and it may not perform well with significantly large datasets (I have not done extensive testing).
myfunc <- function(x, chunksize = 8) {
numChunks <- ceiling(length(x) / chunksize)
uniqx <- unique(x)
lastChunkSize <- chunksize * (1 - numChunks) + length(x)
## check to see if it is mathematically possible
if (length(uniqx) > chunksize)
stop('more factors than can fit in one chunk')
if (any(table(x) < numChunks))
stop('not enough of at least one factor to cover all chunks')
if (lastChunkSize < length(uniqx))
stop('last chunk will not have all factors')
## actually arrange things in one feasible permutation
allIndices <- sapply(uniqx, function(z) which(z == x))
## fill one of each unique x into chunks
chunks <- lapply(1:numChunks, function(i) sapply(allIndices, `[`, i))
remainder <- unlist(sapply(allIndices, tail, n = -3))
remainderCut <- split(remainder, ceiling(seq_along(remainder)/4))
## combine them all together, wary of empty lists
finalIndices <- sapply(1:numChunks,
function(i) {
if (i <= length(remainderCut))
c(chunks[[i]], remainderCut[[i]])
else
chunks[[i]]
})
x[unlist(finalIndices)]
}
Supporting Execution
In your offered data, you have 18 elements requiring three chunks. Your data will fail on two accounts: three of the elements only occur twice, so the third chunk will most certainly not contain all elements; and your last chunk will only have two elements, which cannot contain each of the four.
I'll augment your data to satisfy both misses, with:
dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)
which will not work unadjusted, if for no other reason than the last chunk will only have four 5's in it.
The solution:
myfunc(dat3, chunksize = 8)
## [1] 4 7 5 8 4 4 4 4 4 7 5 8 4 4 5 5 4 7 5 8
(spaces were added to the output for easy inspection). Each chunk has 4, 7, 5, 8
as its first four elements, therefore all factors are covered in each chunk.
Breakdown
A quick walkthrough (using debug(myfunc)
), assuming x = dat3
and chunksize = 8
. Jumping down the code:
## Browse[2]> uniqx
## [1] 4 7 5 8
## Browse[2]> allIndices
## [[1]]
## [1] 1 6 7 8 9 10 11 13 14
## [[2]]
## [1] 2 4 12
## [[3]]
## [1] 3 17 18 19 20
## [[4]]
## [1] 5 15 16
This shows the indices for each unique element. For example, there are 4's located at indices 1, 6, 7, etc.
## Browse[2]> chunks
## [[1]]
## [1] 1 2 3 5
## [[2]]
## [1] 6 4 17 15
## [[3]]
## [1] 7 12 18 16
There are three chunks to be filled, and this list starts forming those chunks. In this example, we have placed indices 1, 2, 3, and 5 in the first chunk. Looking back at allIndices
, you'll see that these represent the first instance of each of uniqx
, so the first chunk now contains c(4, 7, 5, 8)
, as do the other two chunks.
At this point, we have satisfied the basic requirement that each unique element be found in every chunk. The rest of the code fills with the remaining elements.
## Browse[2]> remainder
## [1] 8 9 10 11 13 14 19 20
These are all indices that have so far not been added to the chunks.
## Browse[2]> remainderCut
## $`1`
## [1] 8 9 10 11
## $`2`
## [1] 13 14 19 20
Though we have three chunks, we only have two lists here. This is fine, we have nothing (and need nothing) to add to the last chunk. We will then zip-merge these with chunks
to form a list of index lists. (Note: you might be tempted to try mapply(function(a, b) c(a, b), chunks, remainderCut)
, but you may notice that if remainderCut
is not the same size as chunks
, as we see here, then its values are recycled. Not acceptable. Try it.)
## Browse[2]> finalIndices
## [[1]]
## [1] 1 2 3 5 8 9 10 11
## [[2]]
## [1] 6 4 17 15 13 14 19 20
## [[3]]
## [1] 7 12 18 16
Remember, each number represents the index from within x
(originally dat3
). We then unlist
this split-vector and apply the indices to the data.