
I am using a package for a GLM that reads a data frame in chunks. It requires that all levels of a factor occur in every chunk. I am looking for a good strategy to rearrange the observations so as to maximise the probability of having all values in every chunk.

The example would be

c(4,7,4,4,4,4,4,4,4,4,4,7,4,4,8,8,5,5)

for a chunk size of 8, the best rearrangement would be

c(4,7,5,8,4,4,4,4,4,4,4,7,4,4,8,8,4,5,8)

Is there some elegant way to shuffle the data around?

I just saw the comments: the library itself is called bigglm (it reads the data chunkwise). The vectors should be of equal length. The question is really just about rearranging the data so that most levels are present in most chunks.

An example of the column of the data frame can be found here (https://www.dropbox.com/s/cth8kwcq9ph5j0p/d1.RData?dl=0).

The most important thing in this case is that as many levels as possible are present in as many chunks as possible. The smaller the chunk, the less memory is needed when reading it in. I think it is reasonable to assume 10 chunks.
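For illustration, something like this counts how many distinct levels land in each of 10 contiguous chunks (just a sketch; `x` stands in for the factor column from the linked data frame):

chunkSize <- ceiling(length(x) / 10)
chunkId <- ceiling(seq_along(x) / chunkSize)
sapply(split(as.character(x), chunkId), function(ch) length(unique(ch)))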

Hein
  • How do you define a chunk? – Roman Luštrik Jan 09 '15 at 22:12
  • This description isn't very clear. Could you give documentation from the package itself that describes what you need? – rsoren Jan 09 '15 at 23:22
  • Do you want to just randomly shuffle the numbers? You can do that with `sample()` – Marat Talipov Jan 09 '15 at 23:27
  • your first vector has 18 values. Your second has 19. This question really isn't well-defined. – A5C1D2H2I1M1N2O1R2T1 Jan 10 '15 at 04:02
  • With a chunk size of 8, a length of 18 leads to 3 chunks. With only two occurrences each of 5, 7, and 8, it is not possible to have each of them in all of the chunks. Though I think I understand your need here, as @Reed suggested, you need to clarify your problem. As AnandaMahto suggested, you need a better [reproducible problem](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) as well. – r2evans Jan 10 '15 at 05:18

1 Answer


I think I understand what you are asking for, though admittedly I am not familiar with the function that reads data in by chunks and uses stringsAsFactors = TRUE while making assumptions a priori on the makeup of the data (and does not offer a way to superimpose other characteristics of the factors). I offer in advance the suggestion that either you are misinterpreting the function or you are mis-applying it to your specific data problem.

I'm easily wrong in problems like this, so I'll try to address the inferred problem regardless.

You claim that the function will read in the first 8 elements, on which it does its processing. It must know that there are (in this case) four factor levels to be considered; the easiest way, as you are asking, is to have each of these levels present in each chunk. Once it has processed those first 8 rows, it will read the second 8 elements. With your sample data this does not work, since the second 8 elements do not include a 5.

I'll define slightly augmented data later to remedy this.
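To see the failure concretely, here is the original example vector split into chunks of 8 (a quick illustration only, not part of the function below):

x <- c(4,7,4,4,4,4,4,4,4,4,4,7,4,4,8,8,5,5)
split(x, ceiling(seq_along(x) / 8))
## $`1`
## [1] 4 7 4 4 4 4 4 4
## $`2`
## [1] 4 4 4 7 4 4 8 8
## $`3`
## [1] 5 5

The second chunk has no 5, and the last chunk has nothing but 5's.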

Assumptions / Rules

  • the number of unique values overall in the data must be no larger than the size of each chunk;

  • each factor must have at least as many occurrences as the number of chunks to be read; and

  • all chunks have precisely chunksize elements in them (i.e., are full), except the last chunk, which will have between 1 and chunksize elements in it; ergo,

  • the last chunk has at least as many elements as there are unique values.

Function Definition

Given those rules, here's some code. This is most certainly not the only solution, and it may not perform well with significantly large datasets (I have not done extensive testing).

myfunc <- function(x, chunksize = 8) {
    numChunks <- ceiling(length(x) / chunksize)
    uniqx <- unique(x)
    lastChunkSize <- chunksize * (1 - numChunks) + length(x)
    ## check to see if it is mathematically possible
    if (length(uniqx) > chunksize)
        stop('more factors than can fit in one chunk')
    if (any(table(x) < numChunks))
        stop('not enough of at least one factor to cover all chunks')
    if (lastChunkSize < length(uniqx))
        stop('last chunk will not have all factors')
    ## actually arrange things in one feasible permutation
    ## (lapply keeps a list even if all levels occur equally often)
    allIndices <- lapply(uniqx, function(z) which(z == x))
    ## seed every chunk with one index of each unique value
    chunks <- lapply(1:numChunks, function(i) sapply(allIndices, `[`, i))
    ## everything not used above: drop the first numChunks indices of each value
    remainder <- unlist(lapply(allIndices, tail, n = -numChunks))
    ## spread the leftovers over the room remaining in each chunk
    remainderCut <- split(remainder,
                          ceiling(seq_along(remainder) / (chunksize - length(uniqx))))
    ## combine them all together, wary of empty lists
    finalIndices <- lapply(1:numChunks,
           function(i) {
               if (i <= length(remainderCut))
                   c(chunks[[i]], remainderCut[[i]])
               else
                   chunks[[i]]
           })
    x[unlist(finalIndices)]
}

Supporting Execution

In your offered data, you have 18 elements, requiring three chunks. Your data fails on two accounts: three of the values (5, 7, and 8) occur only twice, so the third chunk most certainly cannot contain all of them; and your last chunk will have only two elements, which cannot contain each of the four.

I'll augment your data to satisfy both misses, with:

dat3 <- c(4,7,5,7,8,4,4,4,4,4,4,7,4,4,8,8,5,5,5,5)

which will not work unadjusted, if for no other reason than that the last chunk would contain nothing but four 5's.
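For reference, splitting dat3 unadjusted into chunks of 8 shows this directly (a quick check, not part of the function):

split(dat3, ceiling(seq_along(dat3) / 8))
## $`1`
## [1] 4 7 5 7 8 4 4 4
## $`2`
## [1] 4 4 4 7 4 4 8 8
## $`3`
## [1] 5 5 5 5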

The solution:

myfunc(dat3, chunksize = 8)
##  [1] 4 7 5 8 4 4 4 4   4 7 5 8 4 4 5 5   4 7 5 8

(spaces were added to the output for easy inspection). Each chunk has 4, 7, 5, 8 as its first four elements, therefore all factors are covered in each chunk.
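A quick sanity check (my own verification, just a sketch) confirms that every chunk of the rearranged vector contains all of the levels:

out <- myfunc(dat3, chunksize = 8)
sapply(split(out, ceiling(seq_along(out) / 8)),
       function(ch) all(unique(dat3) %in% ch))
##    1    2    3
## TRUE TRUE TRUE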

Breakdown

Here is a quick walkthrough (using debug(myfunc)), assuming x = dat3 and chunksize = 8. Jumping down through the code:

## Browse[2]> uniqx
## [1] 4 7 5 8
## Browse[2]> allIndices
## [[1]]
## [1]  1  6  7  8  9 10 11 13 14
## [[2]]
## [1]  2  4 12
## [[3]]
## [1]  3 17 18 19 20
## [[4]]
## [1]  5 15 16

This shows the indices for each unique element. For example, there are 4's located at indices 1, 6, 7, etc.
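Outside of debug(), the same index list can be reproduced with, e.g.:

lapply(unique(dat3), function(z) which(dat3 == z))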

## Browse[2]> chunks
## [[1]]
## [1] 1 2 3 5
## [[2]]
## [1]  6  4 17 15
## [[3]]
## [1]  7 12 18 16

There are three chunks to be filled, and this list starts forming those chunks. In this example, we have placed indices 1, 2, 3, and 5 in the first chunk. Looking back at allIndices, you'll see that these represent the first instance of each of uniqx, so the first chunk now contains c(4, 7, 5, 8), as do the other two chunks.

At this point, we have satisfied the basic requirement that each unique element be found in every chunk. The rest of the code fills with the remaining elements.
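For instance, translating the first chunk's seed indices back into values (a quick check):

dat3[c(1, 2, 3, 5)]
## [1] 4 7 5 8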

## Browse[2]> remainder
## [1]  8  9 10 11 13 14 19 20

These are all indices that have so far not been added to the chunks.

## Browse[2]> remainderCut
## $`1`
## [1]  8  9 10 11
## $`2`
## [1] 13 14 19 20

Though we have three chunks, we only have two lists here. This is fine; we have nothing (and need nothing) to add to the last chunk. We then zip-merge these with chunks to form a list of index vectors. (Note: you might be tempted to try mapply(function(a, b) c(a, b), chunks, remainderCut), but notice that if remainderCut is not the same length as chunks, as we see here, then its values are recycled. Not acceptable. Try it.)
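To illustrate the recycling pitfall with a toy example (made-up lists, not the actual chunks; SIMPLIFY = FALSE is added only to keep list output):

chunksDemo <- list(1:2, 3:4, 5:6)
remainderDemo <- list(7:8, 9:10)
mapply(c, chunksDemo, remainderDemo, SIMPLIFY = FALSE)
## [[1]]
## [1] 1 2 7 8
## [[2]]
## [1]  3  4  9 10
## [[3]]
## [1] 5 6 7 8
## (plus a warning) -- remainderDemo[[1]] has been recycled into the third chunk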

## Browse[2]> finalIndices
## [[1]]
## [1]  1  2  3  5  8  9 10 11
## [[2]]
## [1]  6  4 17 15 13 14 19 20
## [[3]]
## [1]  7 12 18 16

Remember, each number represents an index into x (originally dat3). We then unlist this list of index vectors and use it to reorder the data.

r2evans
  • very detailed and helpful. Could you let me know whether your function would also work in cases in which not every value can be present in every chunk? I did not make this too clear, but the point is to get the optimal shuffling of the data, so augmenting the data so that every level is present in every chunk is not possible. Rather: if a level is present only twice and occurs in rows 1 and 2 (and there are 3 chunks overall), then at least rearrange so that it occurs in the maximum number of chunks possible (two, in this case). – Hein Jan 10 '15 at 08:05
  • the input I am using is a factor column from a data frame. However, if I try to run your function with it, it just returns all the different levels present in that column. – Hein Jan 10 '15 at 08:13
  • Have you tried `myfunc(mydf$column, 8)`? Look, we really cannot help you much with the dearth of information you've provided. If you address the issues in the comments, perhaps we can do something more. – r2evans Jan 10 '15 at 16:09
  • Some pointers: never start a new question to rephrase another; always include the library and function in question (I'm guessing `biglm::bigglm`), otherwise we can't help much at all; include a small piece of your data or representative data for the problem, in this case **provide** us a sample data.frame. – r2evans Jan 11 '15 at 03:36
  • With the recent insight, I think what you need to learn about is the `factor` function. When a data.frame is read and has a column of factors, you can define what all of the levels are regardless if all levels are all present. Try `df <- data.frame(aa = sample(letters, size = 5)) ; df$aa ; df$aa <- factor(df$aa, levels = letters); df$aa`. All that this requires for your application is to know all of the *possible* levels, even if you only read in a small portion at a time. K? – r2evans Jan 11 '15 at 03:40
  • the problem is not that I wanted to know which levels are present, but rather that I need the order of the specific levels in the factor changed so that the bigglm package will not crash when reading them in chunkwise. Your approach is well thought through. The problem is just that I would need a function that works even if not every level can be in every chunk. – Hein Jan 11 '15 at 03:52
  • I just updated the question with the link to the specific column in question. This is a predictor in my regression model. There are 98 levels present in 50k observations and they would need to be spread out as much as possible. The size of the chunks should be as small as possible. So your approach goes in the right direction but works only if the assumptions you set out above are fulfilled, which is not the case in my data. Could you maybe slightly modify it? – Hein Jan 11 '15 at 03:56