So I've got a slight problem with binning values contained in multiple text files into set ranges. I've had a look online for various packages and came across sm, which can bin values and lets you specify the breaks as well, like so:

xb <- binning(x, breaks=seq(-4,4,by=0.5))

but I have a few problems with implementing this. I can't specify an upper limit, as I don't know what the upper limit is (the files contain thousands of values obtained from an instrument); I have 25 files, each of which contains thousands of values, that I need to bin all at the same time (I then need to take the median of the values in each bin); and I'm having trouble just reading in the text files to begin with. I perform:

read.table("foobar.txt", sep=",")

to read just one text file, since all the values are separated by a ',', but it doesn't seem to be able to handle this. I also want to bin the values with a set width of 0.0005 (so, for example, values are binned into 200.0000 - 200.0005, 200.0005 - 200.0010, and so on).
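Would something like scan() be better suited to reading one long line of comma-separated numbers? A sketch of what I mean (foobar.txt is just a placeholder name):

## read a single long comma-separated line into a numeric vector
vals <- scan("foobar.txt", what = numeric(), sep = ",")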

The text files I'm using are .txt files with thousands of comma-separated values. The idea is that I bin the values into groups within a set range and then take the median of the values in each group to represent that particular bin. For example, if I had the values 1,1,2,3,3,4,5,5,6,7,7,9,10 and set the bin width to 2, then the first bin would contain 1,1,2, the second 3,3,4, the third 5,5,6, and so on, and the median of each bin's values would represent it: the first bin has a median of 1, the second 3, the third 5. (I know that in this example taking the median may seem pointless, but with the data I have it makes sense.)
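To make that concrete, here is a small sketch of the kind of thing I'm after (I don't know whether cut() and tapply() are the right tools, but they show the idea):

x <- c(1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 9, 10)
b <- cut(x, breaks = seq(0, 10, by = 2))  # bins of width 2: (0,2], (2,4], ...
tapply(x, b, median)                      # median of the values in each bin
##  (0,2]  (2,4]  (4,6]  (6,8] (8,10]
##    1.0    3.0    5.0    7.0    9.5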

Is there a way of reading in multiple text files of values and processing them all simultaneously in the manner I described? Is there a package for something like this whose manual I can look at? Any suggestions or tips would be greatly appreciated!

  • FYI, there is `?cut` in base R for binning values – talat Feb 17 '15 at 16:41
  • Apologies, I should have mentioned that I want to bin values within a set range (for what I'm doing I want values binned in widths of 0.0005, so 200.0000 - 200.0005, etc.) – user2062207 Feb 17 '15 at 16:48
  • I'll edit the text above to incorporate all this, but the files are all .txt files, as I have too many values for .csv, and they're in the form x1, x2, x3 (so, like, 104, 204, 223.3432) and so on, thousands of times. Basically my aim is to bin all the values into groups, and once all values are binned within a given range, to take the median of the original non-binned values that fall into each bin, if that makes sense. So, for example, if I have 1, 2, 4, 5, 10 and I bin them with a range of 2 (so 1 and 2 are together, then 4 and 5, then 10), the median is taken of each group – user2062207 Feb 17 '15 at 16:54

1 Answer

There are several ways to do it; I'll provide one method using base functions. (An alternative would be to use dplyr, which is also well suited for this, but the base example should be simple enough.)

Generate Data

(This is here merely because we don't have any of your data.)

n <- 10
for (ii in 1:3) {
    dat <- runif(n)
    writeLines(paste(dat, collapse = ','),
               con = sprintf('user2062207-file%s.txt', ii))
}
readLines('user2062207-file1.txt')
## [1] "0.929472318384796,0.921938128070906,0.707776406314224,0.236701443558559,0.271322417538613,0.388766387710348,0.422867075540125,0.324589917669073,0.92406965768896,0.171326051233336"

Read the Data

This is where you'll start, assuming you have a simple pattern for finding the files.

fnames <- list.files(pattern = 'user2062207-file.*.txt')
allData <- unlist(sapply(fnames, read.table, sep = ','))
allRange <- range(allData)
df <- data.frame(x = allData)
head(df)
##           x
## 1 0.9294723
## 2 0.9219381
## 3 0.7077764
## 4 0.2367014
## 5 0.2713224
## 6 0.3887664
dim(df)
## [1] 30  1

Set the Bins

The floor/ceiling rounding below puts the breaks on rounded binSize boundaries and extends them to cover the whole range of the data. This matters because the bins include only one side of each interval (default: the right side), so a lowest break above the minimum would leave the minimum value(s) unbinned.

binSize <- 0.05
allBins <- seq(floor(allRange[1] / binSize) * binSize,
               ceiling(allRange[2] / binSize) * binSize,
               by = binSize)
## bin the data
df$bin <- cut(df$x, breaks = allBins)
head(df)
##           x        bin
## 1 0.9294723 (0.9,0.95]
## 2 0.9219381 (0.9,0.95]
## 3 0.7077764 (0.7,0.75]
## 4 0.2367014 (0.2,0.25]
## 5 0.2713224 (0.25,0.3]
## 6 0.3887664 (0.35,0.4]

Statistics on Each Bin

sapply(levels(df$bin), function(lvl) median(df$x[df$bin == lvl], na.rm = TRUE))
##   (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35] 
## 0.03802277         NA 0.11528715 0.18195392 0.22918094 0.27132242 0.33626971 
## (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] 
## 0.38009637 0.42184059         NA 0.53826028 0.57820253 0.64165116 0.67825992 
## (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9] (0.9,0.95]   (0.95,1] 
## 0.74243926         NA 0.80759621 0.88974267 0.92406966 0.95691077 

This is an area where numerous other options could be advantageous. For instance, the base function by can work, though dealing with its data structure is not always intuitive even if the function call itself is easy to read:

head(by(df$x, df$bin, median, na.rm = TRUE))
## df$bin
##   (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] 
## 0.03802277         NA 0.11528715 0.18195392 0.22918094 0.27132242 
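If you prefer a plain data.frame over the `by` object, a simple coercion works (a sketch; since `by` is built on `tapply`, the empty bins come through as NA):

medBy <- by(df$x, df$bin, median, na.rm = TRUE)
## one row per bin level, NA where the bin is empty
data.frame(bin = levels(df$bin), median = as.numeric(medBy))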

You could also use dplyr with ease. This example starts with the original allData and allBins:

library(dplyr)
data.frame(x = allData) %>%
    mutate(bin = cut(x, breaks = allBins)) %>%
    group_by(bin) %>%
    summarise(median(x))
## Source: local data frame [17 x 2]
##           bin  median(x)
## 1    (0,0.05] 0.03802277
## 2  (0.1,0.15] 0.11528715
## 3  (0.15,0.2] 0.18195392
## 4  (0.2,0.25] 0.22918094
## 5  (0.25,0.3] 0.27132242
## ..snip..

The first example preserves empty bins, and `by` reports them as NA, but the dplyr version drops empty bins entirely (17 rows in its output against 20 bins). There are ways of making dplyr include the empty bins as well; one is sketched below.
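For instance, with a recent dplyr (0.8 or later, which postdates this answer, so treat this as an assumption), group_by() accepts .drop = FALSE to keep empty factor levels as groups:

library(dplyr)
data.frame(x = allData) %>%
    mutate(bin = cut(x, breaks = allBins)) %>%
    group_by(bin, .drop = FALSE) %>%  # keep empty bins as (zero-row) groups
    summarise(median = median(x))     # empty bins come out as NA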

EDIT

After a bit of chat, it was determined that the range of the data was far too wide for a bin width of 0.0005: seq would need on the order of 10^11 breaks, hence the "'by' argument is much too small" error. A better solution was devised: compute each value's bin directly with floor rather than building the full vector of breaks. (No sample data to provide, sorry, not mine to give ...) I'll use random data to mimic the process:

set.seed(42)
x <- 5e7 * runif(5e5)

library(dplyr)
binSize <- 0.0005
df <- data.frame(dat = sort(x))
df$bin <- floor(df$dat / binSize) * binSize
head(df)
##         dat       bin
## 1  410.9577  410.9575
## 2  456.6275  456.6270
## 3  552.3674  552.3670
## 4  875.4898  875.4895
## 5 1018.6806 1018.6805
## 6 1102.2436 1102.2435
system.time(results <- df %>% group_by(bin) %>% summarize(med = median(dat)))
##    user  system elapsed 
##   12.08    0.00   12.11 
head(results)
## Source: local data frame [6 x 2]
##         bin       med
## 1  410.9575  410.9577
## 2  456.6270  456.6275
## 3  552.3670  552.3674
## 4  875.4895  875.4898
## 5 1018.6805 1018.6806
## 6 1102.2435 1102.2436
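If the ~12 seconds is a concern, data.table (assuming you're open to another package) typically does grouped summaries considerably faster; a minimal sketch of the same computation:

library(data.table)
dt <- as.data.table(df)                           # same df as above
results2 <- dt[, .(med = median(dat)), by = bin]  # grouped median per bin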
  • Awesome I'll try implementing this now and will get back to you asap but thank you so much for this! – user2062207 Feb 17 '15 at 17:40
  • Apologies for taking time to get back to you about this; the files I have are very large and it's still processing the `allData <- unlist(sapply(fnames, read.table, sep = ','))` part – user2062207 Feb 17 '15 at 18:13
  • I'm not surprised, files with large amounts of data on a single line can be problematic. If the files are that large, perhaps you should consider an alternative way of reading them in (in lieu of `read.table`). I don't have enough experience in any one to make a recommendation, but there are several previous questions on the topic (e.g., [1727772](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r) and [11782084](http://stackoverflow.com/questions/11782084/reading-in-large-text-files-in-r)). – r2evans Feb 17 '15 at 18:18
  • Sorry, I got through the loading of the data, but I'm having problems with binning it. I need all values binned in widths of 0.0005, so I set binSize to that value, but I get the following error from the allBins part: `Error in seq.default(floor(allRange[1]/binSize) * binSize, ceiling(allRange[2]/binSize) * : 'by' argument is much too small` – user2062207 Feb 17 '15 at 22:23
  • What are the values of `allRange`? That should only ever happen when the number of breaks gets ridiculously large (more than 2,147,483,647 on Windows). – r2evans Feb 17 '15 at 22:32
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/71119/discussion-between-r2evans-and-user2062207). – r2evans Feb 17 '15 at 22:45