There are several ways to do it, I'll provide one method using base functions. (An alternative would be to use dplyr
, also well suited for this. However, the base example should be simple enough.)
Generate Data
(This is here merely because we don't have any of your data.)
n <- 10
for (ii in 1:3) {
dat <- runif(n)
writeLines(paste(dat, collapse = ','),
con = sprintf('user2062207-file%s.txt', ii))
}
readLines('user2062207-file1.txt')
## [1] "0.929472318384796,0.921938128070906,0.707776406314224,0.236701443558559,0.271322417538613,0.388766387710348,0.422867075540125,0.324589917669073,0.92406965768896,0.171326051233336"
Read the Data
This is where you'll start, assuming you have a simple pattern for finding the files.
fnames <- list.files(pattern = 'user2062207-file.*.txt')
allData <- unlist(sapply(fnames, read.table, sep = ','))
allRange <- range(allData)
df <- data.frame(x = allData)
head(df)
## x
## 1 0.9294723
## 2 0.9219381
## 3 0.7077764
## 4 0.2367014
## 5 0.2713224
## 6 0.3887664
dim(df)
## [1] 30 1
Set the Bins
The {floor,ceiling} +/- binSize
below is because the bins include only one side of the range (default: right side), so the minimum value(s) will not be binned. It also ensures the bins are on rounded boundaries.
binSize <- 0.05
allBins <- seq(floor(allRange[1] / binSize) * binSize,
ceiling(allRange[2] / binSize) * binSize,
by = binSize)
## bin the data
df$bin <- cut(df$x, breaks = allBins)
head(df)
## x bin
## 1 0.9294723 (0.9,0.95]
## 2 0.9219381 (0.9,0.95]
## 3 0.7077764 (0.7,0.75]
## 4 0.2367014 (0.2,0.25]
## 5 0.2713224 (0.25,0.3]
## 6 0.3887664 (0.35,0.4]
Statistics on Each Bin
sapply(levels(df$bin), function(lvl) median(df$x[df$bin == lvl], na.rm = TRUE))
## (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35]
## 0.03802277 NA 0.11528715 0.18195392 0.22918094 0.27132242 0.33626971
## (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7]
## 0.38009637 0.42184059 NA 0.53826028 0.57820253 0.64165116 0.67825992
## (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9] (0.9,0.95] (0.95,1]
## 0.74243926 NA 0.80759621 0.88974267 0.92406966 0.95691077
This is an area where numerous other options could be advantageous. For instance, the base function by
can work, though dealing with its data structure is not always intuitive even if the function call itself is easy to read:
head(by(df$x, df$bin, median, na.rm = TRUE))
## df$bin
## (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3]
## 0.03802277 NA 0.11528715 0.18195392 0.22918094 0.27132242
You could also use dplyr
with ease. This example starts with the original allData
and allBins
:
library(dplyr)
data.frame(x = allData) %>%
mutate(bin = cut(x, breaks = allBins)) %>%
group_by(bin) %>%
summarise(median(x))
## Source: local data frame [17 x 2]
## bin median(x)
## 1 (0,0.05] 0.03802277
## 2 (0.1,0.15] 0.11528715
## 3 (0.15,0.2] 0.18195392
## 4 (0.2,0.25] 0.22918094
## 5 (0.25,0.3] 0.27132242
#### ..snip..
The first example preserves empty bins whereas the other methods are not aware of empty bins. There are possibly other ways of using by
and dplyr
that would include these empty bins, but this seems to suffice.
EDIT
After a bit of chat, it was determined that the range of the data was too wide with a bin width of 0.0005. A better solution was devised. (No sample data to provide, sorry, not mine to give ...) I'll use random data to mimic the process:
set.seed(42)
x <- 5e7 * runif(5e5)
library(dplyr)
binSize <- 0.0005
df <- data.frame(dat = sort(x))
df$bin <- floor(df$dat / binSize) * binSize
head(df)
## dat bin
## 1 410.9577 410.9575
## 2 456.6275 456.6270
## 3 552.3674 552.3670
## 4 875.4898 875.4895
## 5 1018.6806 1018.6805
## 6 1102.2436 1102.2435
system.time(results <- df %>% group_by(bin) %>% summarize(med = median(dat)))
## user system elapsed
## 12.08 0.00 12.11
head(results)
## Source: local data frame [6 x 2]
## bin med
## 1 410.9575 410.9577
## 2 456.6270 456.6275
## 3 552.3670 552.3674
## 4 875.4895 875.4898
## 5 1018.6805 1018.6806
## 6 1102.2435 1102.2436