It's more natural to use some of the Bioconductor's libraries for such tasks. In my case I use Biostrings
, but maybe you could find another way.
Data
Your file is too big, so I have created a text file (in memory), which contains random DNA for each line:
# set seed to create reproducible example
set.seed(53101614)
# create an example text file in memory
temp <- tempfile()
writeLines(
sapply(1:100, function(i){
paste(sample(c("A", "T", "C", "G"), sample(100:6000),
replace = T), collapse = "")
}),
con = temp
)
# read lines from tmp file
dna <- readLines(temp)
# unlink file
unlink(temp)
Data preprocessing
Creating Biostrings::DNAStringSet
object
Using Biostrings::DNAStringSet()
function we can read character
vector to create DNAStringSet
object. Note that I assume that all the records are in standard DNA alphabet i.e. each string contains only A, T, C, G
symbols. If it does not hold in your case, refer to Biostrings
documentation.
dna <- DNAStringSet(dna, use.names = F)
# inspect the output
dna
A DNAStringSet instance of length 100
width seq
[1] 2235 GGGCTTCCGGTGGTTGTAGGCCCATAAGGTGGGAAATATACA...GAAACGTCGACAAGATACAAACGAGTGGTCAACAGGCCAGCC
[2] 1507 ATGCGGTCTATCTACTTGTTCGGCCGAACCTTGAGGGCAGCC...AACGCTTTGTACCTGTCCCAGAGTCAGAAGTAACAGTTTAGC
[3] 1462 CATTGGAGTACATAGGGTATTCCCTCTCGTTGTATAACTCCA...TCCTACTTGCGAAGGCAGTCGCACACAAGGGTCTATTTCGTC
[4] 1440 ATGCTACGTTGGTAGGGTAACGCAGACTAGAACCACACGGGA...ATAAAGCCGTCACAAGGAATGTTAGCACTCAATGGCTCGCTA
[5] 3976 AAGCGGAAGTACACGTACCCGCGTAGATTACGTATAGTCGCC...TTACGCGTTGCTCAAATCGTTCGGTGCAGTTTTATAGTGATG
... ... ...
[96] 4924 AGTAAGCAGATCCAGAGTACTGTGAAAGACGTCAGATCCCGA...TATAATGGGTTGCGTGTTTGATTCTGCCATGAATCCTATGTT
[97] 5702 CCTGAAGAGGACGTTTCCCCCTACATCCAGTAGTATTGGTGT...TCTGCTTTGCGCGGCGGGGCCGGACTGTCCATGGCTCACTTG
[98] 5603 GCGGCTGATTATTGCCCGTCTGCCTGCATGATCGAGCAGAAC...CTCTTTACATGCTCATAGGAATCGGCAACGAAGGAGAGAGTC
[99] 3775 GGCAAGACGGTCAGATGTTTTGATGTCCGGGCGGATATCCTT...CGCTGCCCGTGACAATAGTTATCATAAGGAGACCTGGATGGT
[100] 407 TGTCGCAACCTCTCTTGCACGTCCAATTCCCCGACGGTTCTA...GCGACATTCCGGAGTCTGCGCAGCCTATGTATACCCTACAGA
Create the vector of random N numbers of bins
set.seed(53101614)
k <- sample(100, 100, replace = T)
# inspect the output
head(k)
[1] 37 32 63 76 19 41
Create Views
object were each DNA sequence represented by N = k[i]
chunks
It is much easier to solve your problem using IRanges::Views
container. This thing is furiously fast and beautiful.
First of all we divide each DNA sequenced into k[i]
ranges:
seqviews <- lapply(seq_along(dna), function(i){
seq = dna[[i]]
seq_length = length(seq)
starts = seq(1, seq_length - seq_length %% k[i], seq_length %/% k[i])
Views(seq, start = starts, end = c(starts[-1] - 1, seq_length))
}
)
# inspect the output for k[2] and seqviews[2]
k[2]
seqviews[2]
32
Views on a 1507-letter DNAString subject subject: ATGCGGTCTATCTACTTG...GTCAGAAGTAACAGTTTAG
views:
start end width
[1] 1 47 47 [ATGCGGTCTATCTACTTGTTCGGCCGAACCTTGAGGGCAGCCAGCTA]
[2] 48 94 47 [ACCGCCGGAGACCTGAGTCCACCACACCCATTCGATCTCCATGGTTG]
[3] 95 141 47 [GCGCTCTCCGAGGTGCCACGTCAAGTTGTACTACTCTCTCAGACCTC]
[4] 142 188 47 [TTGTTAGAAGTCCCGAGGTATATGCGCAATACCTCAACCGAAGCGCC]
[5] 189 235 47 [TGATGAGCAAACGTTTCTTATAGTCGCGACCTTGTCCCGAGGACTTG]
... ... ... ... ...
[28] 1270 1316 47 [AGGCGAGGGCAGGGCACATGTTTCTACAGTGAGGCGTGATCCGCTCC]
[29] 1317 1363 47 [GAGGCAAGCTCGTGAACTGTCGTGGCAAGTTACTTATGAGGATGTCA]
[30] 1364 1410 47 [TGGGCAGATGCAACAGACTGCTATTGGCGGGAGAGAGGCATCGACAT]
[31] 1411 1457 47 [ACCGTCTCAAGTACCACAGCTGAGAGGCTCTCGTGGAGATGCGCACA]
[32] 1458 1507 50 [TGAGTCGTAACGCTTTGTACCTGTCCCAGAGTCAGAAGTAACAGTTTAGC]
After that, we check if all sequences have been divided to desired number of chunks:
all(sapply(seq_along(k), function(i) k[i] == length(seqviews[[i]])))
[1] TRUE
Important observation
Before we proceed, there is one important observation about your function.
Your function produces N chunks with variable length (because the indices it produces are floats but not integers, so substr()
when you call it, rounds provided indices to the nearest integer.
As an example, extracting 1st record from the dna
set, and splitting this sequence into 37 bins using your code will produce following results:
dna_1 <- as.character(dna[[1]])
sprintf("DNA#1: %d bp long, 37 chunks", nchar(dna_1))
[1] "DNA#1: 2235 bp long, 37 chunks"
bins <- sapply(seq(1, nchar(dna_1), nchar(dna_1)/37),
function(x){
substr(dna_1, x, x + nchar(dna_1)/37 - 1)
}
)
bins_length <- sapply(bins, nchar)
barplot(table(bins_length),
xlab = "Bin's length",
ylab = "Count",
main = "Bin's length variability"
)

The approach I use in my code, while length(dna[[i]]) %% k[i] != 0
(reminder), produces k[i] - 1
bins of equal lengths, and only the last bin has its length equal to length(dna[i]) %/% k[i] + length(dna[[i]] %% k[i]
:
bins_length <- sapply(seqviews, length)
barplot(table(bins_length),
xlab = "Bin's length",
ylab = "Count",
main = "Bin's length variability"
)

GC content calculation
As it is mentioned above, Biostrings::letterFrequency()
applied to IRanges::Views
allows you to calculate GC content easily:
Find the GC frequency for each bin in every DNA sequence
GC <- lapply(seqviews, letterFrequency, letters = "GC", as.prob = TRUE)
Convert to percents
GC <- lapply(GC, "*", 100)
Inspect the output
head(GC[[1]])
G|C
[1,] 53.33333
[2,] 46.66667
[3,] 50.00000
[4,] 55.00000
[5,] 60.00000
[6,] 45.00000
Plot GC content for DNAs 1:9
par(mfrow = c(3, 3))
invisible(
lapply(1:9, function(i){
plot(GC[[i]],
type = "l",
main = sprintf("DNA #%d, %d bp, %d bins", i, length(dna[[i]]), k[i]),
xlab = "N bins",
ylab = "GC content, %",
ylim = c(0, 100)
)
abline(h = 50, lty = 2, col = "red")
}
)
)
