0

I have a dataframe of many rows with only one column, the column having strings of variable lengths, ranging from 30000 to 200000 characters(DNA sequence). [Below is a sample of 150 characters]

TTCCCCAAACAGCAACTTTAAGGAGCAGCTTCCTTTATGATCCCTGATTGCCTCCCCTTTGTTCCCATAACAAGTAGTTTAAATTTTCTGTTAAAGTCCAAACCACATATTTACAATACCTCGCACC

Here is the full dataset: https://drive.google.com/open?id=1f9prtKW5NnS-BLI5lqsl4FEi4PvRfxGR

I have a code in R, which divides each row into 20 bins depending on its length, and counts the occurrence of G's and C's for each bin, and gives me back a matrix of 20 columns. Here is the code:

library(data.table)
data <- fread("string.fa", header = F)

loopchar <- function(data){ bins <- sapply(seq(1, nchar(data), nchar(data)/20), function(x) substr(data, x, x + nchar(data)/20 - 1))output <- (str_count(bins, c("G"))/nchar(bins) + str_count(bins, c("C"))/nchar(bins))*100}

result <- data.frame(t(apply(data,1,loopchar)))

However, now I want to do something different. Instead of nchar(data)/20, I want the substring segments (20) to vary from a list I have. So now for my data frame, the first row should be divided into 22 bins/segments, and the code would be nchar(data)/22.

The second row should be divided into 21 bins, and the code would be nchar(data)/21, and so on. I want the function to keep changing the number of bins for the data. Both my data dataframe with strings and vector list of numbers with bins are of the same length.

What is the best way to do this?

rishi
  • 267
  • 1
  • 9
  • 2
    See: How to create a Minimal, Complete, and Verifiable example. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 – IceCreamToucan Nov 01 '18 at 12:50
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Nov 01 '18 at 15:00
  • @MrFlick I completely understand, however giving a reproducible example for this exact data and problem for me seems a bit problematic for a few reasons. I apologize in advance. If you can still help me I'd really appreciate it. The full dataset is on the link :) – rishi Nov 02 '18 at 13:08
  • @MrFlick if you could continue this discussion on chat, I can be more clear – rishi Nov 02 '18 at 13:10
  • 1
    @rishi Do you have your data in *fasta* format (meaning with headers etc.) or just as a table, you gave link to? – utubun Nov 02 '18 at 13:56
  • @utubun a table with just one column, each row is a fasta string – rishi Nov 05 '18 at 08:57
  • I've looked into your data, and it looks like there are no fasta headers there. I mean if it have been in format `>seq_name\nAACCT...GGCT` it would be much easier to work with. Does the original file contains DNA seqs with fasta headers? – utubun Nov 05 '18 at 09:43
  • @utubun no unfortunately it doesn't :( I extracted these fasta strings from bedtools and I didn't need headers because I just wanted to calculate the GC content – rishi Nov 05 '18 at 09:50
  • @rishi see my answer, hope it would help – utubun Nov 05 '18 at 12:26

1 Answers1

1

It's more natural to use some of the Bioconductor's libraries for such tasks. In my case I use Biostrings, but maybe you could find another way.

Data

Your file is too big, so I have created a text file (in memory), which contains random DNA for each line:

# set seed to create reproducible example

set.seed(53101614)

# create an example text file in memory

temp <- tempfile()
writeLines(
  sapply(1:100, function(i){
    paste(sample(c("A", "T", "C", "G"), sample(100:6000), 
                 replace = T), collapse = "")
  }),
  con = temp
)

# read lines from tmp file

dna <- readLines(temp)

# unlink file

unlink(temp)

Data preprocessing

Creating Biostrings::DNAStringSet object

Using Biostrings::DNAStringSet() function we can read character vector to create DNAStringSet object. Note that I assume that all the records are in standard DNA alphabet i.e. each string contains only A, T, C, G symbols. If it does not hold in your case, refer to Biostrings documentation.

dna <- DNAStringSet(dna, use.names = F)

# inspect the output

dna

A DNAStringSet instance of length 100
      width seq
  [1]  2235 GGGCTTCCGGTGGTTGTAGGCCCATAAGGTGGGAAATATACA...GAAACGTCGACAAGATACAAACGAGTGGTCAACAGGCCAGCC
  [2]  1507 ATGCGGTCTATCTACTTGTTCGGCCGAACCTTGAGGGCAGCC...AACGCTTTGTACCTGTCCCAGAGTCAGAAGTAACAGTTTAGC
  [3]  1462 CATTGGAGTACATAGGGTATTCCCTCTCGTTGTATAACTCCA...TCCTACTTGCGAAGGCAGTCGCACACAAGGGTCTATTTCGTC
  [4]  1440 ATGCTACGTTGGTAGGGTAACGCAGACTAGAACCACACGGGA...ATAAAGCCGTCACAAGGAATGTTAGCACTCAATGGCTCGCTA
  [5]  3976 AAGCGGAAGTACACGTACCCGCGTAGATTACGTATAGTCGCC...TTACGCGTTGCTCAAATCGTTCGGTGCAGTTTTATAGTGATG
  ...   ... ...
 [96]  4924 AGTAAGCAGATCCAGAGTACTGTGAAAGACGTCAGATCCCGA...TATAATGGGTTGCGTGTTTGATTCTGCCATGAATCCTATGTT
 [97]  5702 CCTGAAGAGGACGTTTCCCCCTACATCCAGTAGTATTGGTGT...TCTGCTTTGCGCGGCGGGGCCGGACTGTCCATGGCTCACTTG
 [98]  5603 GCGGCTGATTATTGCCCGTCTGCCTGCATGATCGAGCAGAAC...CTCTTTACATGCTCATAGGAATCGGCAACGAAGGAGAGAGTC
 [99]  3775 GGCAAGACGGTCAGATGTTTTGATGTCCGGGCGGATATCCTT...CGCTGCCCGTGACAATAGTTATCATAAGGAGACCTGGATGGT
[100]   407 TGTCGCAACCTCTCTTGCACGTCCAATTCCCCGACGGTTCTA...GCGACATTCCGGAGTCTGCGCAGCCTATGTATACCCTACAGA

Create the vector of random N numbers of bins

set.seed(53101614)

k <- sample(100, 100, replace = T)

# inspect the output

head(k)

[1] 37 32 63 76 19 41

Create Views object were each DNA sequence represented by N = k[i] chunks

It is much easier to solve your problem using IRanges::Views container. This thing is furiously fast and beautiful.

First of all we divide each DNA sequenced into k[i] ranges:

seqviews <- lapply(seq_along(dna), function(i){

  seq = dna[[i]]
  seq_length = length(seq)

  starts = seq(1, seq_length - seq_length %% k[i], seq_length %/% k[i])

  Views(seq, start = starts, end = c(starts[-1] - 1, seq_length))
  }
)

# inspect the output for k[2] and seqviews[2]

k[2]
seqviews[2]

32

Views on a 1507-letter DNAString subject subject: ATGCGGTCTATCTACTTG...GTCAGAAGTAACAGTTTAG
    views:
         start  end width
     [1]     1   47    47 [ATGCGGTCTATCTACTTGTTCGGCCGAACCTTGAGGGCAGCCAGCTA]
     [2]    48   94    47 [ACCGCCGGAGACCTGAGTCCACCACACCCATTCGATCTCCATGGTTG]
     [3]    95  141    47 [GCGCTCTCCGAGGTGCCACGTCAAGTTGTACTACTCTCTCAGACCTC]
     [4]   142  188    47 [TTGTTAGAAGTCCCGAGGTATATGCGCAATACCTCAACCGAAGCGCC]
     [5]   189  235    47 [TGATGAGCAAACGTTTCTTATAGTCGCGACCTTGTCCCGAGGACTTG]
     ...   ...  ...   ... ...
    [28]  1270 1316    47 [AGGCGAGGGCAGGGCACATGTTTCTACAGTGAGGCGTGATCCGCTCC]
    [29]  1317 1363    47 [GAGGCAAGCTCGTGAACTGTCGTGGCAAGTTACTTATGAGGATGTCA]
    [30]  1364 1410    47 [TGGGCAGATGCAACAGACTGCTATTGGCGGGAGAGAGGCATCGACAT]
    [31]  1411 1457    47 [ACCGTCTCAAGTACCACAGCTGAGAGGCTCTCGTGGAGATGCGCACA]
    [32]  1458 1507    50 [TGAGTCGTAACGCTTTGTACCTGTCCCAGAGTCAGAAGTAACAGTTTAGC]

After that, we check if all sequences have been divided to desired number of chunks:

all(sapply(seq_along(k), function(i) k[i] == length(seqviews[[i]])))

[1] TRUE

Important observation

Before we proceed, there is one important observation about your function.

Your function produces N chunks with variable length (because the indices it produces are floats but not integers, so substr() when you call it, rounds provided indices to the nearest integer.

As an example, extracting 1st record from the dna set, and splitting this sequence into 37 bins using your code will produce following results:

dna_1 <- as.character(dna[[1]])

sprintf("DNA#1: %d bp long, 37 chunks", nchar(dna_1))
[1] "DNA#1: 2235 bp long, 37 chunks"

bins <- sapply(seq(1, nchar(dna_1), nchar(dna_1)/37), 
               function(x){
                 substr(dna_1, x, x + nchar(dna_1)/37 - 1)
                 }
               )

bins_length <- sapply(bins, nchar)

barplot(table(bins_length), 
        xlab = "Bin's length", 
        ylab = "Count", 
        main = "Bin's length variability"
)

bin's length variability 1

The approach I use in my code, while length(dna[[i]]) %% k[i] != 0 (reminder), produces k[i] - 1 bins of equal lengths, and only the last bin has its length equal to length(dna[i]) %/% k[i] + length(dna[[i]] %% k[i]:

bins_length <- sapply(seqviews, length)

barplot(table(bins_length), 
        xlab = "Bin's length", 
        ylab = "Count", 
        main = "Bin's length variability"
)

variability of bins's length 2

GC content calculation

As it is mentioned above, Biostrings::letterFrequency() applied to IRanges::Views allows you to calculate GC content easily:

Find the GC frequency for each bin in every DNA sequence

GC <- lapply(seqviews, letterFrequency, letters = "GC", as.prob = TRUE)
Convert to percents
GC <- lapply(GC, "*", 100)
Inspect the output
head(GC[[1]])
          G|C
[1,] 53.33333
[2,] 46.66667
[3,] 50.00000
[4,] 55.00000
[5,] 60.00000
[6,] 45.00000

Plot GC content for DNAs 1:9

par(mfrow = c(3, 3))

invisible(
  lapply(1:9, function(i){
    plot(GC[[i]], 
         type = "l", 
         main = sprintf("DNA #%d, %d bp, %d bins", i, length(dna[[i]]), k[i]),
         xlab = "N bins",
         ylab = "GC content, %",
         ylim = c(0, 100)
    )
    abline(h = 50, lty = 2, col = "red")
  }
  )
)

enter image description here

utubun
  • 4,400
  • 1
  • 14
  • 17