
I have a .txt file (converted from a FASTA file) that contains just a particular region I'm interested in analyzing. It looks like this:

CTGGCCGCGCTGACTCCTCTCGCT

CTCGCAGCACTGACTCCTCTTGCG

CTAGCCGCTCTGACTCCGCTAGCG

CTCGCTGCCCTCACACCTCTTGCA

CTCGCAGCACTGACTCCTCTTGCG

CTCGCAGCACTAACACCCCTAGCT

CTCGCTGCTCTGACTCCTCTCGCC

CTGGCCGCGCTGACTCCTCTCGCT

I am currently using Excel to perform some calculations on the nucleotide diversity at each position. Some of the files have around 200,000 reads, which makes the Excel files unwieldy. I figure there must be an easier way to do this using Python or R.

Basically I want to take the .txt file with the list of sequences and measure nucleotide diversity at each position using the Shannon entropy equation −p·log2(p), summed over the nucleotides seen at that position. Does anyone know how this might be done in a way besides Excel?

Thanks so much in advance for any help.

Chris_Rands
  • In Python check at readlines() and count() functions – abichat Jul 24 '17 at 13:15
    Please read [(1)](http://stackoverflow.com/help/how-to-ask) how do I ask a good question, [(2)](http://stackoverflow.com/help/mcve) How to create a MCVE as well as [(3)](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#answer-5963610) how to provide a minimal reproducible example in R. Then edit and improve your question accordingly. I.e., abstract from your real problem... And please `dput()` your data with expected results. – Christoph Jul 24 '17 at 13:22
    https://stackoverflow.com/questions/37909873/how-to-calculate-the-entropy-of-a-dna-sequence-in-a-fasta-file – Ben Bolker Jul 24 '17 at 14:22

2 Answers


If you can work from the fasta file, that might be better, as there are packages specifically designed to work with that format.

Here I give a solution in R, using the package seqinr together with dplyr (part of the tidyverse) for manipulating the data.

If this were your fasta file (based on your sequences):

>seq1
CTGGCCGCGCTGACTCCTCTCGCT
>seq2
CTCGCAGCACTGACTCCTCTTGCG
>seq3
CTAGCCGCTCTGACTCCGCTAGCG
>seq4
CTCGCTGCCCTCACACCTCTTGCA
>seq5
CTCGCAGCACTGACTCCTCTTGCG
>seq6
CTCGCAGCACTAACACCCCTAGCT
>seq7
CTCGCTGCTCTGACTCCTCTCGCC
>seq8
CTGGCCGCGCTGACTCCTCTCGCT

You can read it into R using the seqinr package:

# Load the packages
library(tidyverse) # I use this package for manipulating data.frames later on
library(seqinr)

# Read the fasta file - use the path relevant for you
seqs <- read.fasta("~/path/to/your/file/example_fasta.fa")

This returns a list object, which contains as many elements as there are sequences in your file.

For your particular question - calculating diversity metrics for each position -
we can use two useful functions from the seqinr package:

  • getFrag() to subset the sequences
  • count() to calculate the frequency of each nucleotide

For example, if we wanted the nucleotide frequencies for the first position of our sequences, we could do:

# Get position 1
pos1 <- getFrag(seqs, begin = 1, end = 1)

# Calculate frequency of each nucleotide
count(pos1, wordsize = 1, freq = TRUE)

a c g t 
0 1 0 0 

Showing us that the first position only contains a "C".

Below is a way to programmatically "loop" through all positions and do the calculations we might be interested in:

# Obtain fragment length - assuming all sequences are the same length!
l <- length(seqs[[1]])

# Use the `lapply` function to estimate frequency for each position
p <- lapply(1:l, function(i, seqs){
  # Obtain the nucleotide for the current position
  pos_seq <- getFrag(seqs, i, i)

  # Get the frequency of each nucleotide
  pos_freq <- count(pos_seq, 1, freq = TRUE)

  # Convert to data.frame, rename variables more sensibly
  ## and add information about the nucleotide position
  pos_freq <- pos_freq %>% 
    as.data.frame() %>%
    rename(nuc = Var1, freq = Freq) %>% 
    mutate(pos = i)
}, seqs = seqs)

# The output of the above is a list.
## We now bind all tables to a single data.frame
## Remove nucleotides with zero frequency
## And estimate entropy and expected heterozygosity for each position
diversity <- p %>% 
  bind_rows() %>% 
  filter(freq > 0) %>% 
  group_by(pos) %>% 
  summarise(shannon_entropy = -sum(freq * log2(freq)),
            het = 1 - sum(freq^2), 
            n_nuc = n())

The output of these calculations now looks like this:

head(diversity)

# A tibble: 6 x 4
    pos shannon_entropy     het n_nuc
  <int>           <dbl>   <dbl> <int>
1     1        0.000000 0.00000     1
2     2        0.000000 0.00000     1
3     3        1.298795 0.53125     3
4     4        0.000000 0.00000     1
5     5        0.000000 0.00000     1
6     6        1.561278 0.65625     3

And here is a more visual view of it (using ggplot2, also part of tidyverse package):

ggplot(diversity, aes(pos, shannon_entropy)) + 
  geom_line() +
  geom_point(aes(colour = factor(n_nuc))) +
  labs(x = "Position (bp)", y = "Shannon Entropy", 
       colour = "Number of\nnucleotides")

[Plot: Shannon entropy at each position, with points coloured by the number of nucleotides observed]

Update:

To apply this to several fasta files, here's one possibility (I did not test this code, but something like this should work):

# Find all the fasta files of interest
## use a pattern that matches the file extension of your files
fasta_files <- list.files("~/path/to/your/fasta/directory", 
                          pattern = ".fa", full.names = TRUE)

# Use lapply to apply the code above to each file
my_diversities <- lapply(fasta_files, function(f){
  # Read the fasta file
  seqs <- read.fasta(f)

  # Obtain fragment length - assuming all sequences are the same length!
  l <- length(seqs[[1]])

  # .... ETC - Copy the code above until ....
  diversity <- p %>% 
    bind_rows() %>% 
    filter(freq > 0) %>% 
    group_by(pos) %>% 
    summarise(shannon_entropy = -sum(freq * log2(freq)),
              het = 1 - sum(freq^2), 
              n_nuc = n())
})

# The output is a list of tables. 
## You can then bind them together, 
## ensuring the name of the file is added as a new column "file_name"

names(my_diversities) <- basename(fasta_files) # name the list elements
my_diversities <- bind_rows(my_diversities, .id = "file_name") # bind tables

This will give you a table of diversities for each file. You can then use ggplot2 to visualise it, similarly to what I did above, but perhaps using facets to separate the diversity from each file into different panels.

hugot
  • Hugo, that was an amazing answer. It worked amazingly. – James Weger Jul 26 '17 at 22:20
  • To expand on that, I would like to do this with hundreds of separate fasta files. Is there a way that I can concatenate all of these files and run this same analysis for all at one time? – James Weger Jul 26 '17 at 22:28
  • @JamesWeger If you concatenate all the files, then the diversity measures will be based on all sequences together (is that what you want? Or do you want the diversity to be calculated for each fasta file separately?). If indeed you want to contatenate all fasta files, read them with `lapply` like exemplified [here](https://stackoverflow.com/questions/14958516/looping-through-all-files-in-directory-in-r-applying-multiple-commands) and then combine them with `do.call("c", your_list_of_fastas)`. Then you will have a single object with all sequences and you can run the code in the answer. – hugot Jul 31 '17 at 10:32
  • sorry, I want them separate, I guess I want to run a loop. I should be able to use lapply for this as well? – James Weger Jul 31 '17 at 15:47
  • @JamesWeger see updated answer. Also, please consider accepting the answer as correct. I'm glad this was helpful, but don't forget to be [more specific with your question](https://stackoverflow.com/help/how-to-ask) next time. – hugot Aug 02 '17 at 08:49

You can open and read your file:

plist = []
with open('test.txt', 'r') as infile:
    for line in infile:
        # make the calculation of 'p' for each line here
        plist.append(p)

And then use your plist to calculate your entropy.
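To flesh this out, here is one way the whole calculation could look in Python (a sketch, assuming all sequences have the same length and that blank lines should be skipped; the sequences below are the ones from the question, so the values can be checked against the R output above):

```python
import math
from collections import Counter

def position_entropies(seqs):
    """Shannon entropy -sum(p * log2(p)) at each position,
    where p is the frequency of each nucleotide at that position."""
    entropies = []
    for i in range(len(seqs[0])):  # assumes equal-length sequences
        counts = Counter(seq[i] for seq in seqs)
        total = sum(counts.values())
        entropies.append(-sum((n / total) * math.log2(n / total)
                              for n in counts.values()))
    return entropies

# Reading from the txt file would look like:
# with open('test.txt') as infile:
#     seqs = [line.strip() for line in infile if line.strip()]
seqs = [
    "CTGGCCGCGCTGACTCCTCTCGCT",
    "CTCGCAGCACTGACTCCTCTTGCG",
    "CTAGCCGCTCTGACTCCGCTAGCG",
    "CTCGCTGCCCTCACACCTCTTGCA",
    "CTCGCAGCACTGACTCCTCTTGCG",
    "CTCGCAGCACTAACACCCCTAGCT",
    "CTCGCTGCTCTGACTCCTCTCGCC",
    "CTGGCCGCGCTGACTCCTCTCGCT",
]
for pos, h in enumerate(position_entropies(seqs), start=1):
    print(pos, round(h, 4))
```

For these sequences, position 3 gives 1.2988 and position 6 gives 1.5613, matching the shannon_entropy values in the R answer above.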

Dadep