-2

I have been given a shotgun genome sequence, which can be found here:

https://www.ncbi.nlm.nih.gov/nuccore/NZ_LRPF01000001

This sequence is 205,000 letters long. Some of it consists of CDS (coding sequences), but most of it is non-coding and therefore not important.

For example, the first coding region spans positions 343 to 780 and the second spans 937 to 1866. This means that there are non-coding regions from 1 to 342, from 781 to 936, and so on.

I have been asked to perform some analysis on this sequence, and I would like to have one FASTA file made of the coding sequences and another made of the non-coding sequences.

I know how to cut this file into two vectors manually in R, but there are 187 coding regions that I would need to locate and cut correctly by hand. Is there an R function or algorithm that will detect the coding and non-coding regions and group them separately?

Perhaps there is a way to do it on the NCBI website?

EDIT: Could someone at least explain why I am getting downvoted?

Scavenger23

1 Answer

1

Maybe this post will be useful for you: "Extracting the last n characters from a string in R".

Thinking about it, what I would do in R (although I am sure other people can propose more optimised alternatives) is this: first create two data frames, one with the start and end coordinates of all the exon (coding) features and another with those of the introns (non-coding regions). Then apply stri_sub from the stringi package, or any of the other functions shown in that post, after adjusting the code. A simple for loop over the rows of each data frame should then do the trick, so you do not have to cut each region by hand.
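The same idea can be sketched outside R as well. Below is a minimal Python sketch, not a definitive implementation: it assumes the sequence is already loaded as a plain string and that the coding coordinates are 1-based and inclusive, as in the question (343 to 780, 937 to 1866). The function names (`noncoding_intervals`, `extract`, `to_fasta`) and the toy sequence are made up for illustration, and strand is ignored, so features on the reverse strand would still need to be reverse-complemented.

```python
def noncoding_intervals(cds, seq_len):
    """Return the gaps between coding intervals (1-based, inclusive)."""
    gaps = []
    prev_end = 0
    for start, end in sorted(cds):
        if start > prev_end + 1:
            gaps.append((prev_end + 1, start - 1))
        # max() keeps the logic correct if two coding regions overlap
        prev_end = max(prev_end, end)
    if prev_end < seq_len:
        gaps.append((prev_end + 1, seq_len))
    return gaps


def extract(seq, intervals):
    """Cut each 1-based inclusive interval out of the sequence string."""
    return [seq[start - 1:end] for start, end in intervals]


def to_fasta(name, pieces, width=70):
    """Format the pieces as FASTA text, one record per piece."""
    lines = []
    for i, piece in enumerate(pieces, 1):
        lines.append(f">{name}_{i}")
        lines.extend(piece[j:j + width] for j in range(0, len(piece), width))
    return "\n".join(lines) + "\n"


# Toy stand-in for the real sequence (hypothetical, 2000 letters long)
seq = "ACGT" * 500
# The two coding regions mentioned in the question
cds = [(343, 780), (937, 1866)]

gaps = noncoding_intervals(cds, len(seq))
coding_fasta = to_fasta("cds", extract(seq, cds))
noncoding_fasta = to_fasta("noncoding", extract(seq, gaps))
```

With the real data, `cds` would hold all 187 coordinate pairs taken from the annotation, and the two FASTA strings could be written to disk with an ordinary `open(..., "w")`.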

Or, if this sequence is available to download from the UCSC or Ensembl BioMart websites, another option would be: (a) from UCSC, use the Table Browser to download a BED file with the coordinates of the introns, exons and/or UTRs, and then use the bedtools getfasta function to get the FASTA sequences; (b) in Ensembl BioMart, you can get the exon and UTR FASTA sequences directly.
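If the coordinates are already at hand, a BED file for bedtools can also be written directly. One detail that is easy to get wrong is that BED uses 0-based, half-open coordinates, whereas the positions in the question are 1-based and inclusive. The following is a small Python sketch of that conversion; the helper name `to_bed_lines` is hypothetical, and the sequence name used here (the accession from the question's URL) is an assumption that must match the header of your FASTA file exactly.

```python
def to_bed_lines(chrom, intervals):
    """Convert 1-based inclusive intervals to BED lines (0-based, half-open)."""
    # Only the start shifts by one; the end stays the same number
    return [f"{chrom}\t{start - 1}\t{end}" for start, end in intervals]


# The two coding regions from the question; the name is an assumption
bed_lines = to_bed_lines("NZ_LRPF01000001", [(343, 780), (937, 1866)])
bed_text = "\n".join(bed_lines) + "\n"
```

After saving `bed_text` to a file, a call along the lines of `bedtools getfasta -fi genome.fa -bed cds.bed -fo cds.fa` should extract the sequences; the three file names are placeholders.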