I am trying to make a data frame in R that holds FASTA headers and sequences. The code below does that, but now I would like to create columns in my df from the information in the FASTA headers.

Here is the content of the header that I would like to use to make columns in my df. Ideally each piece of information between brackets ([]) would be a column. The main thing I need is the location as a column.

lcl|FR839628.1_cds_CCA36173.1_1 [locus_tag=PP7435_CHR1-0001] [db_xref=EnsemblGenomes-Gn:PP7435_Chr1-0001,EnsemblGenomes-Tr:CCA36173,UniProtKB/TrEMBL:F2QL95] [protein=Hypothetical_protein] [protein_id=CCA36173.1] [location=5023..6504] [gbkey=CDS]

Thanks for your help!

I tried this and it worked for making a df, but now I want to make columns from df$seq_name:

    library("Biostrings")
    fastaFile <- readDNAStringSet("my.fasta")
    seq_name = names(fastaFile)
    sequence = paste(fastaFile)
    df <- data.frame(seq_name, sequence)

I tried this string split command, but I am not sure how to use it in a way that saves the output into columns of the df.

    string = df$seq_name
    strsplit(string,split='[', fixed=TRUE)
callmcg
    Does this answer your question? [https://stackoverflow.com/questions/4350440/split-data-frame-string-column-into-multiple-columns](https://stackoverflow.com/questions/4350440/split-data-frame-string-column-into-multiple-columns) – Cloudberry Aug 13 '23 at 18:07

1 Answer

You could try with tidyverse. You might need to modify this depending on which pieces of information you're trying to extract, but I think it should look something like this:

    library(tidyverse)  # loaded as in the original answer; the code below only uses base R
    fasta_lines <- readLines("your_file.fasta")

    # Empty vectors to store the extracted information
    sequence_headers <- vector("character")
    sequence_lengths <- vector("integer")
    current_header <- NULL
    current_sequence <- NULL

    # Loop through each line in the FASTA file
    for (line in fasta_lines) {
      if (startsWith(line, ">")) {
        # ">" marks a header line; use whatever symbol denotes a header in your file.
        # Before starting a new record, store the previous header and sequence length.
        if (!is.null(current_header)) {
          sequence_headers <- append(sequence_headers, current_header)
          sequence_lengths <- append(sequence_lengths, nchar(current_sequence))
        }

        # Update the current header and reset the sequence
        current_header <- substring(line, 2)  # Remove ">"
        current_sequence <- ""
      } else {
        # If it's not a header line, then it's a sequence line.
        # paste0() joins without a separator, so the length is not inflated by spaces.
        current_sequence <- paste0(current_sequence, line)
      }
    }

    # Store the last sequence info
    sequence_headers <- append(sequence_headers, current_header)
    sequence_lengths <- append(sequence_lengths, nchar(current_sequence))

    # Create a df
    fasta_df <- data.frame(Header = sequence_headers, SequenceLength = sequence_lengths)
    print(fasta_df)
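
To get each bracketed piece of the header (including the location) into its own column, something along these lines might work. This is only a sketch in base R: `parse_header_fields()` is a made-up helper name, not a function from Biostrings or any other package, and it assumes every field has the form `[key=value]` and that no value contains a `]`. The example header is the one from the question.

```r
# Example header copied from the question (leading ">" already removed)
headers <- c(
  "lcl|FR839628.1_cds_CCA36173.1_1 [locus_tag=PP7435_CHR1-0001] [db_xref=EnsemblGenomes-Gn:PP7435_Chr1-0001,EnsemblGenomes-Tr:CCA36173,UniProtKB/TrEMBL:F2QL95] [protein=Hypothetical_protein] [protein_id=CCA36173.1] [location=5023..6504] [gbkey=CDS]"
)

# Hypothetical helper (an assumption, not a library function):
# returns a named character vector, one element per [key=value] field.
parse_header_fields <- function(header) {
  # Pull out every "[...]" chunk from the header
  fields <- regmatches(header, gregexpr("\\[[^]]+\\]", header))[[1]]
  # Drop the surrounding brackets
  fields <- substr(fields, 2, nchar(fields) - 1)
  # Split each field at its first "=" into key and value
  eq <- as.integer(regexpr("=", fields, fixed = TRUE))
  keys <- substr(fields, 1, eq - 1)
  values <- substr(fields, eq + 1, nchar(fields))
  setNames(values, keys)
}

parsed <- lapply(headers, parse_header_fields)

# Use the union of all keys so headers that lack a field get NA in that column
all_keys <- unique(unlist(lapply(parsed, names)))
field_df <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[all_keys]))),
  stringsAsFactors = FALSE
)
names(field_df) <- all_keys

print(field_df$location)
```

With the data frame from the question, the same steps could be run on `df$seq_name` instead of `headers`, and the result attached with `cbind(df, field_df)`. The location stays a string such as `5023..6504`; splitting it into start and end would need extra handling if some headers use forms like `complement(...)`.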