I have a file, file.txt
of DNA sequences, where each line is a DNA sequence. The first 5 lines look like this:
GACAGAGGGTGCAAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGTGTAGGCGGCCATGCAAGTCGGATGTGAAAGCCCTCGGCTCAACCGGGGAAGTGCACTCGAAACTGCAAGGCTAGAGTCTCGGAGAGGATCGTGGAATTCTCGGTGTAGAGGTGAAATTCGTAGATATCGAGAGGAACACCGGTGGCGAAGGCGGCGATCTGGACGATGACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGG
TACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTCTCGTAAGCTGGGTGTGAAAGCCCCGGGCTTAACCTGGGAATGGCATTCAGGACTGCGAGGCTCGAGTGTGGCAGAGGGAGGTGGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAACACCGATGGCGAAGGCAGCCTCCTGGGCCAGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG
GACGTGTGAGGCAAGCGTTATTCGTCATTAATGGGTTTAAAGGGTACGTAGGCGGAATACTTTATTATGTTTAAGAAGACACTTAAAAGTGAACATGATAATAAAATTCTAGAGTTTGAAAGGAGTAAACAATTACCTCGAGAGTAAGGGACAACTAATACGGAAATACTTGGGGGGATTCTAAGCGGCGAAAGCATGTTACTATTGAAAACTGACGCTGAGGTACGAAGGCTTGGGTATCGACTGGG
TACGAAGGGTGCAAACGTTGCTCGGAATTATTGGGCGTAAAGCGCATGTAGGCGGCTTAGCAAGTCGGATGTGAAATCCCTCGGCTCAACCAAGGAAGTGCATCCGAAACTGCTGAGCTTGAGTACGAAAGAGGATCGCGGAATTCCCGGTGTAGAGGTGAAATTCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCGATCTGGGTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
AACGTAGGAGACAAACGTTATCCGGAGTTACTGGGCGTAAAGGGCGTGTAGGTGGTTGCGTAAGTCTGGCGTGAAATTTTTCGGCTTAACCGGGAAAGGTCGTCGGATACTGCGTAGCTAGAGGACGGTAGAGGCGTGTGGAATTCCGGGGGTAGTGGTGAAATGCGTAGAGATCCGGAGGAACACCAGTGGCGAAGGCGACACGCTGGGCCGTACCTGACACTGATGCGCGACAGCATGGGGAGCAAACACT
The actual file contains tens of thousands of lines. I would like to identify all the unique sequences in this file (or the unique lines) and the number of times each sequence (or line) was observed in the file. Ideally this would be returned as a matrix with one column in R where the entries are the sequence abundances and the rownames are the unique sequences.
Alternatively, this could write out to a .csv
file where the first line is a comma separated string of the unique sequences (lines) and the second line is a comma separated string of the number of times each sequence occurs in the file.
Second, this file is large-ish (~5 MB), but there are many files like it. Downstream I will have to merge many of these vectors together. How can I generate this vector while minimizing memory usage?