0

I have a file, file.txt, of DNA sequences, where each line is a DNA sequence. The first 5 lines look like this:

GACAGAGGGTGCAAACGTTGTTCGGAATTACTGGGCGTAAAGCGCGTGTAGGCGGCCATGCAAGTCGGATGTGAAAGCCCTCGGCTCAACCGGGGAAGTGCACTCGAAACTGCAAGGCTAGAGTCTCGGAGAGGATCGTGGAATTCTCGGTGTAGAGGTGAAATTCGTAGATATCGAGAGGAACACCGGTGGCGAAGGCGGCGATCTGGACGATGACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGG
TACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTCTCGTAAGCTGGGTGTGAAAGCCCCGGGCTTAACCTGGGAATGGCATTCAGGACTGCGAGGCTCGAGTGTGGCAGAGGGAGGTGGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAACACCGATGGCGAAGGCAGCCTCCTGGGCCAGCACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG
GACGTGTGAGGCAAGCGTTATTCGTCATTAATGGGTTTAAAGGGTACGTAGGCGGAATACTTTATTATGTTTAAGAAGACACTTAAAAGTGAACATGATAATAAAATTCTAGAGTTTGAAAGGAGTAAACAATTACCTCGAGAGTAAGGGACAACTAATACGGAAATACTTGGGGGGATTCTAAGCGGCGAAAGCATGTTACTATTGAAAACTGACGCTGAGGTACGAAGGCTTGGGTATCGACTGGG
TACGAAGGGTGCAAACGTTGCTCGGAATTATTGGGCGTAAAGCGCATGTAGGCGGCTTAGCAAGTCGGATGTGAAATCCCTCGGCTCAACCAAGGAAGTGCATCCGAAACTGCTGAGCTTGAGTACGAAAGAGGATCGCGGAATTCCCGGTGTAGAGGTGAAATTCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCGATCTGGGTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
AACGTAGGAGACAAACGTTATCCGGAGTTACTGGGCGTAAAGGGCGTGTAGGTGGTTGCGTAAGTCTGGCGTGAAATTTTTCGGCTTAACCGGGAAAGGTCGTCGGATACTGCGTAGCTAGAGGACGGTAGAGGCGTGTGGAATTCCGGGGGTAGTGGTGAAATGCGTAGAGATCCGGAGGAACACCAGTGGCGAAGGCGACACGCTGGGCCGTACCTGACACTGATGCGCGACAGCATGGGGAGCAAACACT

The actual file contains tens of thousands of lines. I would like to identify all the unique sequences in this file (or the unique lines) and the number of times each sequence (or line) was observed in the file. Ideally, this would be returned in R as a one-column matrix where the entries are the sequence abundances and the rownames are the unique sequences.

Alternatively, this could be written out to a .csv file where the first line is a comma-separated string of the unique sequences (lines) and the second line is a comma-separated string of the number of times each sequence occurs in the file.

Second, this file is large-ish (~5 MB), but there are many files like it. Downstream I will have to merge many of these vectors together. How can I generate this vector while minimizing memory usage?

zx8754
  • 52,746
  • 12
  • 114
  • 209
colin
  • 2,606
  • 4
  • 27
  • 57
  • 1
    Possible duplicate of [Count number of rows within each group](https://stackoverflow.com/questions/9809166/count-number-of-rows-within-each-group) ? – zx8754 May 09 '18 at 20:27
  • @zx8754 here come the stackoverflow police that keep people from wanting to use this site. This question is asking to write to a file on disk. Furthermore, it needs to perform this operation based on a file on the disk, without loading that file into memory. The question you reference does neither of these things. It's R code that requires everything already be loaded in memory and cannot generalize to multiple files. – colin May 09 '18 at 21:58
  • 2
    Ugh, it says “possible” with a question mark. Post **is** tagged with R. There **are** packages that do not clog the memory. Did you try using table(myvector) ? No need to have all files at once in R, read one, get counts, output, remove object, etc. – zx8754 May 09 '18 at 22:18
  • @zx8754 Hi! sorry if this is not the best place to ask this question, but since the accepted answer is not related to R, the tags in the question should be edited? thanks. – tk3 May 10 '18 at 11:06
  • 1
    @tk3 I think tag should stay, it is relevant. – zx8754 May 10 '18 at 11:09
  • 1
    @zx8754 I don't see this as a direct duplicate of the question you linked. There are probably some closer duplicates in the `bash`/`linux` tags or somewhere else on this site. At this point it sounds like if @colin still wanted some additional answers he could make some edits stating clearly that the desired solution doesn't _need_ to be `r` and that it needs to work with files that won't fit in memory. It could make sense to add some related tags like `bash`, `linux` etc., but I agree that `R` should stay in case someone can propose a solution in @colin's preferred language. – Matt Summersgill May 10 '18 at 11:30
  • @zx8754 this is why everyone hates using this site and there is a problem with new members being scared away from this community. Just let people ask questions! – colin May 10 '18 at 23:39
  • Could you point exactly where I am/was out of line? – zx8754 May 11 '18 at 07:02
  • 3
    @colin suggesting duplicates isn't meant to hurt you, rather the opposite. Usually, the dupe target already has answers and it is linked to other similar questions that were closed as dupes too. Hence, you don't need to wait. It is also SO policy and this is the reason that SO gave zx8754 the power to close this question single-handedly if he wanted to, so please don't make a scene of it. And a last thing, SO is a Q/A site that is meant to serve everyone, it is not your private assistant. The idea of closing as a dupe is that **future readers** could find as many relevant answers as possible. – David Arenburg May 11 '18 at 07:22
  • @DavidArenburg fair enough. However, there is a difference between how the feedback is intended and how it feels, and I really think this is at the heart of the current disagreement with the approachability of this website. apologies to zx8754 for dragging them into a comment fight. – colin May 11 '18 at 17:45

4 Answers

3

EDIT

I didn't know Unix answers were allowed. So here are two alternatives to the sort | uniq answers, assuming your files are in the same folder and named myFile_1.txt, myFile_2.txt, ..., myFile_n.txt.

The best in my tests (700k lines, 160 MB):

perl -ne '$count{$_}++; END { print "$count{$_} $_" for sort {$count{$b} <=> $count{$a} || $b cmp $a} keys %count}' myFile*.txt > output.txt

A more detailed explanation can be found here.

And an alternative that differs in that it does not have to actually sort the file (but if you have too many distinct keys, it will use more memory).

cat myFile*.txt | awk '{ cnts[$0] += 1 } END { for (v in cnts) print cnts[v], v }' > output.txt

A more detailed explanation can be found here.

Previous R answer

You can put your data in a vector structure like this:

data <- c("GACAGAG", "TACGTAGG", "AACGTAGG", "GACGTGTG", "TACGAAGG", "AACGTAGG")
ans <- table(data)
ans["AACGTAGG"]

5 MB fits in memory, so I guess it will work. However, if you have data that does not fit in memory, you will have to process the file line by line (a sketch of that approach follows) or use a solution like SparkR.
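
For the line-by-line route, here is a minimal base-R sketch (not part of the original answer): it reads the file from an open connection in chunks with readLines and accumulates counts in a named integer vector, so only one chunk is held in memory at a time. The function name count_file and the 100,000-line chunk size are arbitrary choices.

count_file <- function(path, chunk_size = 100000L) {
  counts <- integer(0)                        # named integer vector: names are sequences, values are counts
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    lines <- readLines(con, n = chunk_size)   # next chunk; chunk_size is a tunable assumption
    if (length(lines) == 0) break
    tab <- table(lines)                       # counts within this chunk
    new <- setdiff(names(tab), names(counts))
    counts[new] <- 0L                         # register sequences not seen in earlier chunks
    counts[names(tab)] <- counts[names(tab)] + as.integer(tab)
  }
  counts
}

# counts <- count_file("file.txt")
# m <- as.matrix(counts)                      # one-column matrix; rownames are the unique sequences

The commented lines show one way to get the one-column matrix described in the question; the same function can be run on each file in turn and the resulting vectors merged downstream.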

Hope it helps :)

tk3
  • 990
  • 1
  • 13
  • 18
2

Are you working on a Unix system? (This answer won't work on Windows out of the box)

I created a file named testtext.txt with the content as follows:

c
a
b
a
b
b
b
c

Then executing the following command in the terminal

sort testtext.txt | uniq -c > testcounts.txt

generates a file, testcounts.txt with the content below.

2 a
4 b
2 c

I can't speak to how this will perform relative to other solutions, but it seems to be worth a shot.

You could also do it simultaneously across all files matching a pattern in the current directory - I made three - testtext.txt, testtext2.txt, and testtext3.txt

find . -type f -name 'testtext*' | xargs sort | uniq -c > Counts.txt

which then creates the file Counts.txt:

10 a
 6 b
 5 c
 3 d
 1 e
 1 f

Alternatively (and particularly if memory usage is of concern), you could put the single-file example in a for loop in a simple bash script to handle files one at a time. Either way, the Unix command line tools are shockingly efficient when used elegantly.

Credit: Unix.StackExchange: Sort and Count Number of Occurrence of Lines

Matt Summersgill
  • 4,054
  • 18
  • 47
  • No need for `cat`. – zx8754 May 10 '18 at 08:22
  • 1
    Thanks @zx8754 , updated to reflect. I mentioned that _"the Unix command line tools are shockingly efficient when used elegantly"_, but perhaps I should have also mentioned that it can take a lot of exposure/ reading to understand how to use them in an "elegant" way -- clearly I'm no exception. – Matt Summersgill May 10 '18 at 11:46
2

Using just base R commands, compare the results for 500,000 rows of data:

Here is our test file, 500K rows, 122MB.

wc -l myFile.txt
# 500000 myFile.txt

ls -lh myFile.txt
# xxx xxx xxx xxx 122M May 10 09:05 myFile.txt

Using sort | uniq:

time sort myFile.txt | uniq -c > myFileCounts1.txt

# real    0m7.317s
# user    0m12.998s
# sys     0m0.228s

Using R, table (from related post):

system.time(write.table(table(readLines("myFile.txt")), "myFileCounts2.txt",
                        col.names = FALSE, row.names = FALSE , quote = FALSE))

#  user  system elapsed
# 3.028   0.100   3.142
zx8754
  • 52,746
  • 12
  • 114
  • 209
1

You should use a HashMap where the sequences are the keys and the values are Integers counting the number of appearances.

In Java, reading lines until EOF, the algorithm would look like this (imports and the surrounding class are omitted; the file name is a placeholder):

Map<String, Integer> map = new HashMap<>();

try (BufferedReader reader = new BufferedReader(new FileReader("file.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        Integer appearances = map.get(line);   // null if this line has not been seen yet
        if (appearances == null)
            map.put(line, 1);
        else
            map.put(line, appearances + 1);
    }
}

Then you can access all the lines you have seen and their counts by consulting the map's key set and calling get, or just by iterating over its entry set.

Regarding efficiency, you probably can't get a more efficient method than using a dictionary in this way.

Jorge.V
  • 1,329
  • 1
  • 13
  • 19