-1

I have files on the order of a few dozen gigabytes (genome data) on which I need to find the number of occurrences for a substring. While the answers I've seen here use grep -o then wc -l, this seems like a hacky way that might not work for the very large files I need to work with.

Does the grep -o/wc -l method scale well for large files? If not, how else would I go about doing it?

For example,

aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt
111
    222
     333
            444
             555
              666 

must return 6 occurrences for aaa. (Except there are maybe 10 million more lines of this.)

Walter A
  • 19,067
  • 2
  • 23
  • 43
user3932000
  • 671
  • 8
  • 24

2 Answers2

2

Find 6 overlapping substrings aaa in the string

line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"

You don't want to see the strings, you want to count them. When you try

# wrong
grep -o -F "aaa" <<< "${line}" | wc -l

you are missing the overlapping strings.
With the substring aaa you have 5 hits in aaaaaaa, so how handle ${line}?
Start with

grep -Eo "a{3,}" <<< "${line}"

Result

aaa
aaaa
aaaaa

Hom many hits do we have? 1 for aaa, 2 for aaaa and 3 for aaaaa. Compare the total count of characters with the number of lines (wc):

 match lines chars  add_to_total
   aaa     1     4      1
  aaaa     1     5      2
 aaaaa     1     6      3

For each line substract 3 from the total count of characters for that line.
When the result has 3 lines and 15 characters, calculate

15 characters - (3 lines * 3 characters) = 15 - 9 = 6

In code:

read -r lines chars < <(grep -Eo "a{3,}" <<< "${line}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"

Or for a file

read -r lines chars < <(grep -Eo "a{3,}" "${file}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"

aaa was "easy", how about other searchstrings?
I think you have to look for the substring and think of a formula that works for that substring. abcdefghi will have no overlapping strings, but abcdabc might.
Potential matches with abcdabc are

abcdabc
abcdabcdabc
abcdabcdabcdabc

Use testline

line="abcdabcdabcdabc something else abcdabcdabcdabc no match here abcdabc and abcdabcdabc"

you need "abc(dabc)+" and have

          match   lines    chars  add_to_total
abcdabcdabcdabc       1      16    3
abcdabcdabcdabc       1      16    3
        abcdabc       1       8    1
    abcdabcdabc       1      12    2

For each line substract 4 from the total count of characters and divide the answer by 4. Or (characters/4) - nr_line. When the result has 4 lines and 52 characters, calculate

(52 characters / fixed 4) / 4 lines = 13 - 4 = 9

In code:

read -r lines chars < <(grep -Eo "abc(dabc)+" <<< "${line}" | wc -lc)
echo "Substring count: $(( chars / 4 - lines))"

When you have a large file, you might want to split it first.

Walter A
  • 19,067
  • 2
  • 23
  • 43
  • I was gonna post `grep -Eo 'AAA{1,}' file | wc -lc | awk '{print $1 + $2 - $1 * 4}'` This takes 8 seconds to process a ~1GB (10M rows, 100 cols) file on an *AMD A9-9410 RADEON R5 2C+3G (2) @ 2.900GHz* with GNU grep, GNU wc and GNU awk – oguz ismail Apr 11 '20 at 17:32
  • 1
    `$1 + $2 - $1 * 4 = $2 - $1 * 3`, so you found the same solution as I did. Nice job, I didn't expect that. You tested the performance, that seems good news for OP. Thanks. – Walter A Apr 11 '20 at 18:17
  • 1
    Oh yeah, I always sucked at maths – oguz ismail Apr 11 '20 at 18:18
  • I don't understand why you're using `read <(grep -o | wc -lc); echo $(( ))`. will it be faster than just `grep -o | awk` ? – Fravadona Jun 27 '22 at 11:49
  • @Fravadona I had no reason for skipping `awk`. – Walter A Jun 28 '22 at 07:35
0

I suppose there are 2 approaches to this (both methods report 29/6 for the 2 test lines):

  1. Use the summation method :

    # WHINY_USERS=1 is a shell param for mawk-1 to pre-sort array
    
    ${input……} | WHINY_USERS=1 {m,g}awk '
     BEGIN {
    
     1      FS = "[^a]+(aa?[^a]+)*"
     1      OFS = "|"
     1      PROCINFO["sorted_in"] = "@ind_str_asc"
     } {
     2        _ = ""
     2      OFS = "|"
     2      gsub("^[|]*|[|]*$",_, $!(NF=NF))
    
     2      split(_,__)
            split($-_,___,"[|]+")
    
    12      for (_ in ___) {
    12          __[___[_]]++
            }
     2      _____=____=_<_
     2      OFS = "\t"
     2      print " -- line # "(NR)
     7      for (_ in __) {
     7          print sprintf(" %20s",_), __[_], \
                          ______=__[_] * (length(_)-2),\
                      "| "(____+=__[_]), _____+=______
            }
            print "" }'
    

|

 -- line # 1
                  aaa   3   3   | 3 3
                 aaaa   2   4   | 5 7
                aaaaa   3   9   | 8 16
      aaaaaaaaaaaaaaa   1   13  | 9 29

 -- line # 2
                  aaa   1   1   | 1 1
                 aaaa   1   2   | 2 3
                aaaaa   1   3   | 3 6
  1. Print out all the copies of that substring :

    {m,g}awk' {
    2   printf("%s%.*s",____=$(_=_<_),_, NF=NF)
    
    9   do { _+=gsub(__,_____)
    
        } while(index($+__,__))
    
    2   if(_) {
    2       ____=substr(____,-_<_,_)
    2       gsub(".", (":")__, ____)
    
    2       print "}-[(# " (_) ")]--;\f\b" substr(____, 2)
    
       } else { print "" } }' FS='[^a]+(aa?[^a]+)*' OFS='|' __='aaa' _____='aa'  
    

|

aaagtcgaaaaagtccatgcaaataaaagtcgaaaaagtccatgcatatgatactttttttttt
tttttttaaagtcgaaaaagaaaaaaaaaaaaaaatataaaatccatgc}-[(# 29)]--;
                   
         aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:
         aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa

aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt}-[(# 6)]--;
         aaa:aaa:aaa:aaa:aaa:aaa
RARE Kpop Manifesto
  • 2,453
  • 3
  • 11