Count overlapping occurrences of a substring in a very large file using Bash

Question

I have files on the order of a few dozen gigabytes (genome data) on which I need to find the number of occurrences for a substring. While the answers I've seen here use grep -o then wc -l, this seems like a hacky way that might not work for the very large files I need to work with.

Does the grep -o/wc -l method scale well for large files? If not, how else would I go about doing it?

For example,

aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt
111
    222
     333
            444
             555
              666

must return 6 occurrences for aaa. (Except there are maybe 10 million more lines of this.)

Have you tried those commands on any of your files yet? Could you please add to your post how long it would take for for a file with what size? — fxdeaway, Apr 11 '20 at 10:49
You probably want to use awk instead - https://stackoverflow.com/questions/53222347/grep-with-multi-threading — marblewraith, Apr 11 '20 at 11:38

Walter A · Answer 1 · 2020-04-11T17:16:58.920

Find 6 overlapping substrings aaa in the string

line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"

You don't want to see the strings, you want to count them. When you try

# wrong
grep -o -F "aaa" <<< "${line}" | wc -l

you are missing the overlapping strings.
With the substring aaa you have 5 hits in aaaaaaa, so how handle ${line}?
Start with

grep -Eo "a{3,}" <<< "${line}"

Result

aaa
aaaa
aaaaa

Hom many hits do we have? 1 for aaa, 2 for aaaa and 3 for aaaaa. Compare the total count of characters with the number of lines (wc):

 match lines chars  add_to_total
   aaa     1     4      1
  aaaa     1     5      2
 aaaaa     1     6      3

For each line substract 3 from the total count of characters for that line.
When the result has 3 lines and 15 characters, calculate

15 characters - (3 lines * 3 characters) = 15 - 9 = 6

In code:

read -r lines chars < <(grep -Eo "a{3,}" <<< "${line}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"

Or for a file

read -r lines chars < <(grep -Eo "a{3,}" "${file}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"

aaa was "easy", how about other searchstrings?
I think you have to look for the substring and think of a formula that works for that substring. abcdefghi will have no overlapping strings, but abcdabc might.
Potential matches with abcdabc are

abcdabc
abcdabcdabc
abcdabcdabcdabc

Use testline

line="abcdabcdabcdabc something else abcdabcdabcdabc no match here abcdabc and abcdabcdabc"

you need "abc(dabc)+" and have

          match   lines    chars  add_to_total
abcdabcdabcdabc       1      16    3
abcdabcdabcdabc       1      16    3
        abcdabc       1       8    1
    abcdabcdabc       1      12    2

For each line substract 4 from the total count of characters and divide the answer by 4. Or (characters/4) - nr_line. When the result has 4 lines and 52 characters, calculate

(52 characters / fixed 4) / 4 lines = 13 - 4 = 9

In code:

read -r lines chars < <(grep -Eo "abc(dabc)+" <<< "${line}" | wc -lc)
echo "Substring count: $(( chars / 4 - lines))"

When you have a large file, you might want to split it first.

I was gonna post `grep -Eo 'AAA{1,}' file | wc -lc | awk '{print $1 + $2 - $1 * 4}'` This takes 8 seconds to process a ~1GB (10M rows, 100 cols) file on an *AMD A9-9410 RADEON R5 2C+3G (2) @ 2.900GHz* with GNU grep, GNU wc and GNU awk — oguz ismail, Apr 11 '20 at 17:32
`$1 + $2 - $1 * 4 = $2 - $1 * 3`, so you found the same solution as I did. Nice job, I didn't expect that. You tested the performance, that seems good news for OP. Thanks. — Walter A, Apr 11 '20 at 18:17
I don't understand why you're using `read <(grep -o | wc -lc); echo $(( ))`. will it be faster than just `grep -o | awk` ? — Fravadona, Jun 27 '22 at 11:49

score 0 · Answer 2 · answered Jun 26 '22 at 17:27

I suppose there are 2 approaches to this (both methods report 29/6 for the 2 test lines):

Use the summation method :

# WHINY_USERS=1 is a shell param for mawk-1 to pre-sort array

${input……} | WHINY_USERS=1 {m,g}awk '
 BEGIN {

 1      FS = "[^a]+(aa?[^a]+)*"
 1      OFS = "|"
 1      PROCINFO["sorted_in"] = "@ind_str_asc"
 } {
 2        _ = ""
 2      OFS = "|"
 2      gsub("^[|]*|[|]*$",_, $!(NF=NF))

 2      split(_,__)
        split($-_,___,"[|]+")

12      for (_ in ___) {
12          __[___[_]]++
        }
 2      _____=____=_<_
 2      OFS = "\t"
 2      print " -- line # "(NR)
 7      for (_ in __) {
 7          print sprintf(" %20s",_), __[_], \
                      ______=__[_] * (length(_)-2),\
                  "| "(____+=__[_]), _____+=______
        }
        print "" }'

|

 -- line # 1
                  aaa   3   3   | 3 3
                 aaaa   2   4   | 5 7
                aaaaa   3   9   | 8 16
      aaaaaaaaaaaaaaa   1   13  | 9 29

 -- line # 2
                  aaa   1   1   | 1 1
                 aaaa   1   2   | 2 3
                aaaaa   1   3   | 3 6

Print out all the copies of that substring :

{m,g}awk' {
2   printf("%s%.*s",____=$(_=_<_),_, NF=NF)

9   do { _+=gsub(__,_____)

    } while(index($+__,__))

2   if(_) {
2       ____=substr(____,-_<_,_)
2       gsub(".", (":")__, ____)

2       print "}-[(# " (_) ")]--;\f\b" substr(____, 2)

   } else { print "" } }' FS='[^a]+(aa?[^a]+)*' OFS='|' __='aaa' _____='aa'

|

aaagtcgaaaaagtccatgcaaataaaagtcgaaaaagtccatgcatatgatactttttttttt
tttttttaaagtcgaaaaagaaaaaaaaaaaaaaatataaaatccatgc}-[(# 29)]--;
                   
         aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:
         aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa

aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt}-[(# 6)]--;
         aaa:aaa:aaa:aaa:aaa:aaa

Count overlapping occurrences of a substring *in a very large file* using Bash

2 Answers2

Count overlapping occurrences of a substring in a very large file using Bash