Count the number of times a substring appears in a file and place it in a new column

Question

Question:

I have 2 files, file 1 is a TSV (BED) file that has 23 base-pair sequences in column 7, for example:

1 779692 779715 Sample_3 + 1  ATGGTGCTTTGTTATGGCAGCTC
1 783462 783485 Sample_4 - 1  ATGAATAAGTCAAGTAAATGGAC

File 2 is a FASTA file (hg19.fasta) that looks like this. Although it breaks across the lines, this continous string of A,C,G, and T's reprsents a continous string (i.e. a chromsosome). This file is the entire human reference genome build 19, so the two > headers followed by sequences essentially occurs 23 times for each of the 23 chromosomes:

>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
CAGAATGTACAGGTTTGTTACACAGGTATACACCTGCCATGGTTGTTTGCTGCACCCATCAACTCACCATCTACATTAGG
TATTTCTCCTAACGTTATCCCTCATGAATAAGTCAAGTAAATGGAC


>2 dna:chromosome chromosome:GRCh37:1:1:2492300:1

AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG

I want to 1) Find out how many times each 23bp sequence appears in the second file without overlapping any others including sequences that break across the lines and 2) append this number to a new column next to the sequence so the new file looks like this:

Desired Output:

1 779692 779715 Sample_3  + 1  ATGGTGCTTTGTTATGGCAGCTC 1 
1 783462 783485 Sample_4 - 1  ATGAATAAGTCAAGTAAATGGAC 2

My attempt:

I imagine solving the first part will be some variation on grep, so far I've managed:

grep -o ATGGTGCTTTGTTATGGCAGCTC "$file_2" | grep -c ""

which gets the count of a specific sequence, but not each sequence in the column. I think appending the grep results will require awk and paste but I haven't gotten that far!

Any help is appreciated as always! =)

Updates and Edits:

The actual size of these files is relatively massive (30mb or ~500,000 lines for each tsv/BED file) and the FASTA file is the entire human reference genome build 19, which is ~60,000,000 lines. The perl solution proposed by @choroba works, but doesn't scale well to these sizes.
Unfortunately, because of the need to identify matches across the lines, the awk and bash/grep solutions memtnioned below won't work.
I want multiple non-overlapping hits in the same chromosome to count as the actual number of hits. I.e. If you search for a sequence and get 2 hits in a single chromosome and 1 in another chromosome, the total count should be 3.
Ted Lyngmo is very kindly helping me develop a solution in C++ that allows this to be run in a realistic timeframe, there's more detail on his post in this thread. And link to the Github for this is here =)

Oh that's a good point! Ideally, I wouldn't want overlap between the 23-bp sequences. — KLM117, Jan 19 '22 at 15:29
for the 1st file: could a 23bp show up more than once in the file, or is each 23bp unique within the file? — markp-fuso, Jan 19 '22 at 16:24
It's a good exercise and all, but aren't there [many tools for gRNA design](https://www.addgene.org/crispr/reference/#grna) that already generate this info? Also, what about reverse complement matches? — merv, Jan 20 '22 at 00:44
hi @merv, good point. so all the sequences are in the 5 -> 3 direction like the fasta file. I did consider some of these tools, but I'm actually working with multiple tsv's. I think given the size and number of files I'm trying to stick with custom scripts. It also not quite gRNA design, although I can see why the 23bp would suggest that! — KLM117, Jan 20 '22 at 11:13
The DNA in cells is double-stranded, so the reverse complement would be present experimentally. — merv, Jan 30 '22 at 14:58

choroba · Answer 1 · 2022-01-20T15:11:44.333

3

I'd reach for a programming language like Perl.

#!/usr/bin/perl
use warnings;
use strict;

my ($fasta_file, $bed_file) = @ARGV;
open my $fasta, '<', $fasta_file or die "$fasta_file: $!";
open my $bed,   '<', $bed_file   or die "$bed_file: $!";


my $seq;
while (<$fasta>) {
    $seq .= "\n", next if /^>/;
    chomp;
    $seq .= $_;
}

while (<$bed>) {
    chomp;
    my $short_seq = (split /\t/, $_)[-1];
    my $count = () = $seq =~ /\Q$short_seq\E/g;
    print "$_\t$count\n";
}

To count overlapping sequences, change the regex to a lookahead.

my $count = () = $seq =~ /(?=\Q$short_seq\E)/g;

edited Jan 20 '22 at 15:11

answered Jan 19 '22 at 15:34

choroba

231,213
25
204
289

hi @choroba, would I have to prespecify the filepaths to $fasta_file and $bed_file prior to ```my ($fasta_file, $bed_file) = @ARGV;```? Admiteddly I've never used perl! – KLM117 Jan 19 '22 at 17:31
1

Just specify them as arguments to the script. `perl count.pl hg19.fasta tsv.bed > new-tsv.bed` – choroba Jan 19 '22 at 17:38
thank you very much! I ran it as you suggested, although it is running and adding a new column, it's appears to be all zeros? I wonder if this might be something on my end? – KLM117 Jan 20 '22 at 10:33
When running it for the sample you gave (replacing spaces with tabs in the bed file), I'm getting 1 and 2. Maybe some control characters in the input? MSWin line ends? – choroba Jan 20 '22 at 12:22
got it running and again does exactly what I want it to do! I think the issue is just my files sizes (~3gb fasta_file and ~30mb tsv_file) made the output file appear as 0kb because text was being added slowly – KLM117 Jan 20 '22 at 15:01
If you have many sequences in the fasta file, it might work incorrectly. You might need to skip all the header lines (updated). – choroba Jan 20 '22 at 15:11
Also, specifying the proportions of the input and what you want to optimize (memory consumption or speed) might change the solution. – choroba Jan 20 '22 at 15:14

oliv · Answer 2 · 2022-01-20T08:49:55.397

3

If the second file is significantly bigger than the first one, I would try this awk script:

awk 'v==1 {a[$7];next}          # Get the pattern from first file into the array a
     v==2 {                     # For each line of the second file
       for(i in a){             # Loop though all patterns
         a[i]=split($0,b,i)-1   # Get the number of pattern match in the line
       }
     }
     v==3 {print $0,a[$7]}      # Re-read first file to add the number of pattern matches
 ' v=1 file1 v=2 file2 v=3 file1

edited Jan 20 '22 at 08:49

answered Jan 19 '22 at 15:54

oliv

12,690
25
45

Hi oliv, thanks for the reply, I'm just wondering, are "a" and "b" representative of file 1 and file 2? I'm still fairly new to awk so I hope you'll forgive me if its a dumb question! – KLM117 Jan 19 '22 at 16:22
@KLM117 no, both `a` and `b` are arrays. `a` contains the patterns. `b` is a dummy array needed by `split` function – oliv Jan 19 '22 at 16:25
thanks! I've tried to get it to run but no luck it seems, should it be appending the number of pattern matches to file1? – KLM117 Jan 19 '22 at 16:49
1

@KLM117 this answer will not append anything to `file1`; you'll need to capture the output to a new file (eg, `tempfile`), decide on what to do with `file1` (save a backup copy?) and then `mv tempfile file1`; also fwiw, this code does generate the desired results (in my console; using `GNU awk v 5.1.1`) so if you're not seeing any output (or the wrong output) you may want to update the question with the output from `awk --version` – markp-fuso Jan 19 '22 at 16:58
@markp-fuso, thank you! where would I insert the output command? e.g. ```v=1 file1 v=2 file2 v=3 file1 > tempfile```. Thanks for the note on GNU awk, I'm running GNU Awk v 4.0.2 so maybe that could be causing an issue. – KLM117 Jan 19 '22 at 17:10
@KLM117 `awk 4.0.2` should be fine - I wanted to rule out possible issues w/ `mawk`, `nawk`, `awk 3` (common w/ macos?), etc; yes, you have the right idea re: capturing output to ne file; another issue that comes to mind that may be causing problems ... do either of your files have Windows/DOS line endings (`\r\n`)? to check run `head -1 |od -c` and look for the string `\r` (if you see this then you have Windows/DOS line endings and these are likely the reson you see no output) – markp-fuso Jan 19 '22 at 17:55
a couple references re: dealing with the `\r`: [this](https://stackoverflow.com/q/11680815/7366100) and [this](https://riptutorial.com/awk/example/28963/modifying-rows-on-the-fly--e-g--to-fix-windows-line-endings-) – markp-fuso Jan 19 '22 at 18:03
@markp-fuso, ok that's good to hear! I haven't got access to the files until tomorrow morning, but I've run dos2unix on the script itself, would running that on my files work to fix that? – KLM117 Jan 19 '22 at 18:04
1

@KLM117 yes, running `dos2unix` on the data files should remove the `\r` characters – markp-fuso Jan 19 '22 at 18:05
@markp-fuso, I ran dos2unix on all the files, and still nothing it seems. Just to clarify, am I specifiying the files within the script, i.e. ```file1=path/to/file``` and then ```v =1 $file1```, or am I specifying them when I run the code as with the perl script suggested here? I.e. ```bash this_script.bash file2 file1 > file3```. Apologies if this is a dumb question. – KLM117 Jan 20 '22 at 11:43
@KLM117 I have no idea what's in your 'script' so I can't comment on that aspect of your issue; in your comment you've mentioned `v =1` (space between `v` and `=1`), but there should be no space; forget your script for a moment and at the command line run the `awk` code ... does it generate an error? generate no output? generate the wrong output? once you get the `awk` code working at the command line *then* look at placing it in your script and working through any details (in the script) – markp-fuso Jan 20 '22 at 15:02

Ted Lyngmo · Answer 3 · 2022-01-21T09:59:48.057

1

Since grep -c seems to give you the correct count (matching lines, not counting multiple occurances on the same line) you could read the 7 fields from the TSV (BED) file and just print them again with the grep output added to the end:

#!/bin/bash

# read the fields into the array `v`:
while read -ra v
do
    # print the 7 first elements in the array + the output from `grep -c`:
    echo "${v[@]:0:7}" "$(grep -Fc "${v[6]}" hg19.fasta)"
done < tsv.bed > outfile

outfile will now contain

1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2

Benchmarks

This table is a comparison of the three different solutions presented as answers here, with timings to finish different amount of tsv/bed records with the full hg19.fa file (excluding the records containing only N:s). hg19.fa contains 57'946'726 such records. As a baseline I've used two versions of a C++ program (called hgsearch/hgsearchmm). hgsearch reads the whole hg19.fa file into memory and then searches it in parallel. hgsearchmm uses a memory mapped file instead and then searches that (also in parallel).

search \ beds	1	2	100	1000	10000
awk	1m0.606s	17m19.899s	-	-	-
perl	13.263s	15.618s	4m48.751s	48m27.267s	-
bash/grep	2.088s	3.670s	3m27.378s	34m41.129s	-
hgsearch	8.776s	9.425s	30.619s	3m56.508s	38m43.984s
hgsearchmm	1.942s	2.146s	21.715s	3m28.265s	34m56.783s

The tests were run on an Intel Core i9 12 cores/24 HT:s in WSL/Ubuntu 20.04 (SSD disk). The sources for the scripts and baseline programs used can be found here

edited Jan 21 '22 at 09:59

answered Jan 19 '22 at 15:32

Ted Lyngmo

93,841
5
60
108

Hi Ted, just gave this a go and works! Seen you've just edited it so I'll give that a run, it does exactly what I want, but I think the size of the files I'm working with means it might be running a touch slow! – KLM117 Jan 19 '22 at 16:05
@KLM117 Great! Regarding speed: Yeah, it'll be running `grep` (opening and closing the fasta file) one time per line in the tsv.bed file so speed could become an issue. You could perhaps use `readarray` to read the whole fasta file into memory one time and then find your entries in that array. Might be faster or it may not be. Not sure. – Ted Lyngmo Jan 19 '22 at 16:08
@KLM117 Just for fun, I rewrote this in C++ using a memory mapped file on "hg19.fasta" and like to compare it with the above bash script. How many lines are there in tsv.bed and hg19.fasta? If they don't contain anything copyrighted I wouldn't mind downloading the actual files. – Ted Lyngmo Jan 20 '22 at 08:11
that awesome! So the hg19.fasta is the human reference genome 19, this is relatively massive file with 61,998,735 lines (~3gb). This is publically available and I tend to ftp it from the UCSC https://hgdownload.soe.ucsc.edu/downloads.html using this code: ```ftp hgdownload.soe.ucsc.edu #use "Anonymous" as name, and email as passwd. cd goldenPath/hg19/bigZips mget hg19.fa.gz gzip hg19.fa``` the tsv file tends to be around 600,000 lines (i'm actually running it on around 12 tsv files) I generated them myself so happy to send you one to try out with! – KLM117 Jan 20 '22 at 09:43
@KLM117 I got `hg19.fa`. Thanks! 4796636 lines contain all `N`:s. Perhaps one could treat them specially? Would it be ok to just ignore those lines or will you do matching against those too? And yes please if you could put one or tsv files up somewhere that'd be grand! One strange thing. Your first pattern, `ATGGTGCTTTGTTATGGCAGCTC` isn't found in `hg19.fa` anywhere ... – Ted Lyngmo Jan 20 '22 at 10:11
Hi Ted, yes, I think that removing the N's could be removed, they simply stand for any nucleotide (A,C,G, or T) and so aren't needed. I've thrown an example tsv in a github repository https://github.com/kai-lawsonmcdowall/tsv_file, it's ~420,000 lines/25mb. Let me know how it goes! – KLM117 Jan 20 '22 at 10:31
@KLM117 Nice! What's bothering me is that many of the entries in the tsv file are not found in the geonome file and I've found no entries with more than 1 match in ~60 first entries. Is that how it should be? – Ted Lyngmo Jan 20 '22 at 10:50
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241242/discussion-between-klm117-and-ted-lyngmo). – KLM117 Jan 20 '22 at 10:59

Count the number of times a substring appears in a file and place it in a new column

3 Answers3

Benchmarks