2

I have a file from which I need to find 10 characters before and after every substring instance.

For example, from:

(1M characters)...ldkS9jfasdfalkasjFalskdfjsDljBASHcslakfjsalZkf4djfsa3Jkjl...(1M characters)

I would like the output:

lskdfjsDljBASHcslakfjsal

Of course, in the file there are many instances of the string, and I would like to return all of them in the same manner of having the previous and subsequent 10 characters.

Right now I am using grep as follows:

grep -o -P '.{0,10}BASH.{0,10}' input.txt > output.txt

While this works, it seems to be very slow. Is there any way to speed up the process? Thanks in advance.

jww
  • 97,681
  • 90
  • 411
  • 885
  • 2
    Is there a reason you're searching for 0 to 10 characters before and after the string, rather than just exactly 10? From how you described your issue, it seems like there will always be at least 10 on either side (and if there's an instance of the string near the very beginning or end of the input, that can be caught with an additional, trivially quick step). `.{10}BASH.{10}` should be much, much faster. – CAustin Dec 17 '19 at 00:48
  • Thank you CAustin, I will try this. To be honest I thought {0,10} meant every character from position 0 to 10. I only want specifically 10 characters on either side. – Fighting Time Dec 17 '19 at 01:15
  • @FightingTime nope, it means BASH preceded with atleast 0 char and atmost 10 char. So for this `BASHsad` string where BASH exists at the start , your regex should find a match where Austin's won't because `BASH` preceded by 0 char – Avinash Raj Dec 17 '19 at 02:18
  • Also see questions like [Fastest possible grep](https://stackoverflow.com/q/9066609/608639), [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/q/42239179/608639) and [Fast string search in a very large file](https://stackoverflow.com/q/37693682/608639). – jww Dec 17 '19 at 02:22
  • try `awk -v RS='.{10}BASH.{10}' '{print RT}'`... also, if input is all ASCII, using `LC_ALL=C awk -v RS='.{10}BASH.{10}' '{print RT}'` will give better results... same with grep... `LC_ALL=C grep -oE '.{10}BASH.{10}'` – Sundeep Dec 17 '19 at 03:02
  • Thank you all for the very helpful answers. I found them all very helpful! – Fighting Time Dec 17 '19 at 05:48

2 Answers2

3

Would you try the following:

grep -F 'BASH' input.txt | grep -o -E '.{10}BASH.{10}'

Cascading multiple greps may in general look like an antipattern due to poorly-designed search patterns. In this case it works as follows: the 1st grep efficiently narrows down the lines which contain the target word with the -F (fixed) option; then the 2nd grep works to extract the substrings surrounding the word.

I have generated a text file with random characters of 100,000 columns and 10,000 lines (1Gbytes). Here is the benchmark result with an old Celeron CPU:

time grep -o -P '.{0,10}BASH.{0,10}' input.txt
=> 2m48s

time grep -F 'BASH' input.txt | grep -o -E '.{10}BASH.{10}'
=> 0m20s

BTW surprisingly I found nine BASH strings in the randomly generated ascii file.

[EDIT]

If you need to keep overlapping matches, please try:

grep -F 'BASH' file | perl -ne 'while (/(?=(.{10}BASH.{10}))/g) {print $1, "\n"}'

It requires no additional execution time compared to the answer above.

tshiono
  • 21,248
  • 2
  • 14
  • 22
  • 1
    Note that the -o option only reports non-overlapping matches, so occurrences of BASH that are too close together will not be reported separately using this technique. – peak Dec 17 '19 at 04:51
  • @peak thanks for the comment. You're right, but the OP's code also specifies `-o` option and I don't think the result differs from the OP's one in this sense. If we need to separate overlapping substrings, it will be achieved with small amount of additional time. It depends on the OP's requirements anyway. – tshiono Dec 17 '19 at 05:03
  • Understood, but the discrepancy between the main problem statement ("I need to find 10 characters before and after every substring instance") and the various proposed solutions so far might not be obvious to everyone. Of course, it might not be important either :-) – peak Dec 17 '19 at 05:48
  • @peak I see. Understood. I will study further in the background.:) – tshiono Dec 17 '19 at 05:56
  • Thanks for your responses. As peak mentioned, I need to keep overlapping matches, but at this point I prefer speed over the loss of some of those matches. In the future, I may use this thread to address the issue. https://unix.stackexchange.com/questions/276159/grep-that-works-with-overlapping-patterns – Fighting Time Dec 17 '19 at 05:59
  • I've updated my answer with a version which keeps overlapping matches. I do appreciate the comment by @peak . – tshiono Dec 17 '19 at 06:21
  • nice answer.. for time comparison, not sure why you are using `grep -o -P '.{0,10}BASH.{0,10}'` instead of `grep -oE '.{10}BASH.{10}'` ... also, if you have ripgrep, could you check how `rg -oP '.{10}BASH(?=(.{10}))' -r '$0$1'` performs? – Sundeep Dec 17 '19 at 12:59
  • I can safely say that whereas my previous code without splitting the file is still running since yesterday for a single operation, the above perl code took around 5 minutes to complete. The key was to parse the lines into segments first (I did around 60 characters per line). Thanks! – Fighting Time Dec 17 '19 at 22:00
  • 1
    @Sundeep Thank you for the comment. As for the time comparison, I didn't put a deep meaning by changing `{0,10}` to `{10}`. I should have changed the conditions one by one. Indeed there is no difference in time between `{0,10}` and `{10}`. It may be better to use `{0,10}` considering the case the target word is located in the beginning or ending of the line. I've installed `rg` and measured the performance with your suggested code. It took 40sec in the same condition. Amazingly faster than the original `grep` but my proposal seems to be still efficient. BR. – tshiono Dec 18 '19 at 00:54
  • @FightingTime Thank you for the feedback. My pleasure to know it's working. BR. – tshiono Dec 18 '19 at 00:56
  • 1
    @Sundeep I've also tested: `rg -F 'BASH' file | rg -oP '.{10}BASH(?=(.{10}))' -r '$0$1' which turns out to take only 0.6 sec! – tshiono Dec 18 '19 at 01:30
1

I have a file from which I need to find 10 characters before and after every substring instance.

Interpreted literally, this means that naive uses of grep -o cannot in general meet the requirements, because this option only reports non-overlapping sequences.

To illustrate, suppose for simplicity that the substring of interest is "X", and that the window on either side must have length 3.

Then given the string "aaaXaaXaaa" the output (according to the requirement statement) must be the two lines:

aaaXaaX
XaaXaaa

Here's a script that illustrates a solution using :

#!/bin/bash

for x in X aXa aaaXaaa aaaXaaXaaa aaaXXaaa
do
  echo $x ::
  jq -Rrs --arg ss X --argjson n 3 '
    . as $in
    | indices($ss)[] as $i
    | select($i-$n >=0 and $i+$n <= length)
    | $in[$i-$n:$i+$n+1]' <<< "$x"
  echo
done

Note that the -s option here in effect causes control characters, such as newline, to be regarded as single characters.

Output

X ::

aXa ::

aaaXaaa ::
aaaXaaa

aaaXaaXaaa ::
aaaXaaX
XaaXaaa

aaaXXaaa ::
aaaXXaa
aaXXaaa
peak
  • 105,803
  • 17
  • 152
  • 177