I need to check for multiple phrases in txt files, and if a file contains any of them on a particular line, remove that line from the txt file.

Using inverse grep (`grep -v`) with a file containing the phrases to be removed works like a charm.

THE PROBLEM is that I need to search only part of each line, rather than the whole line.

I need to check only the part of the line up to the 10th comma character. If grep finds a phrase after that point, I want to keep the line; if it matches before that point, I want to remove the line.

I can't figure out how to use a regex alongside the phrases file. Any suggestions welcome.

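#!/bin/bash

shopt -s globstar

for f in /uploads/txt/original/**/*.txt ; do

  # keep only the lines that match none of the phrases
  # (-i: case-insensitive, -w: whole words, -v: invert the match)
  grep -i -v -w -f phrase.txt "$f" > tmp
  mv tmp "$f"

done

echo "Finished!"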

EDIT

# Rule to set the flag if the line needs to be printed or not
{
    ok = 1
    # loop up to the tenth column
    for (i = 1; i <= 10; i++){
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}

Does this mean that every column is checked against every pattern in PATS?

Would it be possible to merge the 10 columns into one string and then check that string against all patterns, instead of checking each of the 10 columns against all patterns?

Sam Axe
  • Can you avoid the loop with `grep -i -v -w -f phrase.txt <(cat /uploads/txt/original/**/*.txt)` ? – Walter A Nov 02 '16 at 21:23
  • @WalterA not sure how cutting lines in phrase.txt would help in this instance, as I need to search part of each line in the txt files, not in phrase.txt, which contains the phrases that I'm searching for in those txt files. – Sam Axe Nov 02 '16 at 21:54
  • Mixed up the files. Can you use your solution on a <(cat /uploads/txt/original/**/*.txt| cut -d"," -f1-9) and grep the result from <(cat /uploads/txt/original/**/*.txt)? This will fail for overlapping phrases in phrase.txt, but might be something in your case. – Walter A Nov 02 '16 at 22:35
  • @JayRajput Could you please have a look at the EDIT part of the question? I'm trying to improve speed and to understand how it currently works. – Sam Axe Nov 14 '16 at 17:13

1 Answer

Input data /tmp/test

Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
foo,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, BAR,  Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, FOO,   Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, BAR,   Val11, Val12

Phrases /tmp/phrases

FOO
BAR

Awk Script with comments

#!/usr/bin/gawk -f

BEGIN {
    FS         = " *, *" # Field separator regex: a comma with optional surrounding spaces
    IGNORECASE = 1       # gawk extension: case-insensitive regex matching

    # Read the phrases file into an array.
    # Prepend '^' and append '$' to each phrase so it must match a whole field.
    # Testing "> 0" ends the loop on end of file and also on a read error
    # (getline returns -1 if the file cannot be read).
    while ((getline a < "/tmp/phrases") > 0) PATS["^" a "$"]
}

# Rule to set the flag that decides whether the line is printed
{
    ok = 1
    # loop over the first ten columns
    for (i = 1; i <= 10; i++){
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}

# Rule that actually prints the line if the flag is set
ok {print}

# Debugging rule: prints the patterns after the data. Remove it for real use
# (it is not part of the one-liner below).
END { for (p in PATS) print p }

# One liner
#  gawk 'BEGIN{FS=" *, *";IGNORECASE=1;while((getline a < "/tmp/phrases")>0)PATS["^"a"$"]}{ok=1;for(i=1;i<=10;i++){for(p in PATS){if($i ~ p){ok=0}}}} ok {print}' /tmp/test

Output:

Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR,   Val12

Credit goes to this answer https://stackoverflow.com/a/14471194/2032943
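
Regarding the EDIT in the question (merging the first 10 columns into one string and testing it once), a minimal sketch could look like the following. It is not the script above, but a variation under some assumptions: the phrases are plain text rather than regexes, "word characters" are ASCII letters, digits and underscore, and the helper name `filter_head` is made up for illustration. It joins the phrases into a single alternation, so each line needs one regex match instead of 10 × (number of phrases), and it also matches a phrase that is only part of a field (e.g. `FOO some other text`).

```shell
#!/bin/bash
# filter_head PHRASES_FILE DATA_FILE   (hypothetical helper name)
# Prints the lines of DATA_FILE whose text before the 10th comma contains
# none of the phrases, matched case-insensitively as whole words.
filter_head() {
  awk -F',' -v phrases="$1" '
    BEGIN {
      # Build one alternation from all phrases: lower-case each phrase and
      # backslash-escape regex metacharacters so it is matched literally.
      while ((getline p < phrases) > 0) {
        if (p == "") continue
        p = tolower(p)
        gsub(/[][\\.^$*+?(){}|]/, "\\\\&", p)
        alt = (alt == "" ? p : alt "|" p)
      }
      close(phrases)
      # Whole-word match: the phrase must be bounded by a non-word character
      # or by the start/end of the checked text.
      re = "(^|[^a-z0-9_])(" alt ")([^a-z0-9_]|$)"
    }
    {
      # Rebuild the text before the 10th comma from the first 10 fields.
      head = $1
      for (i = 2; i <= 10 && i <= NF; i++) head = head "," $i
      # Print the line unless the lower-cased head matches any phrase.
      if (alt == "" || tolower(head) !~ re) print
    }
  ' "$2"
}
```

With the sample files above it would be called as `filter_head /tmp/phrases /tmp/test`. The sketch lower-cases both sides instead of relying on `IGNORECASE`, so it should not depend on gawk-specific features. Whether it is actually faster on real data is worth measuring rather than assuming.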

Jay Rajput
  • Great and detailed answer, although by 10th character I meant the 10th occurrence of a specific character, whose position in the string I don't know, so substr wouldn't work here. E.g.: my, line, looks, like, this, with, a lot of, different text, but I want, to check only, up to 10th, comma character. Also, is it not possible to wrap p with a regex, as my text file contains only phrases, not regex patterns? I want the regex to be case insensitive as well as to look for a whole word/phrase. – Sam Axe Nov 03 '16 at 08:34
  • Thank you, works perfectly. Much slower than grep, but it saved the day. Where can I find all gawk options? – Sam Axe Nov 04 '16 at 19:20
  • The GNU manual for awk is pretty good and verbose. I will generally look at the O'Reilly book or google for help with awk. If you are seeing performance issues, you probably want to switch to python and do it much faster. The awk is slow because of all those for loops. – Jay Rajput Nov 04 '16 at 20:16
  • I came across another issue. Probably I didn't specify it correctly in my question. Works in this case: `FOO, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12` Doesn't work when the keyword is part of a sentence: `FOO some other text, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12` Any tips? – Sam Axe Nov 09 '16 at 16:06
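
For the two points raised in these comments (speed compared to grep, and a keyword that is only part of a field), one possible alternative is to keep using grep and generate the patterns from the phrases file, so that each pattern can only match before the 10th comma. The sketch below is an illustration under assumptions, not a tested replacement: the helper name `grep_head` is hypothetical, phrases are assumed to contain no commas, and word characters are taken to be letters, digits and underscore.

```shell
#!/bin/bash
# grep_head PHRASES_FILE DATA_FILE   (hypothetical helper name)
# Prints the lines of DATA_FILE in which no phrase occurs, as a whole word
# and ignoring case, before the 10th comma.
grep_head() {
  pats=$(mktemp) || return 1
  # Turn every phrase into an extended regex:
  #   1. drop empty lines (an empty phrase would match every line)
  #   2. backslash-escape regex metacharacters so the phrase is literal
  #   3. allow at most 9 complete fields before the field containing the
  #      phrase, and require a non-word character (or a field/line edge)
  #      on both sides of the phrase
  sed -e '/^$/d' \
      -e 's/[][\.^$*+?(){}|]/\\&/g' \
      -e 's/.*/^([^,]*,){0,9}([^,]*[^[:alnum:]_,])?&([^[:alnum:]_]|$)/' \
      "$1" > "$pats"
  grep -E -i -v -f "$pats" "$2"
  status=$?          # 0: some lines kept, 1: no lines kept, >1: error
  rm -f "$pats"
  return "$status"
}
```

In the loop from the question, `grep -i -v -w -f phrase.txt "$f" > tmp` could then become `grep_head phrase.txt "$f" > tmp`. Since the generated patterns are more complex than fixed phrases, it is worth timing this against the awk version on real data before relying on it.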