Extract lines based on a column matching one of multiple values

Question

I have some files containing the following data:

 160-68 160 68 B-A 0011 3.80247
 160-68 160 68 B-A 0022 3.73454
 160-69 160 69 B-A 0088 2.76641
 160-69 160 69 B-A 0022 3.54446
 160-69 160 69 B-A 0088 4.24609
 160-69 160 69 B-A 0011 3.97644
 160-69 160 69 B-A 0021 1.82292

I need to extract the lines whose 5th column matches any of the values in an array (the values can be negative, e.g. -12222).

Output with [0088, 0021]:

160-69 160 69 B-A 0088 2.76641
160-69 160 69 B-A 0088 4.24609
160-69 160 69 B-A 0021 1.82292

I'm currently doing this with Ruby, but is there a way to do it faster with Bash?

Thanks.

Bash is not really optimized for speed; unless you're doing it *poorly* with Ruby, it's unlikely that you'll get a speed improvement by switching to Bash. — ruakh, Jan 30 '16 at 22:41
With the benefit of hindsight: as is often the case, when people tag a question (just) [tag:bash] or ask for a "Bash solution", they don't actually mean a _pure_ Bash solution; rather, they mean: a solution based on _utilities that can be called from Bash_. — mklement0, Jan 30 '16 at 23:30

score 4 · Accepted Answer · answered Jan 30 '16 at 22:45

Bash is unlikely to be faster than Ruby: Bash is generally pretty slow. I'd pick awk or perl:

awk -v values="0088 0021" '
    BEGIN {
        n = split(values, a)
        for (i=1; i<=n; i++) b[a[i]]=1
    }
    $5 in b
' file

perl -ane 'BEGIN {%v = ("0088"=>1, "0021"=>1)} print if $v{$F[4]}' file
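If the wanted values already live in a Bash array, as the question suggests, they can be handed to the same awk script without hard-coding them. The following is only a sketch: the function name filter_col5 and the sample array are made up for illustration, and it assumes the values contain no whitespace or backslashes (awk interprets escape sequences in -v assignments).

```shell
#!/bin/bash
# filter_col5 FILE VALUE...
# Print the lines of FILE whose 5th field equals one of VALUE...
# (filter_col5 is a hypothetical helper name, not from the answer above.)
filter_col5() {
    local file=$1
    shift
    # "$*" joins the remaining arguments with spaces (default IFS),
    # which is the format split() expects in the awk script.
    awk -v values="$*" '
        BEGIN {
            n = split(values, a)
            for (i = 1; i <= n; i++) b[a[i]] = 1
        }
        $5 in b
    ' "$file"
}

# Example array; negative values are treated like any other string.
wanted=(0088 0021 -12222)
```

It could then be called as: filter_col5 file "${wanted[@]}"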

glenn jackman

I'm interested - Is there a difference between your `awk` solution and this `awk '$5 == "0021" || $5 == "0088"' file`? Thanks :-) – Rany Albeg Wein Jan 31 '16 at 03:44
@RanyAlbegWein: For the sample input at hand there is no difference, but the Awk command in this answer offers a _generic_ solution: the desired values are passed as _arguments_ (as Awk variables via `-v`) rather than being hard-coded into the script, and the script supports a _variable_ number of arguments. – mklement0 Jan 31 '16 at 04:03
No. It's just a different method to get the keys into the awk script. It depends on how many there are and whether the OP wants to hardcode them into the awk script or elsewhere – glenn jackman Jan 31 '16 at 04:04
`for (i=1; i<=n; i++)` can just be `for (i in a)` so then you also don't need the `n=` in `n = split(values, a)` and finally `b[a[i]]=1` doesn't need the assignment, just `b[a[i]]` is all you need, so the BEGIN section can just be `split(values, a); for (i in a) b[a[i]]`. – Ed Morton Jan 31 '16 at 13:57

peak · Answer 2 · 2016-01-31T05:52:40.600

1

Here's an egrep-based solution.

Suppose the array of special values is given as a simple CSV string, e.g.

A="0088,0021"

Then the following invocation of egrep will select the desired lines:

egrep "( [^ ]+){3} ($(tr , '|' <<< "$A")) " file

In practice, it would probably be better to modify the regex above to make it less brittle with respect to the input format.
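One way to make the regex less brittle is to anchor it at the start of the line and count fields explicitly, so that only the 5th field can match and it must match completely. This is a sketch under assumptions: match_col5 is a hypothetical helper name, fields are taken to be separated by blanks, and the values are assumed to contain no regex metacharacters.

```shell
#!/bin/bash
# match_col5 CSV FILE
# Print the lines of FILE whose 5th whitespace-separated field is one of
# the comma-separated values in CSV.
# (match_col5 is a hypothetical helper; values must be regex-safe.)
match_col5() {
    local alternation
    alternation=$(tr , '|' <<< "$1")    # "0088,0021" -> "0088|0021"
    # ^[[:space:]]*                      optional leading blanks
    # ([^[:space:]]+[[:space:]]+){4}     exactly four fields before the 5th
    # (...)([[:space:]]|$)               5th field must match in full
    grep -E "^[[:space:]]*([^[:space:]]+[[:space:]]+){4}(${alternation})([[:space:]]|\$)" "$2"
}
```

For example: match_col5 "$A" file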

If the elements of the array ($A) contain characters that are special to egrep (such as square brackets, parentheses, etc.), then some care will be required to escape them. This can be done programmatically, e.g.

A=$(sed 's/[]\.|$(){}?+*^]/\\&/g' <<< "$A")

See also the comment below.
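When the values come from a Bash array rather than a CSV string, each element can be escaped on its own and the results joined with `|`. The sketch below uses the bracket-wrapping sed command suggested in the comments on this answer; build_alternation is a hypothetical helper name, and elements are assumed to be single-line strings.

```shell
#!/bin/bash
# build_alternation VALUE...
# Print an ERE alternation in which every VALUE is matched literally.
# (build_alternation is a hypothetical helper name.)
build_alternation() {
    local v escaped out=
    for v in "$@"; do
        # Wrap every character except ^ in a bracket expression,
        # then backslash-escape any ^ characters.
        escaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<< "$v")
        out+=${out:+|}$escaped          # add "|" between elements only
    done
    printf '%s\n' "$out"
}
```

The result can be substituted into the pattern in place of the tr call, e.g. egrep "( [^ ]+){3} ($(build_alternation "${values[@]}")) " file, where values is an example array name.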

edited Jan 31 '16 at 05:52

answered Jan 31 '16 at 03:47

peak

With all due humility I suggest [this answer](http://stackoverflow.com/a/29613573/45375) as the most robust - and simpler - way to escape a literal for use in a regex; in short: `sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$varContainingLiteralToEscape"`. In addition, you'll still need double-quoting around the command substitution, as suggested earlier. – mklement0 Jan 31 '16 at 04:55
mklement0 - Thanks for your suggestions, which I've incorporated into the revised answer as you'll see. Please note that (a) my post had originally assumed that $A would be as the OP seemed to describe (i.e., positive or negative decimal integers); and (b) I mentioned the link to the SO page because it included at least one response illustrating how sed can be used to escape special characters in the REGEX component of `sed 's/REGEX/TO/'`. – peak Jan 31 '16 at 06:05

score -1 · Answer 3 · answered Jan 31 '16 at 06:14

Another solution

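#!/bin/bash
# Print every line of input.txt whose 5th field equals one of the arguments.
for value in "$@"; do
    while read -r line; do
        # Split the line into whitespace-separated fields.
        read -ra fields <<< "$line"
        if [[ ${fields[4]} == "$value" ]]; then
            printf '%s\n' "$line"
        fi
    done < input.txt
done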

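Here input.txt is the data file, and the script is called as ./scriptname 0088 0021. Because the file is re-read once per argument, matching lines are printed grouped by argument rather than in file order.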
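A single-pass variant is also possible in pure Bash by storing the wanted values in an associative array, which keeps the output in file order. This is a sketch that assumes Bash 4 or later (for `local -A`); filter_pure_bash is a hypothetical helper name, and it will still be much slower than awk on large files.

```shell
#!/bin/bash
# filter_pure_bash FILE VALUE...
# Print the lines of FILE whose 5th field equals one of VALUE...
# (hypothetical helper; requires Bash 4+ for associative arrays)
filter_pure_bash() {
    local file=$1
    shift
    local -A wanted=()                  # set of values to keep
    local v line fields
    for v in "$@"; do
        wanted[$v]=1
    done
    while IFS= read -r line; do
        read -ra fields <<< "$line"     # split on whitespace
        # Skip short lines; keep the line if its 5th field is in the set.
        if (( ${#fields[@]} >= 5 )) && [[ -n ${wanted[${fields[4]}]+x} ]]; then
            printf '%s\n' "$line"
        fi
    done < "$file"
}
```

For example: filter_pure_bash input.txt 0088 0021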
