
I have two files: one contains a list of individual entries (fileA) and the other contains a list of ranges (fileB).

I want to find out which entries in fileA fall within any of the ranges in fileB.

Sample entries from both files:

fileA

00100500000000
00100600000000
00100700000000
00100800000000
00100900000000
00101000000000 
00101300000000
00101500000000
00101600000000
00101700000000
00101710000000
00101800000000
35014080000000
35014088000000
35067373000000

fileB

00100200000000,00100200999999
00100300000000,00100300999999
00100100000000,00100100999999
00100400000000,00100400999999
00100500000000,00100500999999
00100600000000,00100600999999
00100700000000,00100700999999
00100800000000,00100800999999
00100900000000,00100900999999
00101000000000,00101000999999
00101300000000,00101300999999
00101500000000,00101500999999
00101600000000,00101600999999
35048702000000,35048702999999
35048802000000,35048802999999
35077160000000,35077160999999
35077820000000,35077820999999
35085600000000,35085600999999

I used the script below, but it takes about 6 days to process the 140k entries in fileA against the 50k ranges in fileB. Is there a way to make it much faster?

list=`cat fileB`
for mobno in $list
do
  LowVal="$(echo $mobno | cut -d, -f1)"
  HighVal="$(echo $mobno | cut -d, -f2)"

  while read ThisLine
  do
    [ ${ThisLine} -ge ${LowVal} ] && [ ${ThisLine} -le ${HighVal} ] && echo "${ThisLine}"
  done < fileA
done
ghoti
Raziel
    Do the files really have blank lines between every line? – Hakan Baba Aug 16 '17 at 14:38
  • This looks like an interval search problem. I suggest taking a look at interval trees in the literature. The asymptotic complexity of your algorithm will improve with that data structure; it looks like that is what you need with data sets in the 100Ks. – Hakan Baba Aug 16 '17 at 14:42
  • Another option is to sort the intervals twice. Once with respect to the start value and then with respect to the end value. For each entry in fileA you need to do two modified binary searches in these two sorted interval lists. – Hakan Baba Aug 16 '17 at 14:50
  • Are the fileB ranges ascending, like in your example? – Marcos Oliveira Aug 16 '17 at 14:54
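
The sort-and-merge idea from the comments above can be sketched as follows. This is a hypothetical illustration, not one of the posted answers: it assumes the ranges do not overlap, and `fileA.demo`/`fileB.demo` stand in for abridged sample data from the question.

```shell
# Abridged sample data from the question (hypothetical file names)
printf '%s\n' 00101300000000 00100500000000 35014080000000 > fileA.demo
printf '%s\n' 00101300000000,00101300999999 00100500000000,00100500999999 > fileB.demo

# Sort entries and ranges numerically, then match them in one linear pass:
# each range is visited at most once, instead of once per entry.
sort -n fileA.demo > fileA.sorted
sort -t, -k1,1n fileB.demo > fileB.sorted

awk -F, '
    NR == FNR { lo[NR] = $1; hi[NR] = $2; n = NR; next }  # load sorted ranges
    {
        while (i < n && hi[i+1] + 0 < $1 + 0) i++         # skip ranges ending before this entry
        if (i < n && lo[i+1] + 0 <= $1 + 0 && $1 + 0 <= hi[i+1] + 0)
            print $1                                      # entry falls inside the current range
    }
' fileB.sorted fileA.sorted
```

The two sorts cost O(N log N + M log M) and the merge is a single O(N+M) pass, versus the O(N×M) nested loops in the question.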

5 Answers

1

You would have to test it for performance, but the following awk script is an option:

NR == 1 && FNR == 1 {   # first line of the first file (fileB)
        strt=1
        }
FNR == 1 && NR != 1 {   # first line of the second file (fileA)
        strt=0
        }
strt==0 {               # fileA: compare each value against every range
        pos=$0
        for (i in ranges) {
                split(i,arry,",")
                if ( pos >= arry[1] && pos <= arry[2] ) {
                        print i" - "$0
                        }
                }
        }
strt==1 {               # fileB: store each range line as an array key
        ranges[$0]=""
        }

Run with:

 awk -f awkfile fileB fileA

Output:

00100500000000,00100500999999 - 00100500000000
00100600000000,00100600999999 - 00100600000000
00100700000000,00100700999999 - 00100700000000
00100800000000,00100800999999 - 00100800000000
00100900000000,00100900999999 - 00100900000000
00101000000000,00101000999999 - 00101000000000
00101300000000,00101300999999 - 00101300000000
00101500000000,00101500999999 - 00101500000000
00101600000000,00101600999999 - 00101600000000
00101700000000,00101700999999 - 00101700000000
00101710000000,00101710999999 - 00101710000000
00101800000000,00101800999999 - 00101800000000

We are essentially reading both files in, using the variable strt to detect the boundary between them: strt is 1 while the ranges from fileB are being loaded into the array (ranges), and 0 while each value from fileA is being compared. Each comparison splits a range key on the comma and checks whether the value lies between the two bounds; since every value is a fixed-width 14-digit string, the comparison is correct whether awk treats the operands as strings or as numbers.

Raman Sailopal
  • Unfortunately it is giving out undesired results. See an extract of the output; it seems you are checking only the last 12 digits: `86100800000000,86100803999999 - 00100800000000` / `00100800000000,00100800999999 - 00100800000000` / `01100800000000,01100800999999 - 00100800000000` – Raziel Aug 17 '17 at 07:05
  • I've changed the solution to process the whole string and not just the last 12 digits. – Raman Sailopal Aug 17 '17 at 08:44
  • Thanks very much Raman. My problem is resolved. It took 4hrs 35min to process all the entries. – Raziel Aug 17 '17 at 14:23
0

Two approaches:

-- with grep:

grep -of fileA fileB

-- with comm + sort + sed commands:

comm -12 <(sort fileA) <(sed 's/,/\n/' fileB | sort)

The output:

00100500000000
00100600000000
00100700000000
00100800000000
00100900000000
00101300000000
00101500000000
00101600000000
00101700000000
00101710000000
00101800000000
RomanPerekhrest
  • Mmm, no, he's looking to find values from fileA that are within *ranges* listed in fileB, he's not looking for exact matches. The fact that his sample data is lame should not confuse this issue. :-) – ghoti Aug 16 '17 at 20:44
0

If the ranges in fileB are ascending, like in your example, you just need the first and last values as LowVal and HighVal. (Note that this checks only the overall span, so it will also match entries that fall in gaps between the ranges.) Try this:

LowVal=$(head -n1 fileB | cut -d, -f1)
HighVal=$(tail -n1 fileB | cut -d, -f2)

awk -v LowVal="$LowVal" -v HighVal="$HighVal" '$0 >= LowVal && $0 <= HighVal' fileA
Marcos Oliveira
0

`cut` spawns two subshells per range, which seems to be why it is taking so much time. Try this code:

list=`cat fileB`
for mobno in $list
do
  IFS=', ' read -r -a array <<< "$mobno"
  LowVal=${array[0]}
  HighVal=${array[1]}

  while read ThisLine
  do
    [ ${ThisLine} -ge ${LowVal} ] && [ ${ThisLine} -le ${HighVal} ] && echo "${ThisLine}"
  done < fileA
done
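
As a further sketch (my own variation, not part of the answer above): plain parameter expansion splits the pair with no `read` call at all, though the nested loop over fileA remains the real bottleneck.

```shell
mobno=00100500000000,00100500999999   # one sample line from fileB

LowVal=${mobno%%,*}    # strip from the first comma to the end -> low bound
HighVal=${mobno##*,}   # strip everything up to the last comma -> high bound

echo "$LowVal $HighVal"
```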
0

Here's my take on this. awk is the tool to use. Here it is as a one-liner:

$ awk -F, 'NR==FNR{range[$1]=$2;next}{for(low in range){if($1>=low&&$1<=range[low]){print $1}}}' fileB fileA

Split out for easier commenting:

$ awk '

    BEGIN {
      FS=","         # Field separator, "-F," in the one-liner
    }

    NR==FNR {        # Run this bit on just the first file specified, your ranges
      range[$1]=$2   # Store the range in an array
      next
    }

    {                           # For each value in your data file,
      for (low in range) {      # step through the ranges
        if ($1 >= low && $1 <= range[low]) {  # and test.
          print $1              # If they pass, print the value.
        }
      }
    }

  ' fileB fileA

Note that this loads your full set of ranges into memory as an array, so it may have problems if fileB is millions of lines long. Try and see.

Note that this solution does not depend on either file being sorted or in any particular order, but it assumes that you have no ranges with common lows. That is, you will not have 5 ... 8 along with 5 ... 10. Your sample data doesn't have any of these, but it's only a sample.
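
If ranges with a common low can occur, one hypothetical workaround (my own variation, not something the question's data requires) is to keep only the widest high value per low while loading:

```shell
# Demo data with two ranges sharing a low value (the 5..8 / 5..10 case)
printf '%s\n' 00100500000000,00100500000005 00100500000000,00100500999999 > fileB.demo
printf '%s\n' 00100500000007 > fileA.demo

awk -F, '
    NR == FNR {
        # Keep the widest high for each low, so duplicate lows collapse
        if (!($1 in range) || $2 > range[$1]) range[$1] = $2
        next
    }
    {
        for (low in range)
            if ($1 >= low && $1 <= range[low]) { print; break }
    }
' fileB.demo fileA.demo
```

This is only valid for membership testing; if you need to know which exact range matched, collapsing hides the narrower one.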

I'd love to know by how much this solution beats your 6 day version. :-)


UPDATE #1

Here's the same logic in bash, for the fun of it. Again, I'd love to see speed comparisons on your dataset!

$ declare -A range=()
$ while IFS=, read -r a b; do range["$a"]="$b"; done < fileB
$ while read -r val; do for low in "${!range[@]}"; do [[ 10#$val -ge 10#$low && 10#$val -le 10#${range[$low]} ]] && echo "$val"; done; done < fileA

Or, broken out script-style (with comments)

declare -A range=()

while IFS=, read -r a b; do
  range["$a"]="$b"                      # Store the ranges in an associative array
done < fileB                            # (requires bash 4+)

while read -r val; do                   # Read values...
  for low in "${!range[@]}"; do         # Step through our range, and
    [[ 10#$val -ge 10#$low && 10#$val -le 10#${range[$low]} ]] &&
    echo "$val"                         # test and print.
  done
done < fileA

The one obtuse thing here is the 10# on the beginning of the values in the test. These are here because without them, bash interprets integers with leading zeros as octal numbers, which fails with your dataset because it includes 8's and 9's. :-)
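
The pitfall is easy to demonstrate at a prompt:

```shell
echo $(( 010 ))     # leading zero means octal: prints 8
echo $(( 10#010 ))  # 10# forces decimal: prints 10

# A value such as 00100800000000 contains the digit 8, which is invalid in
# octal, so without 10# bash aborts with "value too great for base".
```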


UPDATE #2

Purely for experimentation purposes, here's a variation that may work in bash version 3.

This still uses an array, but a traditional one rather than associative. As such, the indices are numeric, so the numeric comparisons for $low no longer require base padding (10#).

declare -a range=()

while IFS=, read -r a b; do
  range[10#"$a"]="$b"                   # Store the ranges in an indexed array
done < fileB                            # (10# strips the leading zeros from the index)

while read -r val; do                   # Read values...
  for low in "${!range[@]}"; do         # Step through our range, and
    [[ 10#$val -ge $low && 10#$val -le 10#${range[$low]} ]] &&
    echo "$val"                         # test and print.
  done
done < fileA
ghoti
  • I get the errors below: `line 1: declare: -A: invalid option`, `declare: usage: declare [-afFirtx] [-p] [name[=value] ...]`, `line 4: 00100800000000: value too great for base (error token is "00100800000000")` – Raziel Aug 17 '17 at 06:34
  • As I mentioned in comments, associative arrays require bash 4. What version of bash are you using? (`echo $BASH_VERSION` to check.) Note that macOS ships with bash version 3. If you're using macOS, you can install a newer bash from [macports](http://macports.org/) or [homebrew](http://brew.sh/). What OS and version are you running? (Your question is tagged **linux**.) – ghoti Aug 17 '17 at 06:37
  • Ah, venerable SLES 11. While you could install bash from source, or from an unauthorized rpm, it would be less work for you to use the awk solution, and it'll probably perform faster anyway. Perhaps it's time to think about upgrading, so you can get some more modern tools. Bash 4 was released in 2009! – ghoti Aug 17 '17 at 06:51
  • @Raziel, I've added a bash 3 option that appears to work for me in macOS (bash 3.2.57). I'd love to know if it works for you, and how it compares to the awk option with your dataset. – ghoti Aug 17 '17 at 07:02