1

I must process a CSV file, and I need to repace some values from a column with a specific index number. Example

10/06-14:04:21.082467 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:22.082715 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:23.082940 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:24.083256 ,1917,33219,239.255.255.250,1900,,,,
10/06-14:04:27.421793 ,1418,64878,192.168.0.13,161,0xC498BF38,0x0,,
10/06-14:04:27.522099 ,1418,,64879,192.168.0.13,161,0xC499BF39,0x0,,
10/06-14:04:33.445012 ,1421,64878,192.168.0.13,705,0xC498BF38,0x0,,
10/06-14:04:33.545144 ,1421,192.168.0.130xC498BF38,0x0,,

The column of interest is Column number 2. I extract it into a separate tmp file and I looks like this:

1917
1917
1917
1917
1418
1418
1421
1421

My expected output is:

5
5
5
5
2
2
0
0

Until now I use the script below:

csvtool col 2  $file>tmp && echo " TEMPORARY Column DONE "&& echo "Start the Magic"

## Perform an AWK comparation between the integers intervals and the terget number.
##In the end write everything out in the new column .
awk '{ 
if( ($1>=363 && $1<=499) || ($1>=4645 && $1<=4646)){ print 0}  
else if( ($1>=2174 && $1<=2193)) { print 1}  
else if( ($1==500) || ($1>=12308 && $1<=12356)){ print 2} 
else if( ($1>=103 && $1<=220) || ($1>=252 && $1<=299) || ($1>=1980 && $1<=1986) || ($1>=2921 && $1<=2922)){ print 3} 
else if( ($1>=221 && $1<=251) || ($1>=8085 && $1<=8091) || ($1==8350) || ($1>=12809 && $1<=12945) || ($1>=16834 && $1<=17033)){ print 4} 
else if( ($1>=300 && $1<=362) || ($1==522) || ($1>=2923 && $1<=2925) || ($1>=3441 && $1<=3442) || ($1==4644)|| ($1>=5677 && $1<=5695) || ($1>=8082 && $1<=8083)|| ($1>=8093 && $1<=8349) || ($1>=12946 && $1<=12947) || ($1>=12986 && $1<=13215) || ($1>=13309 && $1<=13311)){ print 5}
else if( ($1>=501 && $1<=504) || ($1>=566 && $1<=600) || ($1>=613 && $1<=637) ||  ($1>=2015 && $1<=2040) ||  ($1>=2103 && $1<=2126) || ($1>=2373 && $1<=2374) || ($1>=3828 && $1<=4125) || ($1>=4237 && $1<=4636) || ($1>=4647 && $1<=4889) || ($1>=4991 && $1<=5676) || ($1>=5696 && $1<=5705) || ($1>=6502 && $1<=6595) || ($1>=8429 && $1<=8460) || ($1>=8552 && $1<=8699) || ($1>=10487 && $1<=10977) || ($1>=11326 && $1<=11617) || ($1>=11688 && $1<=11815) || ($1>=11844 && $1<=11938) || ($1>=12490 && $1<=12597) || ($1>=12973 && $1<=12982) || ($1>=13367 && $1<=13414)){ print 6}
else if( ($1>=523 && $1<=548) || ($1>=555 && $1<=565) || ($1>=2005 && $1<=2014) || ($1>=2041 && $1<=2063) || ($1>=2091 && $1<=2102) ||  ($1==2394) || ($1>=2407 && $1<=2411) || ($1>=2926 && $1<=3008) || ($1>=3443 && $1<=3473) || ($1>=3486 && $1<=3813) || ($1>=4132 && $1<=4144) || ($1>=4637 && $1<=4643) || ($1>=4916 && $1<=4981) || ($1>=5711 && $1<=5741) || ($1>=6403 && $1<=6405) || ($1>=6415 && $1<=6466) || ($1>=6701 && $1<=7002) || ($1>=7035 && $1<=7048) || ($1>=8426 && $1<=8428) || ($1>=8496 && $1<=8541) || ($1>=8857 && $1<=9323) || ($1>=9429 && $1<=9618) || ($1>=9674 && $1<=9789) || ($1>=9802 && $1<=9811) || ($1>=9850 && $1<=10009) || ($1>=10131 && $1<=10136) || ($1>=10396 && $1<=10402) || ($1>=11000 && $1<=11175) || ($1==11618) || ($1>=12100 && $1<=12111) || ($1>=12212 && $1<=12219) || ($1==12489) || ($1>=12807 && $1<=12808) || ($1==12983) || ($1>=14616 && $1<=14627) || ($1>=15723 && $1<=15897)){ print 7}
else if( ($1==521) || ($1==554) || ($1>=601 && $1<=612) || ($1>=651 && $1<=708) || ($1>=1905 && $1<=1942) || ($1>=1949 && $1<=1979) || ($1>=1987 && $1<=1993) || ($1>=2259 && $1<=2278) || ($1>=2352 && $1<=2362) || ($1>=2395 && $1<=2406) || ($1>=2412 && $1<=2449) || ($1>=2673 && $1<=2919) || ($1>=3009 && $1<=3016) || ($1>=3814 && $1<=3827) || ($1>=4126 && $1<=4131) || ($1>=4982 && $1<=4990) || ($1>=5706 && $1<=5710) || ($1>=6012 && $1<=6181) || ($1>=6285 && $1<=6339) || ($1>=6409 && $1<=6411) || ($1>=6596 && $1<=6700) || ($1>=7191 && $1<=7424) || ($1==8081) || ($1>=8550 && $1<=8551) || ($1>=8700 && $1<=8716) || ($1>=9324 && $1<=9326) || ($1>=9619 && $1<=9624) || ($1==9729) || ($1>=10018 && $1<=10064) || ($1>=10115 && $1<=10126) || ($1>=10198 && $1<=10386) || ($1==10486) || ($1>=12112 && $1<=12115) || ($1>=12209 && $1<=12211)){ print 8}
else if( ($1>=489 && $1<=498) || ($1>=505 && $1<=520) || ($1>=549 && $1<=553) || ($1>=638 && $1<=650) || ($1>=709 && $1<=1904) || ($1>=1943 && $1<=1948) || ($1>=1994 && $1<=2004) || ($1>=2064 && $1<=2090) || ($1>=2127 && $1<=2173) || ($1>=2194 && $1<=2258) || ($1>=2279 && $1<=2351) || ($1>=2363 && $1<=2372) || ($1==2393) || ($1>=2450 && $1<=2672) || ($1>=3474 && $1<=3485) || ($1>=4145 && $1<=4236) || ($1>=4890 && $1<=4915) || ($1>=5742 && $1<=6011) || ($1>=7003 && $1<=7034) || ($1>=7049 && $1<=7295) || ($1>=7425 && $1<=8080) || ($1==8084) || ($1>=8352 && $1<=8425) || ($1>=8461 && $1<=8495) || ($1>=8542 && $1<=8549) || ($1>=8717 && $1<=8856) || ($1>=9327 && $1<=9428) || ($1>=9625 && $1<=9673) || ($1>=9790 && $1<=9791) || ($1>=9793 && $1<=9801) || ($1>=9812 && $1<=9849) || ($1>=10010 && $1<=10017) || ($1>=10065 && $1<=10114) || ($1>=10128 && $1<=10130) || ($1>=10137 && $1<=10197) || ($1>=10387 && $1<=10395) || ($1>=10403 && $1<=10485) || ($1>=10978 && $1<=10999) || ($1>=11176 && $1<=11325) || ($1>=11620 && $1<=11687) || ($1>=11816 && $1<=11843) || ($1>=11939 && $1<=12099) || ($1>=12116 && $1<=12208) || ($1>=12220 && $1<=12307) || ($1>=12357 && $1<=12488) || ($1>=12598 && $1<=12806) || ($1>=12948 && $1<=12972) || ($1>=13216 && $1<=13306) || ($1>=13312 && $1<=13366) || ($1>=13415 && $1<=14615) || ($1>=14628 && $1<=15722) || ($1>=15989 && $1<=16833) || ($1>=17402 && $1<=17431)){ print 9}
}' tmp

This work as a charm until an update hit me. If until now, my maximum integer value was 17431. Now that I got 50421 witch means that I must declare more intervals on my solution but after inserting all the new intervals in their places, the script stop working with the error:

AWK argument list too long 

Do you have any idea how to operate on such a large number of intervals?

I was considering the fallowing: Given the fact that I can create a map file like:

Target,Index
103,1
104,1
105,2
106,5
107,8
108,9
109,6
110,9
111,6
112,9
113,9
114,9
115,9
116,9
117,9
118,9
119,9
120,9

Where Target is my number to look for and the Index is the value that will replace my Target number. How can I import the first column into an Array1 and the second column into Array2 and for every line from tmp file check the target value position form Array1 and print the same potion of Array2

Example: If 1917 is on potion Array1[1853] then print Array1[1853] (the value from the same potion), given the fact that the 2 Arrays are equal in terms of the number of elements.

The main quiescence is: There is any way to fix my script to accept all the new 6000 intervals? If AWK can not support it this way, what do you recommend?

Weiss Willy
  • 109
  • 10
  • 4
    How do you define your intervals? There must be some logic behind it. – kvantour Oct 08 '19 at 11:59
  • 4
    I agree with @kvantour, those can't just be arbitrary numbers you're using so some kind of programatic approach is probably possible rather than hard-coding all those conditions but if not - save the script in a file and execute it as `awk -f scriptfile tmp` instead of passing the script as a string to awk. See https://stackoverflow.com/q/11475221/1745001. – Ed Morton Oct 08 '19 at 11:59
  • in addition to Ed Morton advise, please explain how the input csv is converted to the expected single column output. – Dudi Boy Oct 08 '19 at 12:01
  • Not sure whether this is related to the problem, but as a general remark, @kvantour is right: Even if the condition thresholds are non-systematic it would be more elegant and maintainable to keep them in an array (perhaps read from a file! what an improvement in maintainability!) and loop through them instead of hard-coding them in the script. – Peter - Reinstate Monica Oct 08 '19 at 12:17
  • I have no control on target-index allocation. I know there is some algorithm behind it but I do not control that. All I have is the map file and the task to monitor a file and allocate the index in real time. – Weiss Willy Oct 08 '19 at 13:11
  • 2
    Wait - you have a map file? Are you hard-coding values in your script that you're copying from some other file? – Ed Morton Oct 08 '19 at 13:15

4 Answers4

4

Even if you do have to hard-code your ranges, do it like this so you can validate that you don't have any holes in your ranges or any cases where the same value could be present in multiple ranges and can handle the input with just a hash lookup rather than having to loop through the ranges for every line of input:

$ cat tst.awk
BEGIN {
    # r[] = ranges[], v = value
    v = 0
    r[363,499]          = v
    r[4645,4646]        = v

    v = 1
    r[2174,2193]        = v
    r[500]              = v
    r[12308,12356]      = v

    populate(r,map)
}

$1 in map { print map[$1] }

function populate(ranges,map,    cnt,range,begend,beg,end,val,n,i) {
    for (range in ranges) {
        n = split(range,begend,SUBSEP)
        beg = begend[1]
        end = begend[n]
        val = ranges[range]
        for (i=beg; i<=end; i++) {
            map[i] = val
            cnt[i]++
        }
        min = ((min == "") || (min > beg) ? beg : min)
        max = ((max == "") || (max < end) ? end : max)
    }

    for (i=min; i<=max; i++) {
        if ( cnt[i] != 1 ) {
            if ( cnt[i] == 0 ) {
                printf "Hole: %d\n", i | "cat>&2"
            }
            else {
                printf "Overlap: %d\n", i | "cat>&2"
            }
        }
    }
}

.

$ echo 2180 | awk -f  tst.awk 2>/dev/null
1

$ echo 370 | awk -f  tst.awk 2>/dev/null
0

I'm redirecting stderr to /dev/null above since I haven't populated r[] fully and so there will be hundreds of holes being reported.

Obviously you can trivially populate r[] from a file rather than hard-coding the values in your script if you prefer but since that first part of the BEGIN section is the only place you'd specify the ranges and they're trivial to write so unlikely to mess up, keeping the data in the script itself isn't so bad in this case.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
3

First create a map file like: first column is the min value, the second column is max, the third column is the output. Let's name the file mapfile.txt:

 363  499 0
4645 4646 0
2174 2193 1
...

Then run awk such as (untested, typos expected):

awk 'FNR == NR { ++i; min[i]=$1; max[i]=$2; result[i]=$3; }
     FNR != NR { 
          for (j = 1; j <= i; ++j) { 
              if (min[j] <= $1 && $1 <= max[j]) {
                  print result[j];
                  break
              } 
          }
     }
' mapfile.txt second_column_values.txt

First we read the map file into memory and three arrays. Then we check the value for min/max and print the result if found. Then if the result is found - we break from the loop.

Alternatively, if you have work with huge files, you could do this:

  1. First create another mapfile such the first column is the value and the second is the result. It could be generated from the mapfile.txt above with something like while read a b c; do seq -f "%.0f $c" $a $b; done < mapfile.txt

363 0
364 0
365 0
...
  1. Remember to sort this file. Let's call it mapfile2.txt

  2. Then having numbers from column 2, add a line number on each line, sort it on the second column, then join it with the mapfile2.txt, re-sort on line numbers and remove line numbers.

nl -w1 second_column_values.txt | sort -s -k2 |
join -12 -21 - <(<mapfile2.txt sort -s -k1) |
sort -s -k1 | cut -f2-

Or do the same without numbering the lines if the order of lines does not matter. I think sort+join-ing the files could be faster then plain array lookup with range comparison in very extreme cases.

KamilCuk
  • 120,984
  • 8
  • 59
  • 111
  • Dudi Boy give me the solution that work for the moment given the fact that I already have intervals done but I will try to implement your solution as further development because my map-file will be updated regularly and is much easy to refer myself to a single file then to perform non-stop checks if the map-file was updated then generate new ranges, place them into the new script and runt the script.....your solution is much easier for the future. Thank you. – Weiss Willy Oct 08 '19 at 12:43
  • 1
    Ok. So we have to move the ++i in front of assigment? and iterate from `j = 1`? – KamilCuk Oct 08 '19 at 13:12
1

Here is a small change that will work. Create an awk script file and than run it.

script.awk

## Perform an AWK comparation between the integers intervals and the terget number.
##In the end write everything out in the new column .
{ 
if( ($1>=363 && $1<=499) || ($1>=4645 && $1<=4646)){ print 0}  
else if( ($1>=2174 && $1<=2193)) { print 1}  
else if( ($1==500) || ($1>=12308 && $1<=12356)){ print 2} 
else if( ($1>=103 && $1<=220) || ($1>=252 && $1<=299) || ($1>=1980 && $1<=1986) || ($1>=2921 && $1<=2922)){ print 3} 
else if( ($1>=221 && $1<=251) || ($1>=8085 && $1<=8091) || ($1==8350) || ($1>=12809 && $1<=12945) || ($1>=16834 && $1<=17033)){ print 4} 
else if( ($1>=300 && $1<=362) || ($1==522) || ($1>=2923 && $1<=2925) || ($1>=3441 && $1<=3442) || ($1==4644)|| ($1>=5677 && $1<=5695) || ($1>=8082 && $1<=8083)|| ($1>=8093 && $1<=8349) || ($1>=12946 && $1<=12947) || ($1>=12986 && $1<=13215) || ($1>=13309 && $1<=13311)){ print 5}
else if( ($1>=501 && $1<=504) || ($1>=566 && $1<=600) || ($1>=613 && $1<=637) ||  ($1>=2015 && $1<=2040) ||  ($1>=2103 && $1<=2126) || ($1>=2373 && $1<=2374) || ($1>=3828 && $1<=4125) || ($1>=4237 && $1<=4636) || ($1>=4647 && $1<=4889) || ($1>=4991 && $1<=5676) || ($1>=5696 && $1<=5705) || ($1>=6502 && $1<=6595) || ($1>=8429 && $1<=8460) || ($1>=8552 && $1<=8699) || ($1>=10487 && $1<=10977) || ($1>=11326 && $1<=11617) || ($1>=11688 && $1<=11815) || ($1>=11844 && $1<=11938) || ($1>=12490 && $1<=12597) || ($1>=12973 && $1<=12982) || ($1>=13367 && $1<=13414)){ print 6}
else if( ($1>=523 && $1<=548) || ($1>=555 && $1<=565) || ($1>=2005 && $1<=2014) || ($1>=2041 && $1<=2063) || ($1>=2091 && $1<=2102) ||  ($1==2394) || ($1>=2407 && $1<=2411) || ($1>=2926 && $1<=3008) || ($1>=3443 && $1<=3473) || ($1>=3486 && $1<=3813) || ($1>=4132 && $1<=4144) || ($1>=4637 && $1<=4643) || ($1>=4916 && $1<=4981) || ($1>=5711 && $1<=5741) || ($1>=6403 && $1<=6405) || ($1>=6415 && $1<=6466) || ($1>=6701 && $1<=7002) || ($1>=7035 && $1<=7048) || ($1>=8426 && $1<=8428) || ($1>=8496 && $1<=8541) || ($1>=8857 && $1<=9323) || ($1>=9429 && $1<=9618) || ($1>=9674 && $1<=9789) || ($1>=9802 && $1<=9811) || ($1>=9850 && $1<=10009) || ($1>=10131 && $1<=10136) || ($1>=10396 && $1<=10402) || ($1>=11000 && $1<=11175) || ($1==11618) || ($1>=12100 && $1<=12111) || ($1>=12212 && $1<=12219) || ($1==12489) || ($1>=12807 && $1<=12808) || ($1==12983) || ($1>=14616 && $1<=14627) || ($1>=15723 && $1<=15897)){ print 7}
else if( ($1==521) || ($1==554) || ($1>=601 && $1<=612) || ($1>=651 && $1<=708) || ($1>=1905 && $1<=1942) || ($1>=1949 && $1<=1979) || ($1>=1987 && $1<=1993) || ($1>=2259 && $1<=2278) || ($1>=2352 && $1<=2362) || ($1>=2395 && $1<=2406) || ($1>=2412 && $1<=2449) || ($1>=2673 && $1<=2919) || ($1>=3009 && $1<=3016) || ($1>=3814 && $1<=3827) || ($1>=4126 && $1<=4131) || ($1>=4982 && $1<=4990) || ($1>=5706 && $1<=5710) || ($1>=6012 && $1<=6181) || ($1>=6285 && $1<=6339) || ($1>=6409 && $1<=6411) || ($1>=6596 && $1<=6700) || ($1>=7191 && $1<=7424) || ($1==8081) || ($1>=8550 && $1<=8551) || ($1>=8700 && $1<=8716) || ($1>=9324 && $1<=9326) || ($1>=9619 && $1<=9624) || ($1==9729) || ($1>=10018 && $1<=10064) || ($1>=10115 && $1<=10126) || ($1>=10198 && $1<=10386) || ($1==10486) || ($1>=12112 && $1<=12115) || ($1>=12209 && $1<=12211)){ print 8}
else if( ($1>=489 && $1<=498) || ($1>=505 && $1<=520) || ($1>=549 && $1<=553) || ($1>=638 && $1<=650) || ($1>=709 && $1<=1904) || ($1>=1943 && $1<=1948) || ($1>=1994 && $1<=2004) || ($1>=2064 && $1<=2090) || ($1>=2127 && $1<=2173) || ($1>=2194 && $1<=2258) || ($1>=2279 && $1<=2351) || ($1>=2363 && $1<=2372) || ($1==2393) || ($1>=2450 && $1<=2672) || ($1>=3474 && $1<=3485) || ($1>=4145 && $1<=4236) || ($1>=4890 && $1<=4915) || ($1>=5742 && $1<=6011) || ($1>=7003 && $1<=7034) || ($1>=7049 && $1<=7295) || ($1>=7425 && $1<=8080) || ($1==8084) || ($1>=8352 && $1<=8425) || ($1>=8461 && $1<=8495) || ($1>=8542 && $1<=8549) || ($1>=8717 && $1<=8856) || ($1>=9327 && $1<=9428) || ($1>=9625 && $1<=9673) || ($1>=9790 && $1<=9791) || ($1>=9793 && $1<=9801) || ($1>=9812 && $1<=9849) || ($1>=10010 && $1<=10017) || ($1>=10065 && $1<=10114) || ($1>=10128 && $1<=10130) || ($1>=10137 && $1<=10197) || ($1>=10387 && $1<=10395) || ($1>=10403 && $1<=10485) || ($1>=10978 && $1<=10999) || ($1>=11176 && $1<=11325) || ($1>=11620 && $1<=11687) || ($1>=11816 && $1<=11843) || ($1>=11939 && $1<=12099) || ($1>=12116 && $1<=12208) || ($1>=12220 && $1<=12307) || ($1>=12357 && $1<=12488) || ($1>=12598 && $1<=12806) || ($1>=12948 && $1<=12972) || ($1>=13216 && $1<=13306) || ($1>=13312 && $1<=13366) || ($1>=13415 && $1<=14615) || ($1>=14628 && $1<=15722) || ($1>=15989 && $1<=16833) || ($1>=17402 && $1<=17431)){ print 9}
}

running the script:

awk -f script.awk input.csv
Dudi Boy
  • 4,551
  • 1
  • 15
  • 30
0

I suggest another solution for this problem. It will be easier to maintain and use lookup table to prevent repeated comparisons.

This is the initial set up to the first 6 filters.

The user is advised to add the filters 6,7,8,9 and increase loop counter from 5 to 9.

script.awk

# initialize filter range pairs
# each filter is a pair of lower/upper ranges
BEGIN {
     filterArr[0] = "363,499,4645,4646";
     filterArr[1] = "2174,2193";
     filterArr[2] = "500,500,12308,12356";
     filterArr[3] = "103,220,252,299,1980,1986,2921,2921";
     filterArr[4] = "221,251,8085,8091,8350,8350,12809,12945,16834,17033";
     filterArr[5] = "300,362,522,522,2923,2925,3441,3442,4644,4644,5677,5695,8082,8083,8093,8349,12946,12947,12986,13215,13309,13311";
}

# for each input line from input file, loop through all the filters (call scanFilter utility funtion).
{
     for (filter = 0; filter <= 1; filter++) scanFilter(filter);
}

# utility function to scan input field $1 to be in ranges
function scanFilter(filterIdx) {
     # if input field alread found, return its value
     if ($1 in foundAlready) {
          print foundAlready[$1];
          return;
     }
     # need to scan the input field through all filters
     rangesCount = split(filterArr[filterIdx], rangesArr, ",");
     # rangesCount holds the range pair values (lower, upper)
     # rangesArr hold rangePairs: rangesArr[1]=1st range lower,rangesArr[2]=1st range upper,rangesArr[3]=2nd range lower,rangesArr[4]=2nd range upper, etc

     for (rangeCounter = 1; rangeCounter <= rangesCount; rangeCounter += 2) {
          # test $1 for each upper range and lower range
          if ($1 >= rangesArr[rangeCounter] && $1 <= rangesArr[rangeCounter + 1]) {
               print filterIdx;
               foundAlready[$1] = filterIdx;
               return;
          }
     }
}
Dudi Boy
  • 4,551
  • 1
  • 15
  • 30