
I have a big file in which the third field ($3) of each line is a value representing time.

I want to split the file into several files, each holding the lines that fall within one time interval. The number of lines can differ from one file to the next.

Example

Input file:

$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"
$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"

If I want to split by an interval of 5 seconds, I will have 3 files:

file1:

$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"

file5:

$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"

file10:

$xx_ at 13.0 "$elt_(1) coordinates 380.78 1279.63 7.90"

Also, within each file, I want to keep each element only once (the last time it appears), and I want to keep only the index of the element and the two numeric fields right after coordinates:

file1:

0 649.08 1812.52 
1 366.2 1277.44 

Update: I tried to combine the two answers I got to reach a solution:

awk 'BEGIN{n=1}{x=$3;if(x>n*5){++n}{print > "file" n*5}}' file

for (i in file){awk 'BEGIN{}{if(($3+0)>max[$1])
{max[$1]=$3; line[$1]=$0}}END{for(i in line)
{print line[i];}}' file[i]}

The second part (taken from the proposed uniq.awk), when tried on a single file, gives me only one line instead of all the unique lines.

Moreover, the for loop gives me an error, although this is all I added for it:

for (i in file){}

2 Answers


I wrote two awk scripts that accomplish this when used together. Invoke the first one (testsort.awk) like this:

./testsort.awk test.txt

where test.txt is the input file. The script prints some diagnostics to the terminal; the real output goes to the files named file0, file5, and so on.

testsort.awk calls uniq.awk internally (both are included below).

testsort.awk:

#! /bin/gawk -f

BEGIN{max=0;}{

  #use an array to map time values to first column value lists
  if($3 in arr){
    arr[$3]=arr[$3]" "$1;
  }else{
    arr[$3]=$1;
  }

  #use another array to store the whole line
  arr2[$3"_"$1]=$0;

  #keep track of the maximum time observed
  if(($3+0)>max){
    max=($3+0);
  }
}
END{

  #sort them into their files starting at zero
  for(i=0;i<max;i+=5){
    for(j in arr){
      split(arr[j],a," ")
      for(k in a){
        idx=j"_"a[k];
        num=(j+0);
        #interval is (i, i+5]; the first file also takes time 0
        if(num<=i+5 && (num>i || i==0)){
          output["file"i]=output["file"i]arr2[idx]"\n"
        }
      }
    }
  }

  #write the appropriate files
  for(i in output){
    print i;
    print output[i];
    if(length(output[i])>0){
      system("echo \""output[i]"\" |./uniq.awk|sort >"i);
    }
  }
}

uniq.awk:

#! /bin/gawk -f

BEGIN{}{

  #find the maxes
  if(($3+0)>max[$1]){
    max[$1]=$3
    line[$1]=$0
  }

}
END{

  #write the appropriate files
  for(i in line){
    print line[i];
  }
}

The solution also depends on having the shell utility sort.

EDIT: the format of the input file was changed in the question, so now I would do:

  1. $ sed -e 's/[$]//g' < test.txt > test_new.txt
     to get rid of the dollar signs in the original input

  2. $ ./testsort_new.awk test_new.txt

new file testsort_new.awk:

#! /usr/bin/awk -f

BEGIN{max=0;}{

  #use an array to map time values to first column value lists
  if($3 in arr){
    arr[$3]=arr[$3]" "$4;
  }else{
    arr[$3]=$4;
  }

  #use another array to store the whole line
  arr2[$3"_"$4]=$0;

  #keep track of the maximum time observed
  if(($3+0)>max){
    max=($3+0);
  }
}
END{

  #sort them into their files starting at zero
  for(i=0;i<max;i+=5){
    for(j in arr){
      split(arr[j],a," ")
      for(k in a){
        idx=j"_"a[k];
        num=(j+0);
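        #interval is (i, i+5]; the first file also takes time 0,
        #so a time on a boundary lands in one file only
        if(num<=i+5 && (num>i || i==0)){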
          output["file"i]=output["file"i]arr2[idx]"\n"
        }
      }
    }
  }

  #write the appropriate files
  for(i in output){
    print i;
    print output[i];
    if(length(output[i])>0){
      target=output[i];
      gsub("\"","\\\"",target);
      system("echo \""target"\" |./uniq_new.awk|sort -k4 >"i);
    }
  }
}

new file uniq_new.awk:

#! /bin/awk -f

BEGIN{}{

  #find the maxes
  if(($3+0)>max[$4]){
    max[$4]=$3
    line[$4]=$0
  }

}
END{

  #write the appropriate files
  for(i in line){
    print line[i];
  }
}

The dollar signs will not be reproduced in the output.
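
For comparison, here is a more compact single-pass sketch; it is not one of the scripts above. It splits by interval, keeps the last occurrence of each element, and prints only the index and the two coordinates. It rests on several assumptions: the input is sorted by time, a time that falls exactly on a boundary (5, 10, ...) belongs to the earlier interval as in the question's example, and the names sample.txt and bucket_N.txt are hypothetical. The dollar signs do not need to be stripped first, because the lines are never passed through the shell.

```shell
# Sample input copied from the question (the quoted heredoc keeps the $ signs literal).
cat > sample.txt <<'EOF'
$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 1.0 "$elt_(0) coordinates 654.99 1818.19 1.44"
$xx_ at 1.0 "$elt_(1) coordinates 365.41 1284.31 0.00"
$xx_ at 4.0 "$elt_(0) coordinates 652.74 1816.04 3.12"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 5.0 "$elt_(0) coordinates 649.08 1812.52 5.08"
$xx_ at 5.0 "$elt_(1) coordinates 366.2 1277.44 4.37"
$xx_ at 8.0 "$elt_(0) coordinates 643.59 1807.23 7.62"
$xx_ at 8.0 "$elt_(1) coordinates 366.88 1271.47 6.01"
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"
EOF

awk -v w=5 '
{
    t = $3 + 0
    b = int(t / w)                        # interval number
    if (b > 0 && b * w == t) b--          # a boundary time stays in the earlier interval
    id = $4
    sub(/^[^(]*\(/, "", id)               # strip everything up to the opening parenthesis
    sub(/\).*$/, "", id)                  # strip the closing parenthesis and the rest
    last[b * w, id] = id " " $6 " " $7    # later lines overwrite earlier ones
}
END {
    for (key in last) {
        split(key, part, SUBSEP)          # part[1] is the interval start
        cmd = "sort -n > bucket_" part[1] ".txt"
        print last[key] | cmd
        used[cmd] = 1
    }
    for (cmd in used) close(cmd)
}' sample.txt
```

With the sample input this writes bucket_0.txt and bucket_5.txt, each holding one line per element. Only one record per element and interval is held in memory, so the approach should also suit large inputs.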

villaa
  • Do I have to use `#! /bin/gawk -f`? It gives me a bad interpreter error: no such file or directory. I tried replacing it with /bin/bash, but then I get errors from the program. – student beginner Mar 10 '16 at 16:58
  • Even after installing the gawk package I still get the same error, although when I type gawk alone I get its help. – student beginner Mar 10 '16 at 17:04
  • @studentbeginner `gawk` refers to a specific implementation of `awk` [see here](http://unix.stackexchange.com/questions/29576/difference-between-gawk-vs-awk). On my system `awk` is the same as `gawk` because it's been aliased. I'm not sure what's the situation on your system. You could try `$ls -lh \`which awk\`` to find out. Since I don't think I used any features specific to `gawk` you can probably just replace that first line with the path given by `$which awk`. – villaa Mar 10 '16 at 21:02
  • `$ls -lh `which awk` ` gives me: `lrwxrwxrwx 1 root root 21 oct 13 21:11 /usr/bin/awk -> /etc/alternatives/awk` . But even if i put ` #! /bin/awk `, the same error comes up – student beginner Mar 10 '16 at 21:36
  • try using `#! /usr/bin/awk` – villaa Mar 10 '16 at 21:39
  • it gives me this `awk: cmd. line:1: ./testsort.awk awk: cmd. line:1: ^ syntax error awk: cmd. line:1: ./testsort.awk awk: cmd. line:1: ^ unterminated regexp` – student beginner Mar 10 '16 at 21:43
  • but if i try the solution given by cron, it works (although it is not the desired result). So it means awk works fine. I tried to update my question in top with mixing both answers but i have an error. – student beginner Mar 10 '16 at 21:45
  • What I meant to say was using `#! /usr/bin/awk -f` but I'm not sure that will help the error you just gave. – villaa Mar 10 '16 at 21:45
  • also you have to be sure to give it the file you want to use as input as an argument like `./testsort.awk test.txt` where `test.txt` is your input file and make the files executable `chmod +x testsort.awk` `chmod +x uniq.awk` – villaa Mar 10 '16 at 21:48
  • i already did the chmod for the files. with -f, it works now for the lines if they are simple but not for this type of line `$xx_ at 0.0 "$elt_(0) is 656.02 1819.19 0.00"`. is it because of the symbol $ – student beginner Mar 10 '16 at 21:57
  • I don't understand that line could you edit the question to show a block of example lines that are like the ones you're trying to parse? the solution above requires the elt_(0) to be in the first position for every line and the time to be in the third. – villaa Mar 10 '16 at 22:05
  • time is in 3rd position but elt_ is not at first one. See edit on top – student beginner Mar 10 '16 at 22:14
  • but just a question about previous what does -f changed at the first line(for the awk to work) – student beginner Mar 10 '16 at 22:15
  • when you write that first commented line in a Unix script it is typically called a [Shebang](https://en.wikipedia.org/wiki/Shebang_%28Unix%29). It instructs the program loader to use a specific interpreter to interpret your file. The Bash interpreter is set up to take files as input, but the awk interpreter treats its first command-line argument as the program text (here that would be the script's file name) unless you give it the -f flag, in which case it reads the program from that file, like bash. – villaa Mar 10 '16 at 23:40
  • Ok, I understand the -f thank you. I copied your edited codes and chmod +x the two files but when i run `./testsort_new.awk test_new.txt`, I get the output of file 0 on terminal then I get the error `sh: 12: ./uniq_new.awk: not found`. But the file is there in the same folder with execute rights – student beginner Mar 11 '16 at 00:35
  • my bad, I hadn't changed the shebang from /bin/awk to /usr/bin/awk in the uniq_new.awk file. Now it works. Thanks a lot – student beginner Mar 11 '16 at 00:43
  • to get only the 3 fields i want, i am trying `#! /usr/bin/awk -f BEGIN { FS="[_() ]"} END{ for (i in file*) for ( line in $(cat "file"i)) print $7 "\t" $10 "\t" $11 > "result"i; }` but it is not working – student beginner Mar 11 '16 at 01:48
  • on each output file you can do `cat file |awk 'BEGIN{FS="[ ()]"}{ print $5" "$8" "$9;}'` – villaa Mar 11 '16 at 15:45
  • It works on a single file but how can i do it for all the files together. `find . -name "file*"|while read fname; do for line in $(cat "$fname") do awk 'BEGIN{FS="[ ()]"}{ print $5" "$8" "$9; > $fname}' done done` – student beginner Mar 11 '16 at 16:16
  • try this one-liner `for i in $(ls file*); do cat "$i"|awk 'BEGIN{FS="[ ()]"}{ print $5" "$8" "$9;}' > "$i".new; done;` assuming that the only files in your directory that match `file*` are the ones you want to transform – villaa Mar 11 '16 at 17:38
  • Thanks a lot, it works fine. I just have to split the original file before using it (maybe because it is around 7gb). – student beginner Mar 12 '16 at 18:20
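
The per-file extraction discussed in the comments above can also be sketched without parsing the output of ls or piping through cat. This is only an illustration: the directory demo_split and its sample file are hypothetical, and the input is assumed to have had its dollar signs stripped already, as in the EDIT.

```shell
# Hypothetical sample file in the post-sed format (no dollar signs).
mkdir -p demo_split
cat > demo_split/file0 <<'EOF'
xx_ at 5.0 "elt_(0) coordinates 649.08 1812.52 5.08"
xx_ at 5.0 "elt_(1) coordinates 366.2 1277.44 4.37"
EOF

for f in demo_split/file*; do
    case $f in *.new) continue ;; esac   # skip results left by an earlier run
    [ -f "$f" ] || continue              # the glob matched nothing
    # Splitting on space and parentheses puts the index in $5 and the coordinates in $8 and $9.
    awk -F'[ ()]' '{ print $5, $8, $9 }' "$f" > "$f.new"
done
```

Letting the shell expand the glob directly keeps file names with unusual characters intact, and passing the file name to awk avoids one extra process per file.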

I can't work out the exact requirement from the input, but try the following:

awk 'BEGIN{n=1}{x=$3;if(x>n*5){++n}{print > ("file" n)}}' file
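
A variant sketch of the one-liner above, not part of the original answer: it advances the counter in a loop, so a jump in time larger than one interval still lands in the right file, and it closes each output file once the time has moved past it. The input is assumed to be sorted by time; the names sample_gap.txt and split_N.txt are hypothetical.

```shell
# Hypothetical input with a gap: nothing falls between 5 and 10 seconds.
cat > sample_gap.txt <<'EOF'
$xx_ at 0.0 "$elt_(0) coordinates 656.02 1819.19 0.00"
$xx_ at 4.0 "$elt_(1) coordinates 365.7 1281.79 2.54"
$xx_ at 13.0 "$elt_(1) coordinates 380.78 1279.63 7.90"
EOF

awk -v w=5 '
BEGIN { n = 1 }
{
    t = $3 + 0
    while (t > n * w) {            # step over intervals, including empty ones
        close("split_" n ".txt")   # harmless if that file was never opened
        n++
    }
    print > ("split_" n ".txt")    # parentheses keep the target unambiguous
}' sample_gap.txt
```

Here the first two lines go to split_1.txt and the last one to split_3.txt; no split_2.txt is created because no line falls in that interval.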
Cron
  • This works for the splitting part just fine but then i want to remove the duplicate elements. Is there nothing like 'uniq' but which can be applied based on the value of a field? – student beginner Mar 10 '16 at 18:37
  • just saw [this post](http://stackoverflow.com/questions/35916509/only-output-line-if-value-in-specific-column-is-unique) today, I think it's similar to what you'd want there – villaa Mar 10 '16 at 21:56
  • yes, it is similar but i don't want to remove lines that are repeated like him. – student beginner Mar 10 '16 at 22:05