How can I split a file by grouping identical lines, using a shell script or awk?
For example, I have one file with the following content:
1,1,1,1
2,2,2,2
3,3,3,3
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x
y,y,y,y
y,y,y,y
y,y,y,y
4,4,4,4
5,5,5,5
What I want is: each group of equal lines must go into its own separate file, and the remaining (different) lines must be split into files of at most a given number of lines. For example, with a limit of 10, the lines containing numbers are written to one split file until it holds 10 lines (<= 10); if there are more different lines than the limit, another split file is created, and so on.
The equal lines containing letters need their own separate files: one file only for the x,x,x,x lines, another for the y,y,y,y lines, and so on (basically, each file's contents are selected based on a field, say the 3rd field).
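A minimal sketch of how the repeated values of one field could be identified, assuming a plain comma-separated file with no quoted fields and a header on the first line. The function name repeated_keys is hypothetical and only used for illustration.

```shell
# repeated_keys COLUMN FILE  (hypothetical helper name)
# Prints each value of the given field that occurs on more than one data line.
# Assumes a simple CSV: comma-separated, no quoted fields, header on line 1.
repeated_keys() {
    tail -n +2 "$2" |      # skip the header line
        cut -d, -f"$1" |   # keep only the chosen field
        sort |             # make equal values adjacent
        uniq -d            # print one copy of each repeated value
}
```

For the sample above (with a header line added), repeated_keys 3 file.csv would print x and y. Using uniq -u in place of uniq -d would instead list the values that occur only once.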
The line contents are just an example. The real case is a CSV with different values in every column, where I need to group by a specific column value (I'm using sort and uniq for this). Either way, I need to split this CSV into groups of equal lines and into chunks of different lines (<= limit), using a shell script or awk (from what I see, awk performs better). I also need the header (the very first line) in each output file, without that header being duplicated within a file.
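The requirement described above could be sketched as a two-pass awk run: the first pass counts how often each key occurs, the second routes lines with a repeated key to a file per key and batches the remaining lines into chunks of at most LIMIT lines, each starting with the header. This is only a sketch under assumptions: a simple CSV with no quoted commas, a header on line 1, key values that are safe inside file names, and the hypothetical names split_groups, group_KEY.csv and split_N.csv.

```shell
# split_groups COLUMN FILE LIMIT [OUTDIR]  (hypothetical helper name)
# Assumes a simple CSV (no quoted commas), a header on line 1, and key
# values that are safe to use inside file names.
split_groups() {
    mkdir -p "${4:-.}" || return 1
    awk -F, -v col="$1" -v limit="$3" -v dir="${4:-.}" '
        # Pass 1: count how many data lines carry each key.
        NR == FNR { if (FNR > 1) count[$col]++; next }
        # Pass 2: remember the header, then route every data line.
        FNR == 1 { header = $0; next }
        count[$col] > 1 {
            # Repeated key: one file per key, header written once.
            out = dir "/group_" $col ".csv"
            if (!(out in seen)) { print header > out; seen[out] = 1 }
            print > out
            next
        }
        {
            # Unique key: start a new chunk every "limit" lines.
            if (n % limit == 0) {
                if (chunk != "") close(chunk)
                chunk = dir "/split_" (++k) ".csv"
                print header > chunk
            }
            print > chunk
            n++
        }
    ' "$2" "$2"
}
```

Called as split_groups 3 data.csv 10 out (data.csv and out being placeholder names), this reads the file twice, so no prior sort is needed. Every group file stays open until awk exits, so with a very large number of distinct repeated keys some awk implementations may hit their open-file limit; sorting by the key first and closing each group file when the key changes would be one way around that.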
Do you have any idea?
My current code is below (it keeps the first line because I'm assuming the CSV has a header):
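#!/bin/bash
# Arguments: COLUMN FILE LIMIT
COLUMN=$1
FILE=$2
LIMIT=$3
FILELENGTH=$(wc -l < "$FILE")
COUNTER=$LIMIT
NUMS=""
SORTED="sorted_$(basename "$FILE")"
# Sort numerically on the chosen column so that equal values become adjacent.
sort -t, -k "$COLUMN" -n "$FILE" > "$SORTED"
# Collect the record numbers after which the awk script below starts a new output file.
while [ "$COUNTER" -le "$FILELENGTH" ]; do
    NUMS+=$(uniq -c "$SORTED" | awk -v val="$COUNTER" '($1+prev)<=val {prev+=$1} END{print prev}')
    NUMS+=" "
    ((COUNTER+=LIMIT))
    echo "$NUMS | $COUNTER | $FILELENGTH | $SORTED"
done
awk -v nums="$NUMS" -v fname="$(basename "$2")" -v dname="$(dirname "$2")" '
# Remember the header line.
NR==1 { header=$0; next }
# First data line: load the split points and open the first output file.
(NR-1)==1 {
    c=split(nums,b)
    for(i=1; i<=c; i++) a[b[i]]
    j=1; out = dname "/" "splited" j "_" fname
    print header > out
    system("touch " out ".fin")
}
{ print > out }
# Split point reached: close the current file and start the next one.
NR in a {
    close(out)
    out = dname "/" "splited" ++j "_" fname
    print header > out
    system("touch " out ".fin")
}' "$SORTED"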