
How can I split files by grouping the same lines using shell script or awk?

For example, I have one file with the following content:

1,1,1,1
2,2,2,2
3,3,3,3
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x
y,y,y,y
y,y,y,y
y,y,y,y
4,4,4,4
5,5,5,5

What I want is: all equal lines form a group and must go into their own separate file, while the remaining distinct lines are split across files up to a given limit. For example, with a limit of 10, the distinct lines containing numbers should be written to a split file until it holds 10 lines (<= 10); if there are more distinct lines than the limit, another split file is created, and so on.

The groups of equal lines containing letters each need their own separate file: one file only for the x,x,x,x lines, another for the y,y,y,y lines, and so on (basically grouping the file's contents by a field, say the 3rd field for example).

The content of the lines is just an example; the real case is a CSV containing different values in all columns, where I need to group by a specific column value (I'm using sort and uniq for this). Either way, I need to split this CSV into one file per group of equal lines, plus files of distinct lines with <= limit lines each, using shell script or awk (I see awk provides better performance). I also need the header (the very first line) in each output file, with no duplicate of that header content within an output file.
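To illustrate (lowering the limit to 3 so the split is visible, and assuming the sample above also had a header line as its first row), I would expect output files like:

splited1_file.csv: header, then 1,1,1,1 / 2,2,2,2 / 3,3,3,3
splited2_file.csv: header, then 4,4,4,4 / 5,5,5,5
group_x.csv:       header, then the four x,x,x,x lines
group_y.csv:       header, then the three y,y,y,y lines

(the file names here are only placeholders)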

Do you have any idea?

My current code is below (it keeps the first line, since I'm assuming the CSV has a header):

#!/bin/bash
COLUMN=$1
FILE=$2
LIMIT=$3
FILELENGTH=$(wc -l < "$FILE")
COUNTER=$LIMIT
NUMS=""
SORTED="sorted_$(basename "$FILE")"

# Sort numerically by the requested column
sort -t, -k "$COLUMN" -n "$FILE" > "$SORTED"
# For each multiple of LIMIT, record the last line number that still fits below it
while [ "$COUNTER" -le "$FILELENGTH" ]; do
        NUMS+=$(uniq -c "$SORTED" | awk -v val="$COUNTER" '($1+prev)<=val {prev+=$1} END{print prev}')
        NUMS+=" "
        ((COUNTER+=LIMIT))
        echo "$NUMS | $COUNTER | $FILELENGTH | $SORTED"   # debug trace
done

awk -v nums="$NUMS" -v fname=`basename $2` -v dname=`dirname $2` '
   NR==1 { header=$0; next}
   (NR-1)==1 {
        c=split(nums,b)
        for(i=1; i<=c; i++) a[b[i]]
        j=1; out = dname"/" "splited" j "_"fname
        print header > out
        system("touch "out".fin")
    }
    { print > out }
    NR in a {
        close(out)
        out = dname "/" "splited" ++j "_"fname
        print header > out
        system("touch "out".fin")
    }' $SORTED
bmelo
  • If you don't have lines that are all `.`s in your real data, then don't put such lines (`.......`) in your sample data. Just give us a [mcve] with concise, testable sample input and expected output that we can copy/paste to test with (without having to edit out a bunch of useless `.......` lines) so we can help you. – Ed Morton Aug 18 '22 at 01:00
  • You say `I have specific limit as 10` but we don't want to see a lengthy example with multiple 10+ line chunks, just make it, say, 3 for your question - you can change 3 to 10 or whatever else you want later. – Ed Morton Aug 18 '22 at 01:02
  • As the [bash tag](https://stackoverflow.com/questions/tagged/bash) you used instructs, you should copy/paste your script into http://shellcheck.net and fix the issues it tells you about. Also read [correct-bash-and-shell-script-variable-capitalization](https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization). – Ed Morton Aug 18 '22 at 01:03

3 Answers


With GNU awk you could try the following code, written as per your shown samples. It makes two passes over the Input_file. Lines that occur more than once in the Input_file are written to an output file named after their first field, e.g. firstfieldValue.outFile, while unique lines (having only one occurrence in your Input_file) go to files named 1.singleOccurrence.outFile, 2.singleOccurrence.outFile and so on.

To keep the header (the very first line of your Input_file) in each output file, try the following awk code:

awk '
BEGIN{
  count1="1"
  FS=OFS=","
}
NR==1{ headers = $0; next }                 # remember the header (very first line)
FNR==NR && FNR>1{                           # 1st pass: count how often each line occurs
  arr[$0]++
  next
}
$0 == headers{ next }                       # 2nd pass: never re-print the header as data
arr[$0]>1{                                  # duplicated lines: one file per first field
  if(!arr1[$1".outFile"]++){ print headers > ($1".outFile") }
  print > ($1".outFile")
  next
}
{                                           # unique lines: chunks of 10 per file
  if(++count2%10==0){ close(count1".singleOccurrence.outFile") }
  count1+=(count2%10==0?1:0)
  if(prev!=count1){print headers > count1".singleOccurrence.outFile"}
  print > (count1".singleOccurrence.outFile")
  prev=count1
}
'  Input_file  Input_file
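As the comment thread below works through, grouping by a specific column rather than by the whole line only requires changing the array key. Here is a minimal sketch of that variation, assuming the 3rd field is the grouping key (so duplicated groups land in files named after that field's value):

awk '
BEGIN{
  count1="1"
  FS=OFS=","
}
NR==1{ headers = $0; next }
FNR==NR && FNR>1{ arr[$3]++; next }         # 1st pass: count occurrences of the 3rd field
$0 == headers{ next }
arr[$3]>1{                                  # grouped lines: one file per 3rd-field value
  if(!arr1[$3".outFile"]++){ print headers > ($3".outFile") }
  print > ($3".outFile")
  next
}
{                                           # lines with a unique 3rd field: chunks of 10
  if(++count2%10==0){ close(count1".singleOccurrence.outFile") }
  count1+=(count2%10==0?1:0)
  if(prev!=count1){print headers > count1".singleOccurrence.outFile"}
  print > (count1".singleOccurrence.outFile")
  prev=count1
}
' Input_file Input_file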
RavinderSingh13
  • GOOD! Thanks! In my real scenario I need to split comparing by the column value, so I believe it's just change the arr[$0] to arr[$columnNumber] right? And how can I keep the header across these files? Last question, if I have 1 million lines in the original file how can I manage the error awk: cannot open "1021.singleOccurrence.outFile" for output (Too many open files) ? – bmelo Aug 18 '22 at 01:16
  • @bmelo, For your first query here, I need to understand it better(so you mean you want to save files by field value OR field number? For your 2nd query use GNU `awk`(OR preferably called `gawk`)'s latest version at that time we shouldn't get too many files opened error IMHO, let me know. – RavinderSingh13 Aug 18 '22 at 01:18
  • @bmelo, Apart from my ask in previous comment, for headers you want to keep in each output file's first line? – RavinderSingh13 Aug 18 '22 at 01:20
  • 1] for example, i need to group the files by the third column, so all the lines containing different values for third column will be splited by the limited <= 10 files, and those having equals values in other files 2] well, my gawk version is restricted to old one because...well, production server... =/ 3] yes, the first line of original file needs to be replicated in all the other ones. – bmelo Aug 18 '22 at 01:27
  • @bmelo, for your group query if its 3 field/column by which you want to have file names then change FROM `arr[$0]++` TO `arr[$3]++` AND from `arr[$0]>1` TO `arr[$3]>1`. 2nd: this needs gawk newest version at least try this in a sandbox(non-prod one) will try to come up with non-gnu(but that is going to be very lengthy and less efficient trust me on it). 3rd: I already added solution(OR one in my answer), so that will keep headers in output file, you need to just make changes as per my 1st point here. Try these and let me know how it goes, cheers. – RavinderSingh13 Aug 18 '22 at 01:30
  • 2] Can I use close(file) inside to avoid this problem? 3] The header is almost right. I have an example using 0,0,0,0 as header and it's working but in the 1.outFile I have this line appearing at header and at 4th line. In the 1.singleOccurence I have the header as first line and second line (first is correct). And in 4.outFile (the last file) I have correctly in just first line. – bmelo Aug 18 '22 at 01:37
  • @bmelo, I have done fix in code for headers kindly check updated code and let me know if it works for you. For closing file it's bit tricky, if your contents would have been sorted then we would have closed files easily. I am afraid to use sort with this code it may slow down things, how about sorting files before running this code(make sure headers coming first line only) and then run this code? By the way how many output files are getting created? Let me know how it goes – RavinderSingh13 Aug 18 '22 at 01:45
  • For the first file 1.outFile I'm still getting repeated header at 4th line. Yeah, I think I need to sort by specific column anyway, for example the second or third column. So I can use the close()? And will I be able to split equals line even if they are not consecutive each other? – bmelo Aug 18 '22 at 01:51
  • @bmelo, if you can sort it with 3rd field(separately, don't merge with this code for performance perspective) then we can place close of file(somewhere inside `if(!arr1[$0]++)` and `if(prev!=count1)` conditions respectively I believe so. For headers only 1st file is having issues? If yes then is it's 3rd field occurring single time OR multiple times? Please confirm once. – RavinderSingh13 Aug 18 '22 at 01:54
  • single time at 4th line. – bmelo Aug 18 '22 at 01:57
  • @bmelo, you mean its 4th line in your Input_file and its 3rd field occurring only once correct? If this is the case, sorry but I can't see any issues with it honestly, cheers. – RavinderSingh13 Aug 18 '22 at 01:59
  • @bmelo, I have added closing files logic now in code, kindly do check it once and let me know how it goes. – RavinderSingh13 Aug 18 '22 at 16:41
  • worked very well. but please try this sample: https://pastebin.com/XXq6Tuhr Look, the 0,0,0,0 is the header and it's being repeated in the 4th line of the first file. – bmelo Aug 19 '22 at 01:04
  • @bmelo, sure, just be with me, checking quickly and getting back here. – RavinderSingh13 Aug 19 '22 at 01:23
  • @bmelo, so you mean headers(first line) shouldn't be repeating any where in more than 1 occurrence of file? Please confirm once and I will edit it then. And any other issues, we are all set with other things right? apart from this. – RavinderSingh13 Aug 19 '22 at 01:28
  • Exactly. The header is unique, because in real scenario header is the name of the columns. – bmelo Aug 19 '22 at 01:33
  • @bmelo, I have updated my 2nd solution now, please check it; once you confirm I will edit my first one also. Let me know how it goes and we can quickly close this one :) – RavinderSingh13 Aug 19 '22 at 01:36
  • 0,0,0,0 1,1,1,1 1,1,1,1 0,0,0,0 1,1,9,1 1,1,9,1 1,1,1,1 1,1,1,1 1,1,9,1 4th line problem persists. What's happening now is when I run the awk I got a header printed at shell. – bmelo Aug 19 '22 at 01:40
  • @bmelo, FIXED it :) Please try my 2nd code and let me know how it goes, worked fine for me when I tested it – RavinderSingh13 Aug 19 '22 at 01:53
  • PERFECT! Thanks a lot. Did you try in another sample other than my pastebin? – bmelo Aug 19 '22 at 01:55
  • @bmelo, just added 2 more details(minor ones) in your question(which is a GREAT question). Cheers and happy learning. – RavinderSingh13 Aug 19 '22 at 02:00

Here is my take on the problem, without any sorting needed. On top of that, we preserve the header and assume, in this example, a maximum of 42 lines per file. In the end the files follow the naming convention "filename.ddddd.csv" and each contains the original header:

awk -v l=42 -v fname="/path/to/output/filename"                \
    '(NR==1){header=$0;next}
     { key = ($0 ~ /[^0-9,]/ ? $0 : "num") }
     {c=a[key]++}
     (c%l==0) {
        close(filename[key]); 
        filename[key]=sprintf("%s.%0.5d.csv",fname,++file_count)
        print header > filename[key]
     }
     {print > filename[key]}' file.csv

This works in the following way:

  • (NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
  • { key = ($0 ~ /[^0-9,]/ ? $0 : "num") }: define a key. If the current line contains only numbers and commas, define the key to be num, otherwise it is the line itself.
  • {c=a[key]++}: keep track of how many times we encountered the current key. We do this by storing the count in the associative array a indexed by key. The value c always holds the count before the increment (i.e. the current count minus 1).
  • (c%l==0) {...}: Every time we have written l=42 records/lines for a given key, we need to start writing to a new file. This happens every time the per-key count modulo l hits 0. When such a line is found we do:
    • close(filename[key]): close the file you write the current line to. This filename is tracked in the associative array filename indexed by key.
    • filename[key]=sprintf("%s.%0.5d.csv",fname,++file_count): define the new filename as fname.00XXX.csv, incrementing a global file counter.
    • print header > filename[key]: open the file and write the header to that file.
  • {print > filename[key]}: write the entries to the file.
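As a usage sketch (hypothetical output path; limit lowered to l=3 so the question's sample, saved with a header line as file.csv, splits visibly), the command above would distribute the lines as follows:

filename.00001.csv: header + 1,1,1,1 / 2,2,2,2 / 3,3,3,3
filename.00002.csv: header + the first three x,x,x,x lines
filename.00003.csv: header + the fourth x,x,x,x line
filename.00004.csv: header + the three y,y,y,y lines
filename.00005.csv: header + 4,4,4,4 / 5,5,5,5

Note that the num key keeps counting across files: when 4,4,4,4 arrives, the count for num hits the limit, so the remaining number lines resume in a fresh file (00005) rather than reopening 00001.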
kvantour
  • @kvantour you are not grouping the equals lines in different file – bmelo Aug 19 '22 at 00:59
  • @bmelo AFAIK the split happens based on your description in the OP. _For the equal lines containing letters I need them to have their own separate file. So one file only for x,x,x,x lines, other for y,y,y,y lines and so on_. This indicates that lines with numbers are combined in one file, while lines with letters go in separate files. This distinction is done by the `key`. – kvantour Aug 19 '22 at 06:29
awk -F, -v limit=3 '
    BEGIN{i=1}
    NR==1{
        header=$0                                       # save the header
        next                                            # go to next line
    }
    FNR==NR{                                            # process letters-lines
        if(f!=$0) print header > ("tmp/file_" $1)       # print header once per group
        f=$0                                            # save line
        print $0 > ("tmp/file_" $1)                     # print line to file
        next                                            # go to next line
    }
    {                                                   # process numbers-lines
        if (x!=i) print header > ("tmp/file_" i)        # print header once per chunk
        x=i                                             # save chunk number
        print $0 > ("tmp/file_" i)                      # print line to file
    }
    FNR % limit == 0{                                   # check limit 
        i++
    }
' <(head -n 1 split.csv;                                # getting the header
    grep "^[a-Z]" <(sed '1d' split.csv)|sort            # getting sorted letters-lines
   ) \
  <(grep "^[^a-Z]" <(sed '1d' split.csv))               # getting numbers-lines


$ head tmp/*
==> tmp/file_1 <==
header
1,1,1,1
2,2,2,2
3,3,3,3

==> tmp/file_2 <==
header
4,4,4,4
5,5,5,5

==> tmp/file_x <==
header
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x

==> tmp/file_y <==
header
y,y,y,y
y,y,y,y
y,y,y,y
ufopilot
  • you are not grouping the equals lines in different file – bmelo Aug 19 '22 at 00:59
  • @bmelo This answer does exactly what you describe in the OP. The output provided in this answer satisfies your condition. If this is not true, please update the question to contain input and expected output. – kvantour Aug 19 '22 at 06:30
  • @bmelo as kvantour said.. This answer does what you described.. What did I misunderstand? – ufopilot Aug 19 '22 at 06:38