
How to split a file by percentage of its number of lines?

Let's say I want to split my file into 3 portions (60%/20%/20% parts). I could do this manually, -_- :

$ wc -l brown.txt 
57339 brown.txt

$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
$ bc <<< "34398 + 11466 + 11475"
57339

$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
   34398 part1.txt
   11466 part2.txt
   11475 part3.txt
   57339 total

But I'm sure there's a better way!

alvas
    Could you please elaborate on the requirement for _credible and/or official sources_? Why is the high-quality answer you’ve already received not enough? – Dario Feb 08 '17 at 09:54
  • Wrong bounty message, should have been "looking to draw attention" – alvas Feb 08 '17 at 10:00
  • Do the percentages have to be absolutely precise and am I correct in assuming you have a large number of lines? – Mark Setchell Feb 08 '17 at 19:33
  • @edmorton, there's nothing wrong with your answer. It's great, but it'd be nice to see different approaches and whether there's a better one. – alvas Feb 09 '17 at 00:43
  • @marksetchell, it has to be as precise as possible, but it's acceptable if 1-2 lines at the end drop out because of float rounding. Yes, my actual data does have a large number of lines, in the millions. – alvas Feb 09 '17 at 00:44
  • Are you limited to bash, awk, sed, split utils? – TJR Feb 11 '17 at 17:31
  • @TJR As long as it doesn't need compilation and can easily be run in a Unix shell, it should be good. – alvas Feb 11 '17 at 23:08

6 Answers


There is a utility that takes as arguments the line numbers that should become the first line of each new file: csplit. The script below is a wrapper around its POSIX version:

#!/bin/bash

usage () {
    printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}

# Collect csplit options
while getopts "ksf:n:" opt; do
    case "$opt" in
        k|s) args+=(-"$opt") ;;           # k: no remove on error, s: silent
        f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
        *) usage; exit 1 ;;
    esac
done
shift $(( OPTIND - 1 ))

fname=$1
shift
ratios=("$@")

len=$(wc -l < "$fname")

# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[@]}"; do
    (( total += ratio ))
    cumsums+=("$total")
done

# Don't need the last element (quote the subscript so it isn't glob-expanded;
# negative array indices need Bash 4.3+)
unset 'cumsums[-1]'

# Array of numbers of first line in each split file
for sum in "${cumsums[@]}"; do
    linenums+=( $(( sum * len / total + 1 )) )
done

csplit "${args[@]}" "$fname" "${linenums[@]}"

After the name of the file to split up, it takes the ratios for the sizes of the split files relative to their sum, i.e.,

percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1

are all equivalent.
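
For example, with 60 20 20 and the 57339-line file from the question, the cumulative sums are 60 and 80 (the trailing 100 is dropped), so the computed first-line numbers are 60 * 57339 / 100 + 1 = 34404 and 80 * 57339 / 100 + 1 = 45872 in integer arithmetic; csplit starts the second and third files at those lines.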

Usage similar to the case in the question is as follows:

$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
 34403 part0
 11468 part1
 11468 part2
 57339 total

Numbering starts with zero, though, and there is no txt extension. The GNU version supports a --suffix-format option that would allow for .txt extension and which could be added to the accepted arguments, but that would require something more elaborate than getopts to parse them.
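
For example, GNU csplit's -b option takes a printf-style format, so one could get numbered .txt files directly without the wrapper (a sketch, assuming GNU coreutils, using the split points computed above):

$ csplit -s -f part -b '%d.txt' brown.txt 34404 45872
$ wc -l part*.txt
 34403 part0.txt
 11468 part1.txt
 11468 part2.txt
 57339 total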

This solution plays nicely with very short files (splitting a file of two lines into two parts), and the heavy lifting is done by csplit itself.
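
For instance (a hypothetical session with a two-line file, here called pair.txt):

$ printf 'a\nb\n' > pair.txt
$ percsplit -s -f part -n 1 pair.txt 50 50
$ wc -l part*
 1 part0
 1 part1
 2 total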

Benjamin W.
  • kudos for mentioning `csplit` and for using `getopts` (which, I think, is the least appreciated builtin of all bash) – Dario Feb 09 '17 at 09:21
  • @BenjaminW Thanks for the answer! Don't mind if I give you the checkmark and EdMorton the bounty since he answered first with more votes but I like your solution better =) – alvas Feb 14 '17 at 07:23
$ cat file
a
b
c
d
e

$ cat tst.awk
BEGIN {
    split(pcts,p)
    nrs[1]
    for (i=1; i in p; i++) {
        pct += p[i]
        nrs[int(size * pct / 100) + 1]
    }
}
NR in nrs{ close(out); out = "part" ++fileNr ".txt" }
{ print $0 " > " out }

$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
a > part1.txt
b > part1.txt
c > part1.txt
d > part2.txt
e > part3.txt

Change the " > " to just > to actually write to the output files.
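
For example, after making that change, running the script against the 5-line sample file above should produce (hypothetical session):

$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
$ wc -l part*.txt
 3 part1.txt
 1 part2.txt
 1 part3.txt
 5 total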

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • What do `pct` and `nrs` mean? – hek2mgl Nov 04 '16 at 06:58
  • `pct` = percent. `nrs` = NRs = line/record numbers, the list of NRs where the output file number changes. – Ed Morton Nov 04 '16 at 07:22
  • Nice and short. Just has some problems with small percentages/files. Consider a file with 2 lines and `pct="10 90"`. The script will write both lines into `part1.txt`. – Socowi Feb 08 '17 at 21:07

Usage

The following bash script allows you to specify the percentages like

./split.sh brown.txt 60 20 20

You can also use the placeholder . which fills the remaining percentage up to 100%.

./split.sh brown.txt 60 20 .

The split files are written to

part1-brown.txt
part2-brown.txt
part3-brown.txt

The script always generates as many part files as percentages specified. If the percentages sum to 100, cat part* will always reproduce the original file (no duplicated or missing lines).
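
For example, that property can be checked like this (a hypothetical session, using brown.txt from the question):

$ ./split.sh brown.txt 60 20 20
$ cat part1-brown.txt part2-brown.txt part3-brown.txt | cmp - brown.txt && echo identical
identical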

Bash Script: split.sh

#! /bin/bash

file="$1"
fileLength=$(wc -l < "$file")
shift

part=1
percentSum=0
currentLine=1
for percent in "$@"; do
        [ "$percent" == "." ] && ((percent = 100 - percentSum)) 
        ((percentSum += percent))
        if ((percent < 0 || percentSum > 100)); then
                echo "invalid percentage" 1>&2
                exit 1
        fi
        ((nextLine = fileLength * percentSum / 100))
        if ((nextLine < currentLine)); then
                printf "" # create empty file
        else
                sed -n "$currentLine,$nextLine"p "$file"
        fi > "part$part-$file"
        ((currentLine = nextLine + 1))
        ((part++))
done
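
A hypothetical run with the placeholder on the question's file; the line counts follow from the script's integer arithmetic:

$ ./split.sh brown.txt 60 20 .
$ wc -l part*-brown.txt
 34403 part1-brown.txt
 11468 part2-brown.txt
 11468 part3-brown.txt
 57339 total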
Socowi
BEGIN {
    # iterate by index: "for (i in ...)" order is unspecified in awk,
    # which would break the cumulative sum
    n = split(w, weight)
    total = 0
    for (i = 1; i <= n; i++) {
        weight[i] += total
        total = weight[i]
    }
}
FNR == 1 {
    if (NR!=1) {
        write_partitioned_files(weight,a)
        split("",a,":") #empty a portably
    }
    name=FILENAME
}
{a[FNR]=$0}
END {
    write_partitioned_files(weight,a)
}
function write_partitioned_files(weight, a) {
    split("", threshold, ":")
    size = length(a)
    # again iterate by index, since "for (i in ...)" order is unspecified
    for (i = 1; i <= n; i++) {
        threshold[i] = int((size * weight[i] / total) + 0.5) + 1
    }
    l = 1
    part = 0
    for (i = 1; i <= n; i++) {
        close(out)
        out = name ".part" ++part
        for (; l < threshold[i]; l++) {
            print a[l] " > " out
        }
    }
}

Invoke as:

awk -v w="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ...

Replace " > " with > in script to actually write partitioned files.

The variable w expects space separated numbers. Files are partitioned in that proportion. For example "2 1 1 3" will partition files into four with number of lines in proportion of 2:1:1:3. Any sequence of numbers adding up to 100 can be used as percentages.
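
For example (a hypothetical run on a 100-line file named data.txt):

awk -v w="2 1 1 3" -f above_script.awk data.txt

With total = 7 and cumulative weights 2, 3, 4, 7, the thresholds land on lines 30, 44, 58, and 101, so data.txt.part1 through data.txt.part4 receive 29, 14, 14, and 43 lines respectively.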

For large files the array a may consume too much memory. If that is an issue, here is an alternative awk script:

BEGIN {
    n = split(w, weight)
    for (i = 1; i <= n; i++) {
        total += weight[i]; weight[i] = total #cumulative sum
    }
}
FNR == 1 {
    #get number of lines. take care of single quotes in filename.
    name = gensub("'", "'\"'\"'", "g", FILENAME)
    "wc -l '" name "'" | getline size

    split("", threshold, ":")
    for (i = 1; i <= n; i++) {
        threshold[i] = int((size * weight[i] / total) + 0.5) + 1
    }

    part=1; close(out); out = FILENAME ".part" part
}
{
    if(FNR>=threshold[part]) {
        close(out); out = FILENAME ".part" ++part
    }
    print $0 " > " out 
}

This passes through each file twice: once to count lines (via wc -l), and a second time while writing the partitioned files. Invocation and effect are similar to the first method. Note that gensub() is specific to GNU awk, so this version requires gawk.

pii_ke

I like Benjamin W.'s csplit solution, but it's so long...

#!/bin/bash
# usage: ./splitpercs.sh file 60 20 20
n=$(wc -l < "$1") || exit 1
echo "$*" | tr ' ' '\n' | tail -n +2 | head -n "$(($# - 1))" |
  awk -v n="$n" 'BEGIN{r=1} {r+=n*$0/100; if(r > 1 && r < n){printf "%d\n",r}}' |
  uniq | xargs csplit -sfpart "$1"

(the if(r > 1 && r < n) and uniq bits are to prevent creating empty files or strange behavior for small percentages, files with small numbers of lines, or percentages that add to over 100.)
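
A hypothetical run on the question's file (csplit's default two-digit suffixes apply):

$ ./splitpercs.sh brown.txt 60 20 20
$ wc -l part*
 34403 part00
 11468 part01
 11468 part02
 57339 total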

webb

I just followed your lead and made what you do manually into a script. It may not be the fastest or "best", but if you understand what you are doing now and can just "scriptify" it, you may be better off should you need to maintain it.

#!/bin/bash

#  thisScript.sh  yourfile.txt  20 50 10 20

YOURFILE=$1
shift

# changed to cat | wc so I don't have to remove the filename which comes from
# wc -l
LINES=$(cat "$YOURFILE" | wc -l)

startpct=0;
PART=1;
for pct in "$@"
do
  # I am assuming that each parameter is on top of the last
  # so   10 30 10   would become 10, 10+30 = 40, 10+30+10 = 50, ...
  endpct=$( echo "$startpct + $pct" | bc)  

  # your math but changed parts of 100 instead of parts of 10.
  #  change bc <<< to echo "..." | bc 
  #  so that one can capture the output into a bash variable.
  FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
  LASTLINE=$( echo "$LINES * $endpct / 100" | bc )

  # use sed every time because the special case for head
  # doesn't really help performance.
  sed -n "$FIRSTLINE,${LASTLINE}p" "$YOURFILE" > part${PART}.txt
  ((PART++))
  startpct=$endpct
done

# get the rest if the percentages don't add up to 100%
if [[ $( echo "$startpct < 100" | bc ) -gt 0 ]] ; then
   FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
   sed -n "$FIRSTLINE,${LINES}p" "$YOURFILE" > part${PART}.txt
fi

wc -l part*.txt
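
A hypothetical run on the question's file would end with the script's own wc -l summary:

$ ./thisScript.sh brown.txt 60 20 20
 34403 part1.txt
 11468 part2.txt
 11468 part3.txt
 57339 total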
Mike Wodarczyk