Interleave lines sorted by a column

Question

(Similar to How to interleave lines from two text files but for a single input. Also similar to Sort lines by group and column but interleaving or randomizing versus sorting.)

I have a set of systems and tasks in two columns, SYSTEM,TASK:

alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850

I want to distribute the tasks to each system in a balanced way. The ideal case where each system has the same number of tasks would be round-robin, one alpha then one beta then one gamma and repeat until finished.

I get the whole list of tasks + systems at once, so I don't need to keep any state
The list of systems is not static, on the order of N=100
The total number of tasks is variable, on the order of N=500
The number of tasks for each system is not guaranteed to be equal
Hard / absolute interleaving isn't required, as long as there aren't two of the same system twice in a row
The same task may show up more than once, but not for the same system
Input format / delimiter can be changed

I can solve this well enough with some fancy scripting to split the data into multiple files (grep ^alpha, input > alpha.txt etc) and then recombine them with paste or similar, but I'd like to use a single command or set of pipes to run it without intermediate files or a proper scripting language. Just using sort -R gets me 95% of the way there, but I end up with 2 tasks for the same system in a row almost every time, and sometimes 3 or more depending on the initial distribution.

edit: To clarify, any output should not have the same system on two lines in a row. All system,task pairs must be preserved, you can't move a task from one system to another - that'd make this really easy!

One of several possible sample outputs:

beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909

You can always use process substitution `paste ... <(grep '^alpha' input) <(grep...)` :) — PesaThe, Jan 24 '19 at 20:23
Can you please post a sample of the desired output? It's quite unclear to me how, with interleaving, two tasks would end up on the same line ... — tink, Jan 24 '19 at 20:29
Define "proper" scripting language. To me, awk is such, as is bash. — Tanktalus, Jan 24 '19 at 20:34
Hopefully clarified the problem definition and added output sample. I can't do process substitution because of bullet #2. I think @tan — Maelstrom, Jan 24 '19 at 20:59
took too long to edit... @Tanktalus proper is in the eye of the beholder obviously, but I think of bash as glue for stuff written in "proper" languages. My view is that python is proper, awk is borderline, bash isn't. — Maelstrom, Jan 24 '19 at 21:05
`I can solve this well enough with some fancy scripting to split the data into multiple files` - show it. I can be 90% sure that someone can easily simply convert the "fancy scripting" you did into a bit more "fancy oneliner". — KamilCuk, Jan 24 '19 at 21:12
I had a hard time understanding this question until I replaced the wording *"in a row"* with *"consecutive"* and *"interleave"* with *"reorder"*. My summary of this **question:** How to reorder lines such that all pairs of consecutive lines have unique first fields. **Example:** (`/` is newline) For the input `a,1 / a,2 / b,3` we can use the order `a,1 / b,3 / a,2`. **Hint:** Sometimes there is no solution, for instance for `a,1 / a,2`. **Still unclear:** Is the order `a,2 / b,3 / a,1` allowed? — Socowi, Jan 24 '19 at 21:27
@Socowi correct, both your understanding and your "Sometimes". Order doesn't matter as long as consecutive lines are unique, so `Still unclear` is valid as well. — Maelstrom, Jan 24 '19 at 21:32
With input: `a,1 / b,1` is the output `a,1 / b,1` allowed, or there is nothing allowed? Same task may be exists for different systems, but are consecutive lines with same task and different systems allowed in the output? — KamilCuk, Jan 24 '19 at 21:36
Any ordering is allowed, `a,1 / b,1` or `b,1 / a,1` - that's what I was trying to say with `The same task may show up more than once, but not for the same system`. There won't ever be `a,1 / a,1` in the input. — Maelstrom, Jan 24 '19 at 21:38

Socowi · Accepted Answer · 2019-01-25T15:28:54.183

We start with by answering the underlying theoretical problem. The problem is not as simple as it seems. Feel free to implement a script based on this answer.

The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.

Theoretical Problem

Given a finite set of letters L with frequencies f : L→ℕ₀, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.

Example

L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1

ababaca, acababa, and abacaba are all valid solutions.
aaaabbc is invalid – Some adjacent elements are equal, for instance aa or bb.
ababac is invalid – The letter a appears 3 times, but its frequency is f(a)=4
cababac is invalid – The letter c appears 2 times, but its frequency is f(c)=1

Solution

The following approach produces a valid sequence if and only if there exists a solution.

Sort the letters by their frequencies.
For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
Note: There exists a solution if and only if f(a) ≤ 1 + ∑_ℓ≠a f(ℓ).

Write down a sequence s of f(a) many a.

Add the remaining letters into a FIFO working list, that is:

(Don't add any a)

First add f(b) many b

Then f(c) many c

and so on

Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.

Example

L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2

The letters are already sorted by their frequencies.
s = aaaaa
workinglist = bbbbbccccdd. The leftmost entry is the first one.
We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
- s = a_a_a_a_a_ workinglist = bbbbbccccdd
  s = aba_a_a_a_ workinglist = bbbbccccdd
  s = ababa_a_a_ workinglist = bbbccccdd
  ...
  s = ababababab workinglist = ccccdd
  ⚠️ We reached the end of sequence s. We repeat step 4.
- s = a_b_a_b_a_b_a_b_a_b_ workinglist = ccccdd
  s = acb_a_b_a_b_a_b_a_b_ workinglist = cccdd
  ...
  s = acbcacb_a_b_a_b_a_b_ workinglist = cdd
  s = acbcacbca_b_a_b_a_b_ workinglist = dd
  s = acbcacbcadb_a_b_a_b_ workinglist = d
  s = acbcacbcadbda_b_a_b_ workinglist =
  ⚠️ The working list is empty. We stop.
The final sequence is acbcacbcadbdabab.

Implementation In Bash

Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list each line is labeled with a binary floating point number specifying the position of that line in the final sequence. Then the lines are sorted by their labels. That way we don't have to use explicit loops. Intermediate results are stored in variables. No files are created.

#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"

inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"

maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"

seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
        awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
        sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"

paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6

This solution could fail for very long input files due to the limited precision of floating point numbers in seq and awk.

KamilCuk · Answer 2 · 2019-01-24T21:56:33.160

Well, this is what I've come up with:

args=()
while IFS=' ' read -r _ name; do
    # add a file redirection with grepped certain SYSTEM only for later eval
    args+=("<(grep '^$name,' file)")
done < <(
    # extract SYSTEM only
    <file cut -d, -f1 | 
    #sort with the count
    sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[@]}" |
# paste will insert empty lines when the list ended - remove them
sed '/^$/d'

First, I extract and sort the SYSTEM names in the order which occurs the most often to be first. So for the input example we get:

4 beta
3 alpha
1 gamme

Then for each such name I add the proper string <(grep '...' file) to arguments list witch will be later evalulated.

Then I evalulate the call to paste <(grep ...) <(grep ...) <(grep ...) ... with newline as the paste's delimeter. I remove empty lines with simple sed call.

The output for the input provided:

beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731

Converted to a fancy oneliner, with substituting the while read with command substitution and sed. Got safe with inputfile naming with printf "%q" "$inputfile" and double quoting inside sed regex.

inputfile="file"
fieldsep=","
eval paste -d '"\\n"' "$(
    cut -d "$fieldsep" -f1 "$inputfile" | 
    sort | uniq -c | sort -nr |
    sed 's/^[[:space:]]*[0-9]\+[[:space:]]*\(.*\)$/<(grep '\''^\1'"$fieldsep"\'' "'"$(printf "%q" "$inputfile")"'")/' |
    tr '\n' ' '
)" | 
sed '/^$/d'

This may fail in some cases even if a solution exists. Example: For the input `a,1 / c,1 / b,1 / a,2 /a,3` this solution will create the exact same invalid order. The order `a,1 / c,1 / a,2 / b,1 /a,3` would be a possible solution. Maybe the underlying theoretical problem should be asked on https://cs.stackexchange.com/. — Socowi, Jan 24 '19 at 21:56
@Socowi Good point, agreed. It would fail on many such cases. So probably a simple state machine that remembers the last outputted system must be implemented. Output the most occurring system -> decrement it's count -> output the most occurring system different from the last -> decs it's count -> repeat. That could work. — KamilCuk, Jan 24 '19 at 21:59
Such state machine feels like way easier to implement in C/C++/python, then in bash... — KamilCuk, Jan 24 '19 at 22:08

KamilCuk · Answer 3 · 2019-01-24T23:31:01.877

inputfile="inputfile"
fieldsep=","

# remember SYSTEMS with it's occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)

# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
    # get the most occurrented system with it's count from counts
    IFS=' ' read -r cnt sys < <(
        # if lastsys is empty, don't do anything, if not, filter it out
        if [ -n "$lastsys" ]; then 
            grep -v " $lastsys$";
        else
           cat;
        # ha suprise - counts is here!
        # probably would be way more readable with just `printf "%s" "$counts" |`
        fi <<<"$counts" | 
        # with the most occurence
        sort -n | tail -n1
    )

    if [ -z "$cnt" ]; then
        echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
        exit 1
    fi

    # update counts - decrement the count of this system, or remove it if count is 1
    counts=$(
        # remove current system from counts
        <<<"$counts" grep -v " $sys$"
        # if the count of the system is 1, don't add it back - it's count is now 0
        if ((cnt > 1)); then
            # decrement count and add the line with system to counts
            printf "%s" "$((cnt - 1)) $sys"
        fi
    )

    # finally print output
    printf "%s\n" "$sys"
    # and remember last system
    lastsys="$sys"
done |
{
    # get system names only in `system` - using cached counts variable
    # for each system name open a grep for that name from the input file 
    # with asigned file descritpro
    # The file descriptor list is saved in an array `fds`
    fds=()
    systems=""
    while IFS=' ' read -r _ sys; do
        exec {fd}< <(grep "^$sys," "$inputfile")
        fds+=("$fd")
        systems+="$sys"$'\n'
    done <<<"$counts"

    # for each line in input
    while IFS='' read -r sys; do

        # get the position inside systems list of that system decremented by 1
        # this will be the underlying filesystem for filtering that system out of input
        fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
        fds_idx=$((fds_idx - 1))

        # read one line from that file descriptor
        # I wonder is `sed 1p` would be faster
        IFS='' read -r -u "${fds[$fds_idx]}" line

        # output that line
        printf "%s\n" "$line"
    done
}

To accommodate for strange input values this script implements somewhat simple but hardy in bash statemachine.

The variable counts stores SYSTEM names with their're occurrence count. So from the example input it will be

4 alpha
3 beta
1 gamma

Now - we output the SYSTEM name with the biggest occurrence count that is also different from the last outputted SYSTEM name. We decrement it's occurrence count. If the count is equal to zero, it is removed from the list. We remember the last outputted SYSTEM name. We repeat this process until all occurrence counts reach zero, so the list is empty. For the example input this will output:

beta
alpha
beta
alpha
beta
alpha
beta
gamma

Now, we need to join that list with the job names. We can't use join as the input is not sorted and we don't want to change the ordering. So what I do, I get only SYSTEM names in system. Then for each system I open a different file descriptor with filtered only that SYSTEM name from the input file. All the file descriptors are stored in an array. Then for each SYSTEM name from the input, I find the file descriptor that filters that SYSTEM name from the input file and read exactly one line from the file descriptor. This works like an array of file positions each file position associated / filtering specified SYSTEM name.

beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850

The script was done so for the input in the form of:

alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850

the script outputs correctly:

alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438

I think this algorithm will mostly always print correct output, but the ordering is so that the least common SYSTEMs will be outputted last, which may be not optimal.

Tested manually with some custom tests and checker on paiza.io.

inputfile="inputfile"

in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' ) 
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF

sed -i -e '/^$/d' "$inputfile"

inputfile="inputfile"
fieldsep=","

# remember SYSTEMS with it's occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)

# I think this holds true
# The SYSTEM with the most count should be lower than the sum of all others

# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
    # get the most occurrented system with it's count from counts
    IFS=' ' read -r cnt sys < <(
        # if lastsys is empty, don't do anything, if not, filter it out
        if [ -n "$lastsys" ]; then 
            grep -v " $lastsys$";
        else
           cat;
        # ha suprise - counts is here!
        # probably would be way more readable with just `printf "%s" "$counts" |`
        fi <<<"$counts" | 
        # with the most occurence
        sort -n | tail -n1
    )

    if [ -z "$cnt" ]; then
        echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
        exit 1
    fi

    # update counts - decrement the count of this system, or remove it if count is 1
    counts=$(
        # remove current system from counts
        <<<"$counts" grep -v " $sys$"
        # if the count of the system is 1, don't add it back - it's count is now 0
        if ((cnt > 1)); then
            # decrement count and add the line with system to counts
            printf "%s" "$((cnt - 1)) $sys"
        fi
    )

    # finally print output
    printf "%s\n" "$sys"
    # and remember last system
    lastsys="$sys"
done |
{
    # get system names only in `system` - using cached counts variable
    # for each system name open a grep for that name from the input file 
    # with asigned file descritpro
    # The file descriptor list is saved in an array `fds`
    fds=()
    systems=""
    while IFS=' ' read -r _ sys; do
        exec {fd}< <(grep "^$sys," "$inputfile")
        fds+=("$fd")
        systems+="$sys"$'\n'
    done <<<"$counts"

    # for each line in input
    while IFS='' read -r sys; do

        # get the position inside systems list of that system decremented by 1
        # this will be the underlying filesystem for filtering that system out of input
        fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
        fds_idx=$((fds_idx - 1))

        # read one line from that file descriptor
        # I wonder is `sed 1p` would be faster
        IFS='' read -r -u "${fds[$fds_idx]}" line

        # output that line
        printf "%s\n" "$line"
    done
} |
{
    # check if the output is correct
    output=$(cat)

    # output should have same lines as inputfile
    if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
        echo "Output does not match input!" >&2
        exit 1
    fi

    # two consecutive lines can't have the same system
    lastsys=""
    <<<"$output" cut -d, -f1 |
    while IFS= read -r sys; do
        if [ -n "$lastsys" -a "$lastsys" = "$sys" ]; then
            echo "Same systems found on two consecutive lines!" >&2
            exit 1
        fi
        lastsys="$sys"
    done

    # all ok
    echo "all ok!"
    echo -------------
    printf "%s\n" "$output"
}

exit

Interleave lines sorted by a column

3 Answers3

Theoretical Problem

Example

Solution

Example

Implementation In Bash