Sort lines between two patterns in file recursively

Question

I have a file with the following format:

3
Lattice="89.8218778092 0.0 0.0 0.0 15.8543061924 0.0 0.0 0.0 25.203816" Properties=id:I:1:species:S:1:
1 1 -0.00119157 -5.67557e-05 -1.49279e-04
2 3 0.00220589 -0.00133867 9.67397e-04
3 2 -5.43822e-04 -0.00119676 -8.99064e-05
3
Lattice="89.8218778092 0.0 0.0 0.0 15.8543061924 0.0 0.0 0.0 25.203816" Properties=id:I:1:species:S:1:
1 1 -0.00119157 -5.67557e-05 -1.49279e-04
2 3 0.00220589 -0.00133867 9.67397e-04
3 2 -5.43822e-04 -0.00119676 -8.99064e-05

I would like to sort the content of the file according to the second column without modifying header lines like the following, which should always stay in place:

3
Lattice="89.8218778092 0.0 0.0 0.0 15.8543061924 0.0 0.0 0.0 25.203816" Properties=id:I:1:species:S:1:

Expected output

3
Lattice="89.8218778092 0.0 0.0 0.0 15.8543061924 0.0 0.0 0.0 25.203816" Properties=id:I:1:species:S:1:
1 1 -0.00119157 -5.67557e-05 -1.49279e-04
3 2 -5.43822e-04 -0.00119676 -8.99064e-05
2 3 0.00220589 -0.00133867 9.67397e-04
3
Lattice="89.8218778092 0.0 0.0 0.0 15.8543061924 0.0 0.0 0.0 25.203816" Properties=id:I:1:species:S:1:
1 1 -0.00119157 -5.67557e-05 -1.49279e-04
3 2 -5.43822e-04 -0.00119676 -8.99064e-05
2 3 0.00220589 -0.00133867 9.67397e-04

I tried the approach from "Is there a way to ignore header lines in a UNIX sort?", but it didn't work out as expected.

I would like to do this in Bash.

Please do mention your sample of expected output in your question. Also keep the samples of input and output clear and short (in length) in your question. — RavinderSingh13, Nov 17 '20 at 17:45
Actually, you might want to change the title of your question to _Sort lines between two patterns in file_. Also, your question is very poorly defined. Your expected output is not sorted according to the first and second column at all. — kvantour, Nov 17 '20 at 17:53
The expected output looks like it's sorted by the *third* column. Are those lines from the entire input file, or just the lines between two "lattice" lines? — tripleee, Nov 17 '20 at 17:57
@tripleee yeah those sorted lines from entire set of 1080 lines after the header. I can't paste all those here. — abhijit dhakane, Nov 17 '20 at 18:02
@EdMorton there are 1080 lines between the pattern of "1080 'Lattice' ..." — abhijit dhakane, Nov 17 '20 at 18:04
That doesn't clarify much. Do you mean there are 1800 lines in each section and you want to sort each section on the third column (not the first and second)? — tripleee, Nov 17 '20 at 18:05
@tripleee there are 1080 lines after ``` 1080 Lattice ``` part and with repeating same pattern. — abhijit dhakane, Nov 17 '20 at 18:21
It doesn't matter how many lines are present in your real data, you need to come up with a [mcve] that represents your problem in a minimal way (e.g. 5 lines instead of 1800 lines) for us to be able to help you. See [ask]. — Ed Morton, Nov 17 '20 at 18:27
Are the "dots" `...` just a way to show that there is more data, or are they really there? — Riccardo Petraglia, Nov 17 '20 at 18:28
Is that `1080` the number of expected lines in the next chunk? — Riccardo Petraglia, Nov 17 '20 at 18:29
@RiccardoPetraglia yeah there is more data means "dots". Yes expected chunk is 1080 excluding `1080 Lattice="89.8218778092 0.0 0.0 0.0 15.8543061924 0.0 0.0 0.0 25.203816" Properties=id:I:1:species:S:1:` — abhijit dhakane, Nov 17 '20 at 18:36

tripleee · Answer 1 · 2020-11-18T10:29:34.897

This is moderately tricky in Bash or with traditional line-oriented Unix utilities, but almost easy in GNU Awk or a modern scripting language like Python.

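#!/usr/bin/env python3
import sys

section = []       # data lines of the section currently being collected
lattice = False    # True when the next line must be the Lattice="..." header

def sort_em(lines):
    # Sort on the third and fourth columns, compared as floats
    return ''.join(sorted(lines, key=lambda x: tuple(map(float, x.split()[2:4]))))

def print_em(*lines):
    # The lines already end in a newline, so suppress the one print would add
    print(*lines, end='')

for line in sys.stdin:
    if line.startswith('1080\n'):
        # Count line: flush the previous section, then pass the header through
        if section:
            print_em(sort_em(section))
            section = []
        lattice = True
        print_em(line)
    elif lattice:
        # The line right after the count must be the Lattice="..." header
        if not line.startswith('Lattice="'):
            raise ValueError('Expected Lattice="..." but got %s' % line)
        lattice = False
        print_em(line)
    else:
        section.append(line)
# Flush the final section, which has no count line after it
if section:
    print_em(sort_em(section))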

You would save this in a file in a directory on your PATH, and chmod a+x it. If you called it sortsections, you would run it like

sortsections <filename >newfile

to read the lines of filename from standard input and write them to newfile, sorted as per the requirements.

Demo: https://ideone.com/7RRvXQ

The tuple(map(float, ...)) expression extracts the fields we want to sort on, converts them all to float, and collects them into a tuple. (Slightly obscurely, map returns a lazy iterator, so we have to materialize the result by calling tuple() on it.) The print wrapper avoids having to repeat end='' every time we want to print something. (The lines we read each have a trailing newline already, but print without end='' would add another.)
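As a small illustration of that sort key, here is a sketch that applies it to the three sample data lines from the question. The name `sort_key` is introduced here for illustration only; it is the lambda from the script given a name.

```python
# The three sample data lines from the question, each with its trailing newline
lines = [
    "1 1 -0.00119157 -5.67557e-05 -1.49279e-04\n",
    "2 3 0.00220589 -0.00133867 9.67397e-04\n",
    "3 2 -5.43822e-04 -0.00119676 -8.99064e-05\n",
]

def sort_key(line):
    # Zero-based fields 2 and 3, i.e. the third and fourth columns, as floats
    return tuple(map(float, line.split()[2:4]))

for line in sorted(lines, key=sort_key):
    print(line, end='')   # the line already ends in a newline
```

Tuples compare element by element, so the fourth column only matters when two lines have equal values in the third column.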

This hard-codes 1080 as the marker of a new section; it would not be hard to change it to read the first line and then use that as the marker for all subsequent sections, and/or count that each section contains that many lines, and read a new count when you have consumed the number of lines indicated in each header section.
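A minimal sketch of that generalization, assuming every section consists of a line holding only the number of data lines, then exactly one Lattice="..." line, then exactly that many data lines. The function names `sort_key` and `sort_sections` are hypothetical, and the sort key is the same one the script above uses.

```python
from itertools import islice

def sort_key(line):
    # Same key as the script above: third and fourth columns, as floats
    return tuple(map(float, line.split()[2:4]))

def sort_sections(lines):
    """Yield every input line, with each section's data lines sorted.

    Assumed layout per section: a count line, one Lattice="..." line,
    then exactly `count` data lines.
    """
    it = iter(lines)
    for count_line in it:
        count = int(count_line)   # raises ValueError if this is not a count
        lattice = next(it, None)
        if lattice is None or not lattice.startswith('Lattice="'):
            raise ValueError('Expected Lattice="..." after %r' % count_line)
        section = list(islice(it, count))   # consume exactly `count` lines
        if len(section) != count:
            raise ValueError('Expected %d data lines but got %d'
                             % (count, len(section)))
        yield count_line
        yield lattice
        yield from sorted(section, key=sort_key)
```

To use it as a filter in place of the main loop above, one could call `sys.stdout.writelines(sort_sections(sys.stdin))` after `import sys`. Reading the count from each header also means sections of different sizes are handled, which the hard-coded marker cannot do.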

Is there any way to invoke sort after extracting pattern by awk recursively? — abhijit dhakane, Nov 17 '20 at 21:32
@RiccardoPetraglia It could be done in Bash (see other answer now) but one of the skills you need as a shell programmer is knowing when to shift to a language where the solution comes naturally. Traditionally you would have to learn `sed` and Awk as well as shell, but in this day and age, for this particular solution, I chose Python, which is very widely available on a large number of platforms. (Bonus: it runs on Windows too, if you are of a masochistic bent.) — tripleee, Nov 18 '20 at 05:19
@abhijitdhakane It's not impossible, but the problem with an Awk solution is that it would have to rely on an external `sort`. Python has `sorted` built in so the solution is obvious and straightforward. (The main wart is how I had to wrap `print` to avoid writing a second newline. I guess I could have used `write` instead.) — tripleee, Nov 18 '20 at 05:21
@tripleee did you see my solution below? It is probably slower than yours (since it creates many files) but I have used it many times in the past and it is quite fast to write... — Riccardo Petraglia, Nov 18 '20 at 10:22
It's not like I spent a lot of time on this one either. Maybe 5 minutes to write and 15 to polish it (took longer than usual because I renamed a variable but forgot to change it in one place). — tripleee, Nov 18 '20 at 10:25

Riccardo Petraglia · Answer 2 · 2020-11-17T18:54:41.080

The idea is to split the big file into many smaller files, each containing only one cell. Then you use the method you linked to sort the lines in each file the way you want. Finally, you concatenate the files with the sorted data using cat.

#!/usr/bin/env bash

nlines=$(head -n 1 "$1")             # Get the number of data lines in each cell
nlines=$((nlines + 2))               # Add the two header lines of each cell
split -l "$nlines" -a 5 "$1" cell-   # Split the file into one file per cell: cell-aaaaa, cell-aaaab, ...
for file in cell-*; do               # Sort each file individually, keeping its two header lines first
  (head -n 2 "$file" && tail -n +3 "$file" | sort -k 2,2n) > "sorted-$file"
done
cat sorted-cell-* > "$2"             # Concatenate all the sorted files in their original order
rm cell-* sorted-cell-*              # Remove the temporary files

Use this as:

script.sh <file_name> <new_file_name>

DISCLAIMER: I did not test this; try it in a clean folder with a copy of the original file. It generates many temporary files and removes them at the end.

If you provide a real example (on pastylink, for example), I can tweak the script further.

edited Nov 17 '20 at 18:54

answered Nov 17 '20 at 18:43

Riccardo Petraglia


@abhijit dhakane let me know if this worked... :) – Riccardo Petraglia Nov 18 '20 at 10:23
Probably [quote your variables](https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable); it would be silly for this to break just because the user passed in a file name which contains spaces or other shell metacharacters. – tripleee Nov 18 '20 at 10:34
A proper solution would use `mktemp -d` to create a directory for the temporary files, and a `trap` to remove the temporary directory in the case of an error or SIGTERM. – tripleee Nov 18 '20 at 10:35
@tripleee Yeah... the `mktemp` + `trap` is a very nice idea! Most of the time I write this kind of script for a single use, so I do not spend too much time improving them... but next time I will try your suggestions! tyty – Riccardo Petraglia Nov 18 '20 at 10:37
