0

I have a file which looks like:

  blah blah blah blah blah blah blah blah 
  blah blah blah blah blah blah blah blah 
  blah blah blah blah blah blah blah blah 
<empty line here>
     Total DOS and NOS and partial (IT) DOSUP
<empty line here>
     E     Total     1
<empty line here>
-1.5000    0.004    0.000    0.004
-1.4953    0.004    0.000    0.004
-1.4906    0.004    0.000    0.004
-1.4859    0.004    0.000    0.004
-1.4812    0.004    0.000    0.004
 0.3563    0.708    5.510    0.708
 0.3609    0.562    5.513    0.562
 0.3656    0.381    5.515    0.381
 0.3703    0.149    5.517    0.149
<empty line here>
     Sublattice  1 Atom Fe   spin DOWN   

What I want is to extract all lines between (first pattern)

     Total DOS and NOS and partial (IT) DOSUP     
<empty line here>    
     E     Total     1
<empty line here>

and (second pattern)

<empty line here>
     Sublattice  1 Atom Fe   spin DOWN   

i.e. I want to get

-1.5000    0.004    0.000    0.004
-1.4953    0.004    0.000    0.004
-1.4906    0.004    0.000    0.004
-1.4859    0.004    0.000    0.004
-1.4812    0.004    0.000    0.004
 0.3563    0.708    5.510    0.708
 0.3609    0.562    5.513    0.562
 0.3656    0.381    5.515    0.381
 0.3703    0.149    5.517    0.149

So, in short, I want the lines between two multiline patterns. As I understand it, awk can detect multiline patterns via a state machine (see here), but I could not get that to work in my case.

Any suggestions on how to resolve this problem would be very much appreciated.

glanz
  • second pattern can be reduced to `` – karakfa Aug 12 '16 at 13:51
  • `awk -v RS= 'NR==3' file` would print the 3rd blank-line-separated block of text and so produce the output you want. Any reason you can't just do that? – Ed Morton Aug 12 '16 at 14:16
  • @EdMorton Good one. I was making it too complicated... – hek2mgl Aug 12 '16 at 14:22
  • @EdMorton That's perfectly fine, but the block of text I'm looking for is buried in a huge text file and can be identified only by the line "Total DOS and NOS and partial (IT) DOSUP ...". The line "E Total 1" is not unique and cannot be used. – glanz Aug 12 '16 at 14:40
  • So you want the 2nd block after the block that contains `Total DOS and NOS and partial (IT) DOSUP`? That'd just be `awk -v RS= '/Total DOS and NOS and partial \(IT\) DOSUP/{tgt=NR+2} NR==tgt' file`. Is that it or do you really need a multi-line block to match? – Ed Morton Aug 12 '16 at 14:45
  • @EdMorton ... sigh, you did it again. Why do I even try to answer questions when you're awake? :-D – ghoti Aug 12 '16 at 14:54

3 Answers

2

Here's a solution based on Ed Morton's trick.

awk -v RS= 'n==2; /Total DOS/ || n {n++;next} {n=0}' input.txt

Here's how this works.

  • RS= puts awk into multi-line mode, so that records contain blocks of lines.
  • n==2; prints any record processed while this condition is met.
  • /RE/ || n is a condition that evaluates to true if EITHER the RE (pattern) is matched within the current record or the variable n is non-zero.
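  • {n++;next} increments n and skips to the next record.
  • {n=0} resets n for any record that gets this far. Because the previous rule runs next whenever n is non-zero, this only happens while n is already 0, so the block is harmless but redundant.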

The effect of all this is that we print the record that is two records after the one with the matched pattern. You could of course adjust the condition that begins the counter to whatever you like. $2=="Total" for example. Salt to taste.
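As a hedged variation on the command above: if the header can occur more than once in the file and every following data block is wanted, the counter has to be reset after printing. This is a minimal sketch, assuming the same blank-line-separated layout. The file name demo.txt and its contents are made up for illustration.

```shell
# Build a small hypothetical input containing two header/data groups.
printf '%s\n' \
  'junk' '' \
  '     Total DOS and NOS and partial (IT) DOSUP' '' \
  '     E     Total     1' '' \
  '-1.5000    0.004' \
  '-1.4953    0.004' '' \
  'junk' '' \
  '     Total DOS and NOS and partial (IT) DOSUP' '' \
  '     E     Total     1' '' \
  ' 0.3656    0.381' \
  ' 0.3703    0.149' '' \
  'junk' > demo.txt

# n counts records since the header block; the 2nd record after it is printed,
# then n is reset so that a later header starts a fresh count.
result=$(awk -v RS= '
    n == 2      { print; n = 0; next }   # target block: print it and start over
    /Total DOS/ { n = 1; next }          # header block seen: start counting
    n           { n++ }                  # block between header and target
' demo.txt)
printf '%s\n' "$result"
```

With this layout both data blocks are printed back to back, while the junk blocks and the header blocks are skipped.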

sh-3.2$ cat input.txt
  blah blah blah blah blah blah blah blah
  blah blah blah blah blah blah blah blah
  blah blah blah blah blah blah blah blah

     Total DOS and NOS and partial (IT) DOSUP

     E     Total     1

  -1.5000    0.004    0.000    0.004
  -1.4953    0.004    0.000    0.004
  -1.4906    0.004    0.000    0.004
  .......    .....    .....    .....
   0.3609    0.562    5.513    0.562
   0.3656    0.381    5.515    0.381
   0.3703    0.149    5.517    0.149

   blah      blah     blah     blah

sh-3.2$ awk -v RS=  'n==2; /Total DOS and NOS/||n{n++;next} {n=0}' input.txt
  -1.5000    0.004    0.000    0.004
  -1.4953    0.004    0.000    0.004
  -1.4906    0.004    0.000    0.004
  .......    .....    .....    .....
   0.3609    0.562    5.513    0.562
   0.3656    0.381    5.515    0.381
   0.3703    0.149    5.517    0.149
ghoti
  • @glanz - can you clarify? For me, given the input data in your question, this produced the output you mentioned under "I want to get". Seven lines, two blocks of three lines with four columns, separated by the line with dots. Nothing else. Is it possible that your actual data has TWO blank lines after the pattern, rather than just one? – ghoti Aug 12 '16 at 15:05
  • I think the `{n=0}` block will only get hit when `n` is already `0` so you could remove it, or come up with some other logic if you're trying to reset it after the first target block is printed. – Ed Morton Aug 12 '16 at 15:22
  • @ghoti @Ed - Your answer is totally correct and nicely explained. I finally realized why it didn't work with my original data. The problem was (and is) that one of the empty lines after `Total DOS...` contains a single _invisible_ space character, so `awk` does not treat it as an empty line. Thank you once again. – glanz Aug 12 '16 at 16:01
1

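Using sed: sed -n '9,/^$/{/^$/!p;}' file

But that assumes the data block always starts on the same line (line 9 in the sample input), i.e. that the "multiline starting pattern" sits at a fixed position at the beginning of the file. Otherwise it gets a bit more complicated. Like this: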

# Run with: sed -n -f script.sed file   (script.sed is a placeholder name)
/Total/{N;N;N;}
/Total.*Total/,/^$/{
    /Total/d
    /^$/d
    p
}

Here I am assuming that 'Total' matches the beginning of the multiline pattern and 'Total.*Total' matches the whole pattern. Replace N;N;N with something more complex if there are other patterns that start with the first line of your multiline pattern but are shorter than 4 lines.
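One way to make the header match stricter, sketched here in awk rather than sed, is a small line-by-line state machine of the kind the question mentions. This is only an illustration under assumptions: the title string is passed in as a variable, the column line is recognised by its first two fields, and any line with no fields (empty or whitespace-only) counts as blank. The file name dos_demo.txt and the variable name title are hypothetical.

```shell
# Hypothetical sample; two of the "blank" lines contain a single space.
printf '%s\n' \
  '  blah blah blah' \
  ' ' \
  '     Total DOS and NOS and partial (IT) DOSUP' \
  ' ' \
  '     E     Total     1' \
  '' \
  '-1.5000    0.004    0.000    0.004' \
  ' 0.3703    0.149    5.517    0.149' \
  '' \
  '     Sublattice  1 Atom Fe   spin DOWN' > dos_demo.txt

# state 0: looking for the title line
# state 1: title seen, the next non-blank line must be the "E Total ..." line
# state 2: both header lines matched, waiting for the first data line
# state 3: inside the data block; the next blank line ends it
result=$(awk -v title='Total DOS and NOS and partial (IT) DOSUP' '
    !NF        { if (state == 3) exit; next }          # blank or whitespace-only
    state == 0 { if (index($0, title)) state = 1; next }
    state == 1 { state = ($1 == "E" && $2 == "Total") ? 2 : 0; next }
               { state = 3; print }                    # data line
' dos_demo.txt)
printf '%s\n' "$result"
```

Using index() instead of a regular expression avoids having to escape the parentheses in the title, and testing !NF means a stray space on a separator line does not break the match.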

aragaer
1

From your comments it sounds like all you need is:

awk -v RS= '/Total DOS/{tgt=NR+2} NR==tgt' file

If not then edit your question to clarify. Make it NR==tgt{print; exit} if you only want the first matching block in the file output and efficiency is a concern. Change the regexp if necessary to be as much of the Total DOS... line as you need to match to make it unique.

Here it is running against your provided sample input:

$ cat file
  blah blah blah blah blah blah blah blah
  blah blah blah blah blah blah blah blah
  blah blah blah blah blah blah blah blah

     Total DOS and NOS and partial (IT) DOSUP

     E     Total     1

  -1.5000    0.004    0.000    0.004
  -1.4953    0.004    0.000    0.004
  -1.4906    0.004    0.000    0.004
  .......    .....    .....    .....
   0.3609    0.562    5.513    0.562
   0.3656    0.381    5.515    0.381
   0.3703    0.149    5.517    0.149

   blah      blah     blah     blah

$ awk -v RS= '/Total DOS/{tgt=NR+2} NR==tgt' file
  -1.5000    0.004    0.000    0.004
  -1.4953    0.004    0.000    0.004
  -1.4906    0.004    0.000    0.004
  .......    .....    .....    .....
   0.3609    0.562    5.513    0.562
   0.3656    0.381    5.515    0.381
   0.3703    0.149    5.517    0.149
Ed Morton
  • Your answer is totally correct. The problem with my original data was (and is) that one of the empty lines after `Total DOS...` contains a single _invisible_ space character, so `awk` does not treat it as an empty line. – glanz Aug 12 '16 at 16:16
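For data like that, one possible workaround is to strip trailing whitespace before awk splits the input into blocks. This is a sketch, not part of the original answer, assuming a sed that supports the POSIX [[:space:]] class. The file name spaced.txt and its contents are invented for the demonstration.

```shell
# Hypothetical input: the line after the title holds a single space.
printf '%s\n' \
  'blah' '' \
  '     Total DOS and NOS and partial (IT) DOSUP' \
  ' ' \
  '     E     Total     1' \
  '' \
  '-1.5000    0.004' \
  ' 0.3703    0.149' \
  '' \
  'blah' > spaced.txt

# Without cleanup the " " line does not separate records, so the title and
# the column line end up in one block and NR+2 points at the wrong block.
raw=$(awk -v RS= '/Total DOS/{tgt=NR+2} NR==tgt' spaced.txt)

# Stripping trailing whitespace turns whitespace-only lines into truly
# empty ones, so paragraph mode sees the intended blocks again.
clean=$(sed 's/[[:space:]]*$//' spaced.txt |
        awk -v RS= '/Total DOS/{tgt=NR+2} NR==tgt')

printf 'raw:\n%s\nclean:\n%s\n' "$raw" "$clean"
```

The same sed step also removes trailing carriage returns, which can cause the same symptom in files that came from Windows.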