cat grab.txt

My Dashboard
Fnfjfjf. random test
00:50

1:01:56
My Notes
No data found.

                                
Change Language                                                                                                                  + English                                                          

Submit


Estimation of Working Capital Lecture 1

Estimation of Working Capital Lecture 2

Estimation of Working Capital Lecture 3

Money Market Lecture 254

Money Market Lecture 255

Money Market Lecture 256

International Trade Lecture 257

International Trade Lecture 258

International Trade Lecture 259
Terms And Conditions
84749473837373
Random text fifjfofifofjfkfkf

I need to filter this text by doing the following:

  1. Delete all lines before the first occurrence of the word "Lecture"
  2. Delete all lines after the last occurrence of the word "Lecture"
  3. Remove all empty lines

Expected output

Estimation of Working Capital Lecture 1
Estimation of Working Capital Lecture 2
Estimation of Working Capital Lecture 3
Money Market Lecture 254
Money Market Lecture 255
Money Market Lecture 256
International Trade Lecture 257
International Trade Lecture 258
International Trade Lecture 259

What have I tried so far?

cat grab.txt | sed -r '/^\s*$/d; /Lecture/,$!d'

After some searching and trial and error, I can remove the empty lines and all lines before the first occurrence, but I cannot remove the lines after the last occurrence.

Note: even though my existing command uses sed, it's fine if the answer uses awk, perl or grep.

Sachin

4 Answers

Could you please try the following? Written and tested on the shown samples with GNU awk.

awk '
/Lecture/{
  found=1
}
found && NF{
  val=(val?val ORS:"")$0
}
END{
  if(val){
    match(val,/.*Lecture [0-9]+/)
    print substr(val,RSTART,RLENGTH)
  }
}'  Input_file

Explanation: a detailed, line-by-line explanation of the above.

awk '                                        ##Start of the awk program.
/Lecture/{                                   ##If the line contains the keyword Lecture, do the following.
  found=1                                    ##Set found to 1.
}
found && NF{                                 ##If found is set and the line has at least one field, do the following.
  val=(val?val ORS:"")$0                     ##Append the current line to val, separated by ORS.
}
END{                                         ##Start of the END block.
  if(val){                                   ##If val is set, do the following.
    match(val,/.*Lecture [0-9]+/)            ##Greedily match everything up to the last "Lecture <digits>" in val.
    print substr(val,RSTART,RLENGTH)         ##Print only the matched part of val.
  }
}' Input_file                                ##Name of the input file.
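
A possible streaming variant of the same idea, as a minimal sketch and not the answerer's code: instead of collecting everything in val and trimming it with match() in the END block, it holds back non-Lecture lines and prints them only once a later Lecture line shows up, so anything after the last Lecture line is never printed. The function name lecture_span and the variable pending are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# lecture_span (hypothetical name): print from the first to the last line
# containing "Lecture", dropping empty lines in between.
lecture_span() {
  awk '
    /Lecture/ {
      printf "%s", pending   # flush lines held back since the previous match
      pending = ""
      print                  # print the Lecture line itself
      found = 1
      next
    }
    # After the first match, hold back non-empty lines; they are printed
    # only if another Lecture line follows.
    found && NF { pending = pending $0 ORS }
  ' "$@"
}
```

Usage would be along the lines of: lecture_span Input_file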
RavinderSingh13
  • You are quick and I give you the nod for actually protecting against non-`Lecture` lines in between. – David C. Rankin Jun 21 '20 at 02:43
  • @Sachin, I am sorry, but changing a question after it has been asked is not recommended on SO, for two main reasons. First, we all spend time on a question and write answers; if the requirements keep changing, that time goes into an unknown problem. Second, future users will be confused about why the answers keep changing. So it is better not to change a question once it is posted; you can always open a new question with your efforts and clear samples, and you will get guidance on it. Cheers. – RavinderSingh13 Aug 17 '20 at 07:36

Simply using grep 'Lecture' file on the input you have shown will work:

$ grep 'Lecture' file
Estimation of Working Capital Lecture 1
Estimation of Working Capital Lecture 2
Estimation of Working Capital Lecture 3
Money Market Lecture 254
Money Market Lecture 255
Money Market Lecture 256
International Trade Lecture 257
International Trade Lecture 258
International Trade Lecture 259

(Note: this simply grabs all the lines containing Lecture. See @RavinderSingh13's answer for protecting against non-Lecture lines in between.)
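
If non-Lecture lines between the first and last match do need to be kept, one possible sketch stays close to grep: use grep -n to find the line numbers of the first and last match, then slice that range with sed and drop the empty lines. The function name lecture_range is hypothetical, and the sketch reads the file more than once, so it assumes a regular file, not a pipe.

```shell
#!/bin/sh
# lecture_range (hypothetical name): keep everything between the first and
# last line containing "Lecture", minus empty lines.
lecture_range() {
  file=$1
  # grep -n prefixes each match with "<line number>:"; cut keeps the number.
  first=$(grep -n 'Lecture' "$file" | head -n 1 | cut -d: -f1)
  last=$(grep -n 'Lecture' "$file" | tail -n 1 | cut -d: -f1)
  # No match at all: print nothing.
  [ -n "$first" ] || return 0
  # Print the inclusive range, then delete empty or whitespace-only lines.
  sed -n "${first},${last}p" "$file" | sed '/^[[:space:]]*$/d'
}
```

Usage would be along the lines of: lecture_range file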

David C. Rankin
  • Thanks for answering, works well. Yours and the earlier answer cover both scenarios, that's great. – Sachin Jun 21 '20 at 02:48

You could replace matches of the following regular expression (with the multiline flag set) with empty strings using your tool of choice. The regex engine need only support negative lookaheads.

\A(?:^(?!.*\bLecture\b).*\r?\n)*|^\r?\n|^(?![\s\S]*\bLecture\b).*\r?\n

Start your engine!

The regex engine performs the following operations.

\A                  : match beginning of string (not line)
(?:                 : begin a non-capture group
  ^                 : match beginning of line
  (?!.*\bLecture\b) : assert the line does not contain 'Lecture'
  .*\r?\n           : match the line
)                   : end non-capture group
*                   : execute the non-capture group 0+ times
|                   : or
^\r?\n              : match an empty line
|                   : or
^                   : match beginning of line
(?!                 : begin a negative lookahead
  [\s\S]*           : match 0+ characters, including line terminators
  \bLecture\b       : match 'Lecture'
)                   : end negative lookahead ('Lecture' appears neither on this line nor after it)
.*\r?\n             : match the line
Cary Swoveland

Print everything starting from the first occurrence of the pattern, reverse the file, print everything starting from the first occurrence of the pattern, then reverse the result:

awk "/Lecture/,0" file.txt | tac | awk "/Lecture/,0" | tac
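
The pipeline above trims both ends but passes empty lines between the first and last match through unchanged. A minimal sketch that adds a final filter for the "remove all empty lines" requirement, assuming tac (GNU coreutils) is available; the function name lecture_span_tac is hypothetical.

```shell
#!/bin/sh
# lecture_span_tac (hypothetical name): trim before the first and after the
# last "Lecture" line by reversing the input twice, then drop empty lines.
lecture_span_tac() {
  # "/Lecture/,0" is a range that starts at the first match and never ends,
  # because the end condition 0 is always false.
  awk '/Lecture/,0' "$@" | tac | awk '/Lecture/,0' | tac | awk 'NF'
}
```

Usage would be along the lines of: lecture_span_tac file.txt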
Joe Jobs