Extracting string between 2 strings with bash shell script

Question

I've seen questions similar to this, but none of the solutions seem to work in this case. I have a text file that looks something like this:

START-OF-FILE
RUNDATE=20140910
FIRMNAME=dl
FILETYPE=pc
REPLYFILENAME=TEST
DERIVED=yes
PROGRAMFLAG=oneshot
SECID=ISIN
SECMASTER=yes
PROGRAMNAME=getdata
START-OF-FIELDS
ISSUER
START-OF-DATA
US345370CN85|0|4|FORD MOTOR COMPANY|FORD MOTOR COMPANY| | |
US31679BAC46|0|4|FIFTH STREET FINANCE COR|FIFTH STREET FINANCE COR| | |
END-OF-DATA
END-OF-FILE

I'm trying to write a bash shell script to extract only the text between "START-OF-DATA" and "END-OF-DATA", excluding both of these markers. So the output I'm looking for would look like this:

US345370CN85|0|4|FORD MOTOR COMPANY|FORD MOTOR COMPANY| | |
US31679BAC46|0|4|FIFTH STREET FINANCE COR|FIFTH STREET FINANCE COR| | |

The code I've written so far looks like this:

while read line
do
    name=$line

    echo $name | sed -e 's/START-OF-DATA\(.*\)END-OF-DATA/\1/'

done < $1

and I run it from bash like this:

./script.sh file.txt

where script.sh is the name I saved the shell script under and file.txt is the text file shown above. At the moment it just reads and echoes the entire file. I'm guessing it's something silly in my syntax. Any pointers in the right direction would be much appreciated. Thanks

score 6 · Accepted Answer · answered Sep 11 '14 at 11:34

Using awk you can do:

awk '/START-OF-DATA/{p=1;next} /END-OF-DATA/{p=0;exit} p' file
US345370CN85|0|4|FORD MOTOR COMPANY|FORD MOTOR COMPANY| | |
US31679BAC46|0|4|FIFTH STREET FINANCE COR|FIFTH STREET FINANCE COR| | |

Or using sed:

sed -n '/START-OF-DATA/,/END-OF-DATA/{/START-OF-DATA\|END-OF-DATA/!p;}' file
US345370CN85|0|4|FORD MOTOR COMPANY|FORD MOTOR COMPANY| | |
US31679BAC46|0|4|FIFTH STREET FINANCE COR|FIFTH STREET FINANCE COR| | |

anubhava

That's great. Exactly what I was looking for... You guys are quick off the mark I must say :) thanks again – tasslebear Sep 11 '14 at 11:47

score 2 · Answer 2 · answered Sep 11 '14 at 11:40

In order to make your solution work you could make a marker when you hit "START-OF-DATA" that reads "True" (or similar), and then end it when you hit "END-OF-DATA". Using this marker you could tell echo to print when the marker reads "True" (when you are inside the relevant block of text).
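The marker idea described above could be sketched in plain bash roughly as follows. This is only a minimal illustration: the function name `extract_block` and the variable `inside` are made up for the example, and it assumes each marker sits alone on its own line with Unix line endings.

```shell
#!/usr/bin/env bash
# Sketch of the flag ("marker") approach in plain bash.
# extract_block is a hypothetical helper name, not part of the original answer.
extract_block() {
    local inside=false line
    # IFS= and -r keep each line intact (leading spaces and backslashes preserved).
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            START-OF-DATA) inside=true;  continue ;;  # entering the block
            END-OF-DATA)   inside=false; continue ;;  # leaving the block
        esac
        if [ "$inside" = true ]; then
            printf '%s\n' "$line"
        fi
    done < "$1"
}

# Small sample input modeled on the file in the question.
sample=$(mktemp)
printf '%s\n' \
    'START-OF-FIELDS' \
    'ISSUER' \
    'START-OF-DATA' \
    'US345370CN85|0|4|FORD MOTOR COMPANY|FORD MOTOR COMPANY| | |' \
    'US31679BAC46|0|4|FIFTH STREET FINANCE COR|FIFTH STREET FINANCE COR| | |' \
    'END-OF-DATA' \
    'END-OF-FILE' > "$sample"

result=$(extract_block "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

In a script like the one in the question, the function would be called with `"$1"` instead of the temporary sample file. Quoting the variable and using `printf` rather than an unquoted `echo $name` also keeps the spacing in the `| | |` fields intact.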

...or you could use sed:

sed -n '/START-OF-DATA/,/END-OF-DATA/ { //!p }' file.txt
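A note on the `//` in that command: in sed, an empty regular expression reuses the last regular expression that was applied. Inside the range, that is the start pattern on the first line and the end pattern on every later line, so `//!p` skips both marker lines and prints everything in between. The sketch below compares the short form with a spelled-out equivalent on a small sample file; the sample data and variable names are only for illustration, and it was written with GNU sed in mind (the `;` before each `}` is there because some other sed implementations require it).

```shell
#!/usr/bin/env bash
# Compare the empty-regex shorthand with an explicit version of the same filter.
sample=$(mktemp)
printf '%s\n' \
    'START-OF-FIELDS' \
    'ISSUER' \
    'START-OF-DATA' \
    'US345370CN85|0|4|FORD MOTOR COMPANY|FORD MOTOR COMPANY| | |' \
    'US31679BAC46|0|4|FIFTH STREET FINANCE COR|FIFTH STREET FINANCE COR| | |' \
    'END-OF-DATA' \
    'END-OF-FILE' > "$sample"

# Short form: // stands for whichever range pattern was tested last.
short=$(sed -n '/START-OF-DATA/,/END-OF-DATA/{//!p;}' "$sample")

# Explicit form: print lines in the range that match neither marker.
long=$(sed -n '/START-OF-DATA/,/END-OF-DATA/{/START-OF-DATA/!{/END-OF-DATA/!p;};}' "$sample")

printf '%s\n' "$short"
rm -f "$sample"
```

Both variables should end up holding the same two data lines.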

bryn

Thanks for the reply bryn. Your solution works perfectly. I had to upvote @anubhava as his reply was a little bit quicker. Thanks though. Have the script working fine now. Regards – tasslebear Sep 11 '14 at 11:52
Hi bryn. Your [tag:sed] command line is nicer than [anubhava's one](http://stackoverflow.com/a/25786380/938111). But a bit cryptic for me: I am wondering what `//` means in `{ //!p }`. Please give some explanations or links to websites explaining this. Cheers ;-) – oHo Sep 12 '14 at 08:49

score 1 · Answer 3 · edited May 23 '17 at 11:44

I'd like to add the perlish grep way, as mentioned here:

grep -Pzo "(?s)START-OF-DATA.*END-OF-DATA" "$1"

This still includes the START-OF-DATA and END-OF-DATA markers. To get rid of them, the pattern has to become a bit less readable:

grep -Pzo "(?s)(?<=START-OF-DATA\n).*(?=\nEND-OF-DATA)" "$1"

(?<=START-OF-DATA\n) and (?=\nEND-OF-DATA) are look-around assertions as described in perlre, i.e. they are used for matching, but not included in the result.
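As a rough sketch of how the look-around version behaves end to end: it assumes a GNU grep built with PCRE support (`-P`), and the sample file and the variable name `block` are just for illustration. Because `-z` makes grep treat the whole file as one NUL-terminated record, the match is also written with a trailing NUL byte, which `tr` strips here before the result is captured.

```shell
#!/usr/bin/env bash
# Look-around extraction with GNU grep -P (assumes PCRE support is available).
sample=$(mktemp)
printf '%s\n' \
    'START-OF-FIELDS' \
    'ISSUER' \
    'START-OF-DATA' \
    'US345370CN85|0|4|FORD MOTOR COMPANY|FORD MOTOR COMPANY| | |' \
    'US31679BAC46|0|4|FIFTH STREET FINANCE COR|FIFTH STREET FINANCE COR| | |' \
    'END-OF-DATA' \
    'END-OF-FILE' > "$sample"

# -z: read the file as a single record; -o: print only the matched part.
# tr -d '\0' removes the NUL terminator that -z appends to the output.
block=$(grep -Pzo '(?s)(?<=START-OF-DATA\n).*(?=\nEND-OF-DATA)' "$sample" | tr -d '\0')

printf '%s\n' "$block"
rm -f "$sample"
```

Note that `.*` is greedy, so with several START-OF-DATA/END-OF-DATA blocks in one file it would run from the first start marker to the last end marker; `.*?` would stop at the nearest end marker instead.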

answered Sep 11 '14 at 12:08

Michael Jaros

Nice to use grep, but the lines `START-OF-DATA` and `END-OF-DATA` are printed :-/ Please try to improve your command line to avoid printing these two lines. Have fun :-) Cheers – oHo Sep 11 '14 at 12:29

@olibre: Thanks for pointing that out. I added the improved command line. – Michael Jaros Sep 11 '14 at 16:53
