Extract text between two strings and perform an operation on it

Question

I have a file that contains the following text:

<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
<Blah>
\MY_TEXT>                       #Second occurrence of MY_TEXT
<MY_TEXT="ABC" PATH="EFG"       #Third occurrence of MY_TEXT
<location= "QQQ" path="LLL"
\location>
<R_DATA = MNOP     
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
<Blah>
\MY_TEXT>         #Fourth occurrence of MY_TEXT

My task is to find the line that contains <MY_TEXT="XYZ" (it may have leading spaces) and then find its closing \MY_TEXT>, so that the extracted block looks like this:

<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >  #First occurrence of Mylocation
<Mylocation ="ghdf" stime=20150401 etime=20150501 >  #Second occurrence of Mylocation
\R_DATA>
<Blah>
\MY_TEXT>

Then I need to find the last occurrence of Mylocation in that block (the line marked #Second occurrence of Mylocation here), change etime=20150501 to something else, and append a new line after it, editing the file in place.

I came across the question "Sed to extract text between two strings". But the sed command from there either prints nothing when I use the -n option or prints the entire file when I remove -n. So I cannot process the text further, because I cannot extract the block I want in the first place.

I also tried sed -n '/^ *START=A *$/,/^ *END *$/p' yourfile, but to no avail. Can you help me? My scripting is not great. Thanks in advance.

score 1 · Accepted Answer · answered Mar 21 '15 at 16:33

This is a little tricky with sed, but I'll have a go at it.

Important note: This looks like a well-defined file format, but I don't recognize it. It might be prudent to see if there are tools that work on this format directly rather than treating it like a flat file the way sed must. It is very probable that such a solution would be shorter, easier to understand, and more robust than direct-text hackery.

That said, you can use the following, which relies on GNU sed extensions (\+ in the pattern and \n in the replacement):

sed -n '/<MY_TEXT="XYZ"/ { :a /\\MY_TEXT>/! { N; ba }; s/\(.*\)\(<Mylocation\)/\1\\MY_TEXT>\n\2/; h; s/.*\\MY_TEXT>\n//; s/etime=[0-9]\+/etime=something/; s/\n/\n\n/; s/$/\\MY_TEXT>/; G; s/\(.*\)\\MY_TEXT>\n\(.*\)\\MY_TEXT>\n\(.*\)/\2\1/; p }' filename

Output:

<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=something >

\R_DATA>
<Blah>
\MY_TEXT>

The most confusing bit of this is the use of \MY_TEXT>\n as a marker to separate the working chunks; this is done because we know it doesn't appear anywhere else in the text. \MY_TEXT> first appears in the last line of the block we're working on, so there's never going to be a newline after it in the input data. (The code might be clearer with some other marker that doesn't appear in the text, but I don't know for certain of anything more obvious.)

The code works as follows:

#!/bin/sed -nf

/<MY_TEXT="XYZ"/ {                                    # If we find the starter
                                                      # line:
  :a
  /\\MY_TEXT>/! {                                     # fetch the rest of the
    N                                                 # block into the
    ba                                                # pattern space
  }
  s/\(.*\)\(<Mylocation\)/\1\\MY_TEXT>\n\2/           # mark the place before
                                                      # the last Mylocation tag
  h                                                   # copy that to the hold
                                                      # buffer
  s/.*\\MY_TEXT>\n//                                  # remove the stuff before
                                                      # the marker
  s/etime=[0-9]\+/etime=something/                    # replace  the etime
                                                      # attribute
  s/\n/\n\n/                                          # insert the new line
  s/$/\\MY_TEXT>/                                     # put a marker at the end
  G                                                   # fetch back the stuff
                                                      # from the hold buffer
  s/\(.*\)\\MY_TEXT>\n\(.*\)\\MY_TEXT>\n\(.*\)/\2\1/  # replace the end chunk
                                                      # with the edited version
  p                                                   # print the result.
}
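
The marker step in the script above works because \(.*\) is greedy and, inside a multi-line pattern space, also matches across newlines, so the marker lands before the last matching tag. A minimal sketch of just that step, using made-up input and @@ as a stand-in marker (both are assumptions for illustration, not part of the original file format):

```shell
# Made-up three-line input; "@@" is an arbitrary marker chosen for this demo.
marked=$(printf '<loc a>\n<loc b>\n<end>\n' | sed -n '
  # Slurp every input line into the pattern space.
  :a
  $!{
    N
    ba
  }
  # Greedy \(.*\) consumes as much as it can, so the marker is
  # inserted before the LAST "<loc", not the first.
  s/\(.*\)\(<loc\)/\1@@\2/
  p
')
printf '%s\n' "$marked"
```

The second line of the result is @@<loc b>, while the first <loc is left untouched.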

This is a wonderful piece of sed, and you have explained it as well as anyone could. But it misses my last requirement of appending a new line after the last Mylocation, in place in the file. Then again, I am overwhelmed. — Invictus, Mar 21 '15 at 18:02
It does insert a line, it's just empty in the script because I didn't know what you wanted to put there. Replace `s/\n/\n\n/` with `s/\n/\nyour line here\n/` to fix that. — Wintermute, Mar 21 '15 at 18:06
Yes, changing that does the trick, but when I put -i in front of sed it replaces the entire file with the new output instead of modifying the processed text in place. I mean the file contains just the output instead of the original file with the modified text. — Invictus, Mar 21 '15 at 18:14
I thought isolating it was one of the requirements. Anyway, to do it that other way, remove the `-n` option from the sed call and remove the `p` command from the end of the sed code. — Wintermute, Mar 21 '15 at 18:19
That does the trick, but sadly the output goes to the console rather than into the file. — Invictus, Mar 21 '15 at 18:26
If you want to edit the file in place, you'll still have to give sed the `-i` option; just don't pass `-n` as well. `-n` suppresses the auto-printing of the pattern space after the code finished running, which was good when I tried to isolate the range but is not good when you want to print the rest of the file as well. — Wintermute, Mar 21 '15 at 18:29
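
Putting the comments above together, a sketch of the in-place variant might look like the following. It assumes GNU sed (for -i, \+ and \n in the replacement), runs against a scratch copy of a shortened sample created with mktemp, and keeps the placeholders "something" and "your line here" from the answer and comments:

```shell
# Scratch file with a cut-down sample, so the demo never touches real data.
f=$(mktemp)
cat > "$f" <<'EOF'
<MY_TEXT="XYZ" PATH="MNO"
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
\MY_TEXT>
<MY_TEXT="ABC" PATH="EFG"
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\MY_TEXT>
EOF

# -i edits the file in place (GNU sed). There is no -n and no trailing p,
# so lines outside the XYZ block are auto-printed unchanged.
sed -i '/<MY_TEXT="XYZ"/ {
  :a
  /\\MY_TEXT>/! {
    N
    ba
  }
  s/\(.*\)\(<Mylocation\)/\1\\MY_TEXT>\n\2/
  h
  s/.*\\MY_TEXT>\n//
  s/etime=[0-9]\+/etime=something/
  s/\n/\nyour line here\n/
  s/$/\\MY_TEXT>/
  G
  s/\(.*\)\\MY_TEXT>\n\(.*\)\\MY_TEXT>\n\(.*\)/\2\1/
}' "$f"

cat "$f"
```

Only the last Mylocation line of the XYZ block changes and gains a following line; the ABC block keeps its original etime.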

score 1 · Answer 2 · answered Mar 21 '15 at 16:43

A simple solution is to use a range pattern:

awk '/<MY_TEXT="XYZ"/,/\\MY_TEXT/' file
<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
<Blah>
\MY_TEXT>                       #Second occurrence of MY_TEXT

Or with sed:

sed -n '/<MY_TEXT="XYZ"/,/\\MY_TEXT/p' file
<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
<Blah>
\MY_TEXT>                       #Second occurrence of MY_TEXT
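
The range patterns above only extract the block. If the edit from the question is also needed, one possible awk approach (a sketch that goes beyond this answer, not something it states) is to buffer the block, remember the index of the last Mylocation line, and change it when the closing tag arrives. The sample file, the variable names, and the placeholders "something" and "your line here" are assumptions for illustration:

```shell
# Scratch file with a cut-down sample.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<MY_TEXT="XYZ" PATH="MNO"
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
\MY_TEXT>
<MY_TEXT="ABC" PATH="EFG"
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\MY_TEXT>
EOF

out=$(awk '
  # Opening tag of the wanted block: start buffering.
  /<MY_TEXT="XYZ"/ { inblock = 1; n = 0; last = 0 }
  inblock {
    buf[++n] = $0
    if ($0 ~ /<Mylocation/) last = n        # index of the latest match
    if ($0 ~ /\\MY_TEXT>/) {                # closing tag: flush the buffer
      for (i = 1; i <= n; i++) {
        if (i == last) {
          sub(/etime=[0-9]+/, "etime=something", buf[i])
          print buf[i]
          print "your line here"
        } else {
          print buf[i]
        }
      }
      inblock = 0
    }
    next
  }
  { print }                                 # lines outside the block pass through
' "$sample")
printf '%s\n' "$out"
```

awk has no portable in-place option, so to update the file the output would have to be redirected to a temporary file and moved over the original.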
