21

I have a text file that looks like this:

random useless text 
<!-- this is token 1 --> 
para1 
para2 
para3 
<!-- this is token 2 --> 
random useless text again

I want to extract the text between the tokens (excluding the tokens themselves, of course). I tried using ## and %% to extract the data in between, but it didn't work. I don't think they are meant for manipulating such large text files. Any suggestions on how I can do it? Maybe awk or sed?

tripleee
tapan

7 Answers

41

No need for head and tail or grep or to read the file multiple times:

sed -n '/<!-- this is token 1 -->/{:a;n;/<!-- this is token 2 -->/b;p;ba}' inputfile

Explanation:

  • -n - don't do an implicit print
  • /<!-- this is token 1 -->/{ - if the starting marker is found, then
    • :a - label "a"
    • n - read the next line
    • /<!-- this is token 2 -->/b - if it's the ending marker, branch to the end of the script (leaving the loop)
    • p - otherwise, print the line
    • ba - branch to label "a"
  • } - end if
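To see how `b` and `q` differ at the ending marker, both variants can be run on input that contains two marked blocks. This is only a minimal sketch: the function names `sample`, `extract_b`, and `extract_q` and the sample data are made up for illustration, and the script is written across several lines because some sed implementations reject the one-line form.

```shell
#!/bin/sh
# Hypothetical sample input containing two marked blocks.
sample() {
    printf '%s\n' \
        'junk' \
        '<!-- this is token 1 -->' \
        'para1' \
        '<!-- this is token 2 -->' \
        'junk' \
        '<!-- this is token 1 -->' \
        'para2' \
        '<!-- this is token 2 -->'
}

# b: leave the loop at the ending marker, then keep scanning
# the rest of the input for another starting marker.
extract_b() {
    sed -n '/<!-- this is token 1 -->/{
:a
n
/<!-- this is token 2 -->/b
p
ba
}'
}

# q: stop processing entirely at the first ending marker.
extract_q() {
    sed -n '/<!-- this is token 1 -->/{
:a
n
/<!-- this is token 2 -->/q
p
ba
}'
}

sample | extract_b   # prints para1 and para2
sample | extract_q   # prints para1 only
```

With a single marked block, as in the question, both variants print the same lines; `q` simply avoids reading the rest of the file.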
Dennis Williamson
  • In your sed script you used `b` to exit the loop, but in your explanations you used `q` (I noticed this when using your instructions: `q` seems to make sed quit immediately, whereas `b` will just exit the loop but continue looking for the next `token 1` marker). – Frerich Raabe Oct 16 '12 at 21:56
  • Another thing I noticed: with the FreeBSD sed, `sed -n '/^----$/{n;/^----$/q;p;}' /dev/null` works fine (no output), but adding the loop (i.e. `sed -n '/^----$/{:a;n;/^----$/q;p;ba}' /dev/null`) makes sed yield `unexpected EOF (pending }'s)`. I *have to* write out the version using the loop in multiple lines. :-( – Frerich Raabe Oct 16 '12 at 21:59
  • @FrerichRaabe: For the example text from the question, on my system, `b` and `q` have the same effect. The fact that I posted it both ways was accidental. Sed varies quite a bit from system to system. It is possible that this will work for you on FreeBSD (all on one line): `sed -n -e '//{' -e ':a' -e 'n' -e '//b' -e 'p' -e 'ba' -e '}'` – Dennis Williamson Oct 17 '12 at 01:20
  • @DennisWilliamson: Heh, splitting the script is a clever workaround. I'll give it a try! +1 for your answer by the way, I think 'sed' is terribly underrated! – Frerich Raabe Oct 17 '12 at 07:17
  • @DennisWilliamson How to have a bash variable instead of ``? – Ilia Ross Sep 25 '17 at 13:06
  • @IliaRostovtsev: Incorporating Bash variables into `sed` commands can be tricky, since you'll need to use double quotes so that the variable is evaluated. You may have to avoid unintended evaluations by escaping some characters. You should also use curly braces around the variable name to separate it from other characters which might otherwise be considered part of the name. In this particular case: `var=''; sed -n "//{:a;n;/${var}/b;p;ba}" inputfile` should work (untested). – Dennis Williamson Sep 25 '17 at 13:43
  • @DennisWilliamson Dennis, thank you. I'm aware of that. The problem is that my string contains `/*` and `*/`. It fails with the error `sed: -e expression #1, char 3: unknown command: *'`, but if I escape them with `\` then I get a whole bunch of other errors, like `sed: -e expression #1, char 20: unknown command: u` and `sh: p: command not found` and `sh: ba}: command not found`. It seems that it's executing it as a literal. Any idea? – Ilia Ross Sep 25 '17 at 13:50
  • I figured out that the string in my variable contained `/`, and that is what actually broke it. After replacing `/` with `\/`, everything seems to work. Thanks again, Dennis. ;) – Ilia Ross Sep 25 '17 at 18:50
  • @IliaRostovtsev: You can also use alternate delimiters (at least in GNU and OS X `sed`): `sed 's|replace me|with this|'` or `sed -n '\|find me|p'`. In the second case (matching), the first delimiter must be escaped if it's not a slash and the second delimiter must not be escaped. Using alternate delimiters allows you to use strings (or variables) which contain slashes without having to modify them. – Dennis Williamson Sep 25 '17 at 21:08
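When the markers live in shell variables and may contain slashes or regex metacharacters (such as `/*` and `*/`), one way to avoid escaping altogether is to compare them as plain strings. The following is only a sketch using awk's `index()` function; the helper name `extract_between` and the environment variable names `TOKEN_START` and `TOKEN_STOP` are made up for illustration. The markers are passed through the environment rather than `-v` so that awk does not interpret backslashes in them.

```shell
#!/bin/sh
# extract_between START STOP < input   (hypothetical helper)
# Prints the lines strictly between the first line containing START
# and the next line containing STOP, both matched as literal strings.
extract_between() {
    TOKEN_START="$1" TOKEN_STOP="$2" awk '
        BEGIN { start = ENVIRON["TOKEN_START"]; stop = ENVIRON["TOKEN_STOP"] }
        inside && index($0, stop)   { exit }        # ending marker: stop reading
        inside                      { print }       # between the markers: keep the line
        !inside && index($0, start) { inside = 1 }  # starting marker: begin after it
    '
}

# Example with markers that would need escaping in a sed address.
printf '%s\n' 'junk' '/* begin */' 'a/b' 'c*d' '/* end */' 'junk' |
    extract_between '/* begin */' '/* end */'
```

Because `index()` does a substring search, no character in the markers has a special meaning.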
26

You can extract it, including the tokens with sed. Then use head and tail to strip the tokens off.

... | sed -n "/this is token 1/,/this is token 2/p" | head -n-1 | tail -n+2
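Negative line counts for `head -n` are an extension that POSIX does not specify, so `head -n-1` may not be available everywhere. As a possible alternative, a second sed can delete the first and last lines of the extracted range. This is a sketch only; the helper name `strip_markers` and the sample input are made up for illustration, and it assumes the range matched exactly once and that the ending marker is present.

```shell
#!/bin/sh
# Hypothetical helper: drop the first and the last line of its input,
# which are the two marker lines when the range matched exactly once.
strip_markers() {
    sed '1d;$d'
}

printf '%s\n' 'junk' '<!-- this is token 1 -->' 'para1' 'para2' \
    '<!-- this is token 2 -->' 'junk' |
    sed -n '/this is token 1/,/this is token 2/p' |
    strip_markers
```

If the ending marker is missing, the range runs to the end of the file and the last real line is dropped, which is the same limitation the `head`/`tail` version has.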
Peter Taylor
1

No need to call the mighty sed / awk / perl. You could do it "bash-only":

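#!/bin/bash
# Print the lines of t.txt that lie between the two marker lines.
STARTFLAG="false"
# -r keeps backslashes in the input literal
while read -r LINE; do
    if [ "$STARTFLAG" == "true" ]; then
        if [ "$LINE" == '<!-- this is token 2 -->' ]; then
            break   # ending marker reached: stop reading
        else
            printf '%s\n' "$LINE"
        fi
    elif [ "$LINE" == '<!-- this is token 1 -->' ]; then
        STARTFLAG="true"   # starting marker found: print from the next line on
    fi
done < t.txt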

Kind regards

realex

realex
1

Try the following:

sed -n '/<!-- this is token 1 -->/,/<!-- this is token 2 -->/p' your_input_file |
        grep -Ev '<!-- this is token . -->'
aioobe
1

Maybe sed and awk have more elegant solutions, but I have a "poor man's" approach with grep, cut, head, and tail.

#!/bin/bash

dataFile="/path/to/some/data.txt"
startToken="token 1"
stopToken="token 2"

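# Line number of the first line matching each token
startTokenLine=$( grep -n "${startToken}" "${dataFile}" | head -n 1 | cut -f 1 -d':' )
stopTokenLine=$( grep -n "${stopToken}" "${dataFile}" | head -n 1 | cut -f 1 -d':' )

# Last line to keep, and the number of lines between the tokens
stopTokenLine=$(( stopTokenLine - 1 ))
tailLines=$(( stopTokenLine - startTokenLine ))

head -n "${stopTokenLine}" "${dataFile}" | tail -n "${tailLines}"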
0

For anything like this, I'd reach for Perl, with its combination of (amongst others) sed and awk capabilities. Something like (beware - untested):

my $recording = 0;
my @results = ();
while (<STDIN>) {
   chomp;
   if (/token 1/) {
      $recording = 1;
   }
   elsif (/token 2/) {
      $recording = 0;
   }
   elsif ($recording) {
      push @results, $_;
   }
}
# chomp removed the newlines, so add them back when printing
print "$_\n" for @results;
Brian Agnew
0
sed -n "/TOKEN1/,/TOKEN2/p" <YOUR INPUT FILE> | sed -e '/TOKEN1/d' -e '/TOKEN2/d'
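The same filtering can probably be done in a single sed process by deleting the marker lines inside the range before printing. A minimal sketch, keeping the `TOKEN1`/`TOKEN2` placeholders from the command above; the function name `strip_range` and the sample input are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of each TOKEN1..TOKEN2 range,
# dropping the marker lines themselves.
strip_range() {
    sed -n '/TOKEN1/,/TOKEN2/{
/TOKEN1/d
/TOKEN2/d
p
}'
}

printf '%s\n' 'junk' 'TOKEN1' 'para1' 'para2' 'TOKEN2' 'junk' | strip_range
```

As with the two-process version, any line inside the range that happens to contain `TOKEN1` or `TOKEN2` is dropped as well.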
Kelly Beard