How to use sed/grep to extract text between two words?

Question

I am trying to output a string that contains everything between two words of a string:

input:

"Here is a String"

output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.

What should be the result if the input is `Here is a Here String`? Or `I Hereby Dub Thee Sir Stringy`? — ghoti, Nov 06 '12 at 00:17
FYI. Your command means to print everything between the line that has the word Here and the line that has the word String -- not what you want. — Hai Vu, Nov 06 '12 at 00:54
The other common `sed` FAQ is "how can I extract text between particular lines"; this is https://stackoverflow.com/questions/16643288/sed-to-extract-text-between-two-strings — tripleee, Jul 30 '20 at 05:44
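As Hai Vu's comment above points out, a `/start/,/end/` address in sed selects whole lines, not text inside a line. A minimal sketch of that behaviour, using a made-up four-line sample (the variable name `sample` and its contents are assumptions for illustration):

```shell
# Hypothetical sample: only the two middle lines contain the markers.
sample=$(printf '%s\n' 'first line' 'Here is' 'a String' 'last line')

# The range address selects complete lines, from the first line matching
# /Here/ through the next line matching /String/, markers included.
printf '%s\n' "$sample" | sed -n '/Here/,/String/p'
# prints:
# Here is
# a String
```

So with a one-line input such as "Here is a String" the whole line is printed, which is why the endpoints show up in the output.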

anishsane · Answer 1 · 2019-02-21T04:03:28.343

248

GNU grep also supports positive and negative lookahead and lookbehind through its `-P` (PCRE) option. For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

If there are multiple occurrences of Here and string, you can choose whether to match from the first Here to the last string, or to match each pair individually. In regex terms, these are called a greedy match (first case) and a non-greedy match (second case):

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
 is a string, and Here is another 
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
 is a 
 is another

edited Feb 21 '19 at 04:03

answered Nov 06 '12 at 06:58

anishsane


42

Note that GNU grep's `-P` option does not exist in the `grep` included in *BSD, or the ones that come with any SVR4 (Solaris, etc). In FreeBSD, you can install the `devel/pcre` port which includes `pcregrep`, which supports PCRE (and look-ahead/behind). Older versions of OSX used GNU grep, but in OSX Mavericks, `grep` is derived from FreeBSD's version, which does not include the `-P` option. – ghoti May 05 '14 at 02:18
1

Hi, How do I extract distinct content only ? – Durgesh Suthar Sep 16 '15 at 09:44
4

This doesn't work because if your ending string "string" occurs more than once, it will get the *last* occurrence, not the *next* occurrence. – Buttle Butkus Oct 27 '16 at 00:43
6

In the case of `Here is a string a string`, **both** `" is a "` and `" is a string a "` are valid answers (ignore the quotes), as per the question requirements. Which one **you** want is up to you, and the answer differs accordingly. Anyway, for your requirement, this will work: `echo "Here is a string a string" | grep -o -P '(?<=Here).*?(?=string)'` – anishsane Oct 27 '16 at 03:31
Maybe that should be spelled out in the answer properly, though. – tripleee Sep 15 '17 at 12:10
@ghoti `grep -P` flag does not work on MacOS X 10.9 and above. But you can convert it to a `perl` command using the guide here: https://stackoverflow.com/questions/16658333/grep-p-no-longer-works-how-can-i-rewrite-my-searches – Mr-IDE Feb 27 '19 at 19:37
@Mr-IDE, when I made that comment, 10.9 was the most recent version available. Yes, as I said, the `-P` option does not work because macOS's `grep` comes from FreeBSD. You can get a `pcregrep` binary via the pcre1 package in [brew](https://formulae.brew.sh/formula/pcre) or [macports](https://www.macports.org/ports.php?by=name&substr=pcre). – ghoti Feb 28 '19 at 02:24
This is way more user-friendly than that shitty `sed`. I couldn't get `sed` to work, but `grep` worked like a charm the first time! – Mariusz Jan 10 '20 at 12:08
how would you use this on an xml file? – kRazzy R Jan 24 '20 at 21:12
@kRazzyR, First, you should avoid using text manipulation commands (like awk, sed) for parsing XML. Use `xmlstarlet` instead. It is much easier and works better for XML. Having said that, you can still use sed/awk in the same way for XML as any other text file. – anishsane Jan 27 '20 at 04:29
How do you include `Here` and `string` in the results? – Smeterlink Apr 11 '20 at 11:45
@Smeterlink, that is much simpler than the original question. `grep -o 'Here.*string'` or `grep -oP 'Here.*?string'` based on whether you want match to be greedy or non greedy. – anishsane Apr 12 '20 at 13:59
Will this work if the two patterns are on different lines? I tried, but it is not working. – BND May 08 '20 at 08:36
4

@BND, you need to enable [multi-line search feature of pcregrep](https://stackoverflow.com/a/7167115). `echo $'Here is \na string' | grep -zoP '(?<=Here)(?s).*(?=string)'` – anishsane May 08 '20 at 08:48
@anishsane a sedic solution too `sed -n '/Here/,/string/p'` :). This includes the patterns though. – BND May 08 '20 at 10:14
1

@BND, no. That would print the ENTIRE line containing the two words. Try e.g. `echo $'Hello there. Here is \n a string. goodbye there.' | sed ...` – anishsane May 08 '20 at 10:16
How to stop at the first occurrence? – Suryaprakash Pisay Nov 06 '20 at 09:39
Thanks for your answer! What if "Here" is present several times and I would like to catch only the text between the last "Here" and "string"? – TheLazyFox May 06 '21 at 13:54
@TheLazyFox, I have to check. I don't have an answer off-hand. But a simple hack would be to reverse the input and search string; perform a non-greedy search and reverse the result. (`rev` command does it.) – anishsane May 08 '21 at 16:59
@TheLazyFox, you can use sed or perl for this: `perl -nE 'say /.*(?<=Here)(.*)String/'` / `sed -r 's/.*Here(.*)String/\1/'` – anishsane May 20 '21 at 04:48

score 154 · Accepted Answer · answered Nov 06 '12 at 00:14

154

sed -e 's/Here\(.*\)String/\1/'

answered Nov 06 '12 at 00:14

Brian Campbell


2

Thanks! What if I wanted to find everything between "one is" and "String" in "Here is a one is a String"? (`sed -e 's/one is\(.*\)String/\1/'`?) – user1190650 Nov 06 '12 at 00:31
8

@user1190650 That would work if you want to see the "Here is a" as well. You can test it out: `echo "Here is a one is a String" | sed -e 's/one is\(.*\)String/\1/'`. If you just want the part between "one is" and "String", then you need to make the regex match the whole line: `sed -e 's/.*one is\(.*\)String.*/\1/'`. In sed, `s/pattern/replacement/` says "substitute 'replacement' for 'pattern' on each line". It only changes the part that matches "pattern", so if you want it to replace the whole line, you need to make "pattern" match the whole line. – Brian Campbell Nov 06 '12 at 13:59
9

This breaks when the input is `Here is a String Here is a String` – Jay D May 19 '15 at 01:09
1

It would be great to see a solution for this case: "Here is a blah blah String Here is 1 a blah blah String Here is 2 a blash blash String". The output should pick up only the first substring between "Here" and "String". – Jay D May 19 '15 at 01:10
1

@JayD sed does not support non-greedy matching, see [this question](https://stackoverflow.com/questions/1103149/non-greedy-regex-matching-in-sed) for some recommended alternatives. – Brian Campbell May 19 '15 at 14:11
@BrianCampbell Thanks Brian for the reference. – Jay D May 19 '15 at 21:26
1

What about if I want also start and end strings? Like [this regex](https://regex101.com/r/4Kmw09/3). I tried multiple regex, e.g `echo "before text START some text END more text" | sed -n '/START.*?END/g'` – Mikel Nov 11 '16 at 16:09
1

This answer does not work if there is text before the `Here` and after the `String`. See [my answer](http://stackoverflow.com/a/43795984/1902896) for a solution. – wheeler May 05 '17 at 03:24
@wheeler A similar case is discussed in the first and second comment on this answer already (the "one is" case), where I discuss what you need to do if your strings are not the exact prefix and suffix of the line. – Brian Campbell May 08 '17 at 19:02
This will not work: `echo Here is a String, and Here is b String! | sed -e 's/Here\(.*\)String/\1/'` The answer from @anishsane is better. – Cyborg Feb 20 '19 at 14:19
1

What is the meaning of `\1`? – Aug 12 '20 at 15:13

score 95 · Answer 3 · answered May 05 '17 at 03:23

95

The accepted answer does not remove text that could be before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and after String.
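To make the difference concrete, here is a small comparison on a made-up input that has text on both sides of the markers (the variable `line` and its contents are assumptions for illustration):

```shell
line='Well, Here is a String, ok'   # hypothetical input with text around the markers

# Accepted answer: only the matched part is replaced, so the surrounding
# text survives.
echo "$line" | sed -e 's/Here\(.*\)String/\1/'       # Well,  is a , ok

# This answer: the pattern covers the whole line, so the whole line is
# replaced by the captured group.
echo "$line" | sed -e 's/.*Here\(.*\)String.*/\1/'   #  is a
```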

answered May 05 '17 at 03:23

wheeler


Your answer is promising. One issue though. How can I extract it to the first seen String if there are multiple String in the same line? Thanks – Dr. Mian Jun 26 '18 at 08:55
@MianAsbatAhmad You would want to make the `*` quantifier, between `Here` and `String`, non-greedy (or lazy). However, the type of regex used by sed does not support lazy quantifiers (a `?` immediately after `.*`) according to [this](https://stackoverflow.com/questions/1103149/non-greedy-reluctant-regex-matching-in-sed) Stackoverflow question. Usually to implement a lazy quantifier you would just match against everything except the token you didn't want to match, but in this case, there isn't just a single token; instead it's a whole string, `String`. – wheeler Jun 26 '18 at 21:30
Thanks, I got the answer using awk, https://stackoverflow.com/questions/51041463/how-to-extract-line-portion-on-the-basis-of-start-substring-and-end-substring-us/51047792#51047792 – Dr. Mian Jun 27 '18 at 04:25
1

Unfortunately this doesn't work if the string has line breaks – WitaloBenicio Jun 06 '19 at 10:47
It's not supposed to. `.` doesn't match line breaks. If you want to match line breaks, you can replace `.` with something like `[\s\s]`. – wheeler Jun 18 '19 at 14:58
@wheeler replacing . with [\s\s] not removing line breaks . – sreekanth balu Jul 22 '20 at 04:21
Whoops, it is meant to be `[\s\S]`. – wheeler Aug 05 '20 at 02:55
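Regarding the lazy-quantifier discussion in the comments above: one possible workaround in sed is to avoid the single greedy regex and work in steps, marking the first `Here` with a newline (which cannot otherwise occur in a one-line pattern space). This is only a sketch and assumes GNU sed, where `\n` means a newline in both the regex and the replacement; `first_between` is a hypothetical helper name.

```shell
first_between() {
  # 1. Drop lines that have no start marker at all.
  # 2. Replace the FIRST "Here" with a newline.
  # 3. Delete everything up to and including that newline.
  # 4. Delete from the first remaining "String" to the end of the line.
  sed '/Here/!d; s/Here/\n/; s/.*\n//; s/String.*//'
}

echo 'x Here is a String Here is b String' | first_between   # " is a "
```

Step 4 works because sed always picks the leftmost match, so `String.*` starts at the first `String` after the marker.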

ghoti · Answer 4 · 2016-10-20T20:35:38.593

47

You can strip strings in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$
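When the markers occur more than once, the choice between `#`/`##` and `%`/`%%` decides which pair you get. A short sketch with a made-up two-pair input (the variable names and the input are assumptions for illustration):

```shell
foo='Here is a String Here is b String'   # hypothetical input with two pairs

# First pair: shortest prefix removal (#), then longest suffix removal (%%).
first=${foo#*Here }
first=${first%% String*}
echo "$first"    # is a

# Last pair: longest prefix removal (##), then shortest suffix removal (%).
last=${foo##*Here }
last=${last% String*}
echo "$last"     # is b
```

These expansions are POSIX shell features, so this should not depend on Bash specifically.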

And if you have a GNU grep that includes PCRE, you can use a zero-width assertion:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a

edited Oct 20 '16 at 20:35

answered Nov 06 '12 at 00:19

ghoti


why is this method so slow? when stripping a large html page using this method it takes like 10 seconds. – Adam Johns Jan 22 '14 at 15:12
@AdamJohns, which method? The PCRE one? PCRE is fairly complex to parse, but 10 seconds seems extreme. If you're concerned, I recommend you [pose a question](http://stackoverflow.com/questions/ask) including example code, and see what the experts say. – ghoti Jan 27 '14 at 06:01
1

I think it was so slow for me because it was holding a very large html file's source in a variable. When I wrote contents to file and then parsed the file the speed dramatically increased. – Adam Johns Jan 27 '14 at 14:14
Should be the accepted answer, because it uses pure Bash. – Akito Jul 01 '21 at 19:40

score 33 · Answer 5 · edited Jun 05 '15 at 09:18

33

If you have a long file with many multi-line occurrences, it is useful to print line numbers first:

cat -n file | sed -n '/Here/,/String/p'
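The range above prints the marker lines too. If only the lines strictly between them are wanted, a flag-based awk filter is one possible sketch (this swaps sed for awk; `between_lines` is a hypothetical helper name, and it assumes the markers sit on their own lines):

```shell
between_lines() {
  awk '
    /String/ { f = 0 }   # end marker: stop printing, this line is skipped
    f                    # print the line while the flag is set
    /Here/   { f = 1 }   # start marker: print from the next line onwards
  ' "$@"
}

printf '%s\n' 'a' 'Here' 'b' 'c' 'String' 'd' | between_lines
# prints:
# b
# c
```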

edited Jun 05 '15 at 09:18

Juve


answered Jun 08 '13 at 13:11

alemol


5

Thanks! This is the only solution which worked in my case (multiple line text file, rather than a single string with no line breaks). Obviously, to have it without line numbering, the `-n` option in `cat` must be omitted. – Jeffrey Lebowski Jun 02 '16 at 13:39
2

... in which case `cat` can be entirely omitted; `sed` knows how to read a file or standard input. – tripleee Sep 15 '17 at 12:07

Avinash Raj · Answer 6 · 2014-08-19T15:19:41.087

30

Through GNU awk,

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a
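The command above prints only `$2`, so it reports just the first pair on a line. If the markers strictly alternate (Here, string, Here, string, ...), every even-numbered field lies between a `Here` and a `string`, and a loop can print them all. A sketch under that assumption; `between_fields` is a hypothetical helper name:

```shell
between_fields() {
  # Both markers act as field separators, so fields 2, 4, 6, ... are the
  # pieces between a "Here" and the following "string".
  awk -v FS='(Here|string)' '{ for (i = 2; i <= NF; i += 2) print $i }'
}

echo 'Here is a string dfdsf Here is a string' | between_fields
# prints " is a " twice, one per line (spaces included)
```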

grep with the -P (perl-regexp) option supports \K, which discards the previously matched characters from the reported match. In our case, the previously matched string was Here, so it is dropped from the final output.

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a

If you want the output to be is a (without the surrounding spaces), you could try the following:

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

edited Aug 19 '14 at 15:19

answered Aug 19 '14 at 15:07

Avinash Raj


This does not work for `echo "Here is a string dfdsf Here is a string" | awk -v FS="(Here|string)" '{print $2}'`: it only returns `is a`, when it should return `is a is a`. @Avinash Raj – alper Jan 06 '18 at 12:09

score 12 · Answer 7 · 2020-07-03T14:44:03.230

To understand the sed command, we have to build it step by step.

Here is your original text

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$

Let's try to remove the Here string with the substitution command in sed

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$

At this point, I believe you would be able to remove String as well

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$

But this is not your desired output.

To combine two sed commands, use -e option

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$

Hope this helps

Thank you for this - the explanation on what exactly it's doing was very helpful for me to understand — Kosz, Jul 06 '23 at 18:49

score 10 · Answer 8 · answered Feb 05 '20 at 09:56

You can use two s commands

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a

Also works

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a
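Note which pair this selects when the segments differ: `s/.*Here//` is greedy, so it deletes through the last `Here`, and `s/String.*//` then cuts from the first `String` that remains. A quick check with a made-up input (`last_between` is a hypothetical helper name):

```shell
last_between() {
  sed 's/.*Here//; s/String.*//'
}

echo 'Here is a String Here is b String' | last_between   # " is b "
```

In other words, this form returns the text between the last `Here` and the `String` that follows it.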

potong · Answer 9 · 2012-11-06T00:50:37.637

9

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file

This presents each representation of text between two markers (in this instance Here and String) on a newline and preserves newlines within the text.

edited Nov 06 '12 at 00:50

answered Nov 06 '12 at 00:42

potong


Gary Dean · Answer 10 · 2015-06-17T06:19:29.483

8

All the above solutions have deficiencies where the last search string is repeated elsewhere in the string. I found it best to write a bash function.

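    # Usage: str_str STRING START END
    # Prints the text after the first START and before the next END.
    function str_str {
      local str
      # Quoting $2 and $3 makes them literal text rather than glob patterns.
      str="${1#*"$2"}"       # drop the shortest prefix ending in START
      str="${str%%"$3"*}"    # drop the longest suffix starting at END
      printf '%s' "$str"     # like echo -n, but safe for strings such as "-e"
    }

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"   # prints "is a" (no trailing newline)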

edited Jun 17 '15 at 06:19

answered Jun 17 '15 at 04:45

Gary Dean


score 4 · Answer 11 · edited Aug 19 '14 at 21:14

4

You can use \1 (refer to http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

The content matched inside the escaped parentheses `\(` and `\)` is stored as \1.
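A small illustration of how the numbered groups behave, using two groups so that `\1` and `\2` can be told apart (the replacement text is made up for the example):

```shell
# \(Hello\) is group 1 and \(.*\) is group 2; the replacement can reuse
# them in any order.
echo 'Hello is a String' | sed 's/\(Hello\)\(.*\)String/[\2] came after \1/'
# prints: [ is a ] came after Hello
```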

edited Aug 19 '14 at 21:14

Peter Mortensen


answered Nov 06 '12 at 00:19

mvairavan


This removes strings instead of output something in between. Try removing "Hello" with "is" in the sed command and it will output "Hello a" – Jonathan May 26 '19 at 16:19

score 1 · Answer 12 · answered Dec 01 '17 at 22:51

Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject lines:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Per A2 in this thread, How to use sed/grep to extract text between two words? the first expression, below, "works" as long as the matched text does not contain a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per Extract text between two strings on different lines

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

Solution 2.

Per How can I replace a newline (\n) using sed?

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace newlines with a space.

Chaining that with A2 in How to use sed/grep to extract text between two words?, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]]

This variant removes double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
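As a possible third approach, the wrapped header can be unfolded with awk by relying on the mail convention that continuation lines begin with a space or tab. This is a sketch that swaps in awk for sed/grep; `unfold_subject` is a hypothetical helper name, and the sample input below is made up.

```shell
unfold_subject() {
  awk '
    /^Subject: / {                       # a new Subject header starts here
      if (in_subj) print subj            # flush a previous one, if any
      sub(/^Subject: /, ""); subj = $0; in_subj = 1; next
    }
    in_subj && /^[ \t]/ {                # continuation line: append it
      sub(/^[ \t]+/, ""); subj = subj " " $0; next
    }
    in_subj { print subj; in_subj = 0 }  # any other line ends the header
    END     { if (in_subj) print subj }
  ' "$@"
}

printf 'Subject: a b\n c d\n e\nMessage-ID: <x>\n' | unfold_subject
# prints: a b c d e
```

Called as `unfold_subject corpus/01`, it would read the message file instead of standard input.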

score 1 · Answer 13 · answered Apr 20 '22 at 13:14

1

`ripgrep`

Here is the example using rg:

$ echo Here is a String | rg 'Here\s(.*)\sString' -r '$1'
is a

answered Apr 20 '22 at 13:14

kenorb


score 0 · Answer 14 · answered Aug 11 '22 at 07:10

Here is my not-so-elegant but working solution:

$ echo 'Here is a String' | sed 's/Here/\n/g'| sed 's/String/\n/g'| sed -r '/^[[:space:]]*$/d'

is a

but works with Here is a String Here is a second String also:

$ echo 'Here is a String Here is a second String' | sed 's/Here/\n/g'| sed 's/String/\n/g'| sed -r '/^[[:space:]]*$/d'

is a
is a second

or:

$ echo 'Here is a String Here is a second String Here is last String' | sed 's/Here/\n/g'| sed 's/String/\n/g'| sed -r '/^[[:space:]]*$/d'

is a
is a second
is last
