Select a string between two other strings in a file (Bash)

Question

I know some people have asked the same question, but I can't get any result. Here is my text:

<html>
<head>
<title>emdee five for life</title>
</head>
<body style="background-color:powderblue;">
<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B</h3><center><form action="" method="post">
<input type="text" name="hash" placeholder="MD5" align='center'></input>
</br>
<input type="submit" value="Submit"></input>
</form></center>
</body>
</html>

I would like to extract 'PeKPATbxnupBGgWTIg5B' from it. I'm doing sed -n "/^h3 align ='center'>$/,/^<h3$/p" thefile but it does not return anything. Please help me :(

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Dec 22 '19 at 22:11
We don't need the HTML to be valid, i just want to extract a string from it. It would be the same if my text was "This is a text" and I would extract the string between "This" and "text". (sorry if my english is bad, I'm from France) — shinobi-y, Dec 22 '19 at 22:20
I just want to know why my bash command doesn't extract "PeKPATbxnupBGgWTIg5B" from my text. — shinobi-y, Dec 22 '19 at 22:21

score 2 · Answer 1 · answered Dec 22 '19 at 22:50

The correct way would be to use an XML/HTML parser.

If your text was

...
<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B
</h3><center><form action="" method="post">
...

then

sed -n "/<h3 align='center'>/,/^<\/h3>/p" thefile

would return

<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B
</h3><center><form action="" method="post">

which is not what you want. The form /<start>/,/<end>/ is a line range: it selects whole lines, from the first line matching <start> through the next later line matching <end>. It cannot extract part of a line.

You could use a substitution using a backreference to match your desired string like

sed -n "s/.*<h3 align='center'>\(.*\)<\/h3>.*/\1/p" thefile

which returns

PeKPATbxnupBGgWTIg5B

Using a grep that supports Perl-compatible regular expressions (PCRE), such as GNU grep, you could use

grep -P -o "<h3 align='center'>\K.*(?=</h3>)" thefile

-P enables Perl-compatible regular expressions
-o prints only the matching part of each line
<h3 align='center'>\K matches <h3 align='center'>, then \K discards everything matched so far, so the tag is not included in the output (the effect resembles a positive lookbehind, but \K is not one)
.* matches any characters
(?=</h3>) is a positive lookahead: it requires </h3> to follow but does not include it in the match
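
Not every grep supports -P (it is not part of POSIX grep), so a script may want to fall back to the sed substitution shown earlier. The following is a minimal sketch under that assumption; extract_h3 is a hypothetical helper name, and [^<]* is used in place of .* so the match cannot run past the first closing tag.

```shell
# Hypothetical helper: print the text inside <h3 align='center'>...</h3>, read from stdin.
extract_h3() {
    # Probe whether this grep accepts -P; the probe reads its own pipe, not our stdin.
    if printf 'x\n' | grep -P 'x' >/dev/null 2>&1; then
        # PCRE available: \K drops the opening tag, the lookahead leaves out </h3>.
        grep -P -o "<h3 align='center'>\K[^<]*(?=</h3>)"
    else
        # Fallback: sed substitution with a capture group.
        sed -n "s/.*<h3 align='center'>\([^<]*\)<\/h3>.*/\1/p"
    fi
}

# Sample line taken from the question's HTML (shortened after the closing tag).
sample="<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B</h3><center>"
result=$(printf '%s\n' "$sample" | extract_h3)
printf '%s\n' "$result"
```

With a real file, the helper would be called as extract_h3 < thefile.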

score 0 · Accepted Answer · answered Dec 23 '19 at 04:12

The sed command sed -n '/pattern1/,/pattern2/p' does extract the lines between pattern1 and pattern2, inclusive, provided the two patterns are on separate lines.
For instance, the following test code:

cat <<EOS | sed -n '/pattern1/,/pattern2/p'
foo
bar
pattern1
These lines
are printed.
pattern2
baz
EOS

outputs:

pattern1
These lines
are printed.
pattern2

However, the sed command above does not work if the patterns are located on the same line. Moreover, the caret ^ and the dollar sign $ match the start and the end of the line respectively; they do not mark the boundaries of the substring to extract.
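
To see the same-line case concretely, here is a small sketch with made-up input. Once a line matches pattern1, sed looks for pattern2 only on the lines that follow, so a pattern2 on the same line is never treated as the end of the range and printing continues to the end of the input:

```shell
# Made-up input: both patterns sit on the second line.
input='foo
pattern1 middle pattern2
bar
baz'

# The range opens on line 2; pattern2 is searched for from line 3 onward and never found.
range_out=$(printf '%s\n' "$input" | sed -n '/pattern1/,/pattern2/p')
printf '%s\n' "$range_out"
```

This prints the whole second line followed by bar and baz, not the text between the two patterns.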

Would you try the following instead:
(Needless to say I don't intend to parse XML files with sed. This is just a case study of substring extraction with sed.)

sed -n "s/.*h3 align='center'>\([^<]*\)<\/h3.*/\1/p" thefile

The pattern .*h3 align='center'>\([^<]*\)<\/h3.* matches:

A substring which includes h3 align='center' and any preceding characters back to the start of the string.
Followed by a series of any character excluding <.
Followed by a substring which includes </h3 and any trailing characters up to the end of the line.

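Then the s (substitute) command replaces the whole match with the second substring above, referenced as \1. Because the match spans the entire line, only that substring remains, and the p flag together with -n prints only the lines where a substitution took place.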

Let me go into detail about the second pattern \([^<]*\).

The character class [^<] matches any character other than <.
The "other than <" restriction stops the match just before the following substring </h3. Without it, a greedy .* could run past that point to a later </h3 on the same line.
The asterisk * is a quantifier that sets the number of repetitions of the previous atom. In this case it matches a substring of zero or more characters other than <.
The surrounding \( and \) create a capture group, and the captured substring can be referenced in the replacement as \n (where n is the group's number in order of appearance).
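
The effect of greedy matching can be checked with a small experiment. The sample line below is made up for illustration and contains two </h3 closings; compare \(.*\) with \([^<]*\):

```shell
# Made-up line with two closing </h3> tags.
line="<h3 align='center'>abc</h3><h3>def</h3>"

# Greedy .* inside the group runs on to the LAST </h3 on the line.
greedy=$(printf '%s\n' "$line" | sed -n "s/.*h3 align='center'>\(.*\)<\/h3.*/\1/p")

# [^<]* cannot cross a '<', so the group stops at the FIRST </h3.
bounded=$(printf '%s\n' "$line" | sed -n "s/.*h3 align='center'>\([^<]*\)<\/h3.*/\1/p")

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The greedy version yields abc</h3><h3>def, while the bounded version yields abc.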

Hope this helps.
