
I know some people have asked the same question, but I can't get any result. Here is my text:

<html>
<head>
<title>emdee five for life</title>
</head>
<body style="background-color:powderblue;">
<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B</h3><center><form action="" method="post">
<input type="text" name="hash" placeholder="MD5" align='center'></input>
</br>
<input type="submit" value="Submit"></input>
</form></center>
</body>
</html>

I would like to extract 'PeKPATbxnupBGgWTIg5B' from it. I'm running sed -n "/^h3 align ='center'>$/,/^<h3$/p" thefile, but it does not return anything. Please help me :(

shinobi-y
  • Please post valid HTML. – Cyrus Dec 22 '19 at 22:11
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Dec 22 '19 at 22:11
  • We don't need the HTML to be valid; I just want to extract a string from it. It would be the same if my text were "This is a text" and I wanted to extract the string between "This" and "text". (Sorry if my English is bad, I'm from France.) – shinobi-y Dec 22 '19 at 22:20
  • I just want to know why my bash command doesn't extract "PeKPATbxnupBGgWTIg5B" from my text. – shinobi-y Dec 22 '19 at 22:21

2 Answers

The correct way would be to use an XML/HTML parser.

If your text was

...
<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B
</h3><center><form action="" method="post">
...

then

sed -n "/<h3 align='center'>/,/^<\/h3>/p" thefile

would return

<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B
</h3><center><form action="" method="post">

which is not what you want. The form /<start>/,/<end>/ selects whole lines, from the first line matching <start> through the next line matching <end>, so it cannot cut a substring out of a line.

You could use a substitution using a backreference to match your desired string like

sed -n "s/.*<h3 align='center'>\(.*\)<\/h3>.*/\1/p" thefile

which returns

PeKPATbxnupBGgWTIg5B
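
As a rough sketch of why the -n option and the p flag are both there, the snippet below runs the same substitution against a trimmed copy of the page. The variable names html, token and all_lines are only for illustration. With -n and p, only the line where the substitution succeeded is printed; without them, sed would also echo every other line unchanged.

```shell
# Hypothetical sample input: a trimmed copy of the page from the question.
html=$(cat <<'EOS'
<html>
<head>
<title>emdee five for life</title>
</head>
<body style="background-color:powderblue;">
<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B</h3><center><form action="" method="post">
</form></center>
</body>
</html>
EOS
)

# -n turns off sed's automatic printing; the p flag prints a line only
# when the substitution on that line succeeded.
token=$(printf '%s\n' "$html" | sed -n "s/.*<h3 align='center'>\(.*\)<\/h3>.*/\1/p")

# Without -n and p, all 9 input lines come back (8 of them unchanged).
all_lines=$(printf '%s\n' "$html" | sed "s/.*<h3 align='center'>\(.*\)<\/h3>.*/\1/" | wc -l)

printf '%s\n' "$token"
```

Capturing the result in a variable like this makes it easy to pass the string on to another command.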

With a grep that supports Perl-compatible regular expressions (PCRE), such as GNU grep, you could use

grep -P -o "<h3 align='center'>\K.*(?=</h3>)" thefile
  • -P enables Perl-compatible regular expressions
  • -o prints only the matching part of each line
  • <h3 align='center'>\K matches <h3 align='center'>, then \K drops everything matched so far from the reported match; the effect is similar to a positive lookbehind
  • .* matches any characters (greedily, so it runs up to the last </h3> on the line)
  • (?=</h3>) is a positive lookahead: </h3> must follow, but it is not included in the match
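
To make the role of \K visible, here is a small sketch that runs the pattern with and without it on a shortened sample line. It assumes a grep built with PCRE support (GNU grep usually is); line, with_k and without_k are names made up for this example.

```shell
# Hypothetical shortened sample line based on the question's HTML.
line="<h3 align='center'>PeKPATbxnupBGgWTIg5B</h3><center>"

# With \K, the opening tag must match but is dropped from the output.
with_k=$(printf '%s\n' "$line" | grep -P -o "<h3 align='center'>\K.*(?=</h3>)")

# Without \K, the opening tag is part of the reported match.
without_k=$(printf '%s\n' "$line" | grep -P -o "<h3 align='center'>.*(?=</h3>)")

printf '%s\n' "$with_k"      # PeKPATbxnupBGgWTIg5B
printf '%s\n' "$without_k"   # <h3 align='center'>PeKPATbxnupBGgWTIg5B
```

In both cases the lookahead keeps </h3> out of the output.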
Freddy

The sed command sed -n '/pattern1/,/pattern2/p' does extract the lines between pattern1 and pattern2, inclusive, when the two patterns are on separate lines.
For instance, the following test code:

cat <<EOS | sed -n '/pattern1/,/pattern2/p'
foo
bar
pattern1
These lines
are printed.
pattern2
baz
EOS

outputs:

pattern1
These lines
are printed.
pattern2

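However, the sed command above does not work if both patterns are on the same line: a range address selects whole lines, and the end pattern is only tested on lines after the one that matched the start pattern. Moreover, the caret ^ and the dollar sign $ anchor the regex to the start and end of the line, respectively. They do not mark the boundaries of the substring to extract.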
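
Both points can be checked with a quick sketch on a single sample line (the variable names are just for illustration). The command from the question prints nothing, because its start address would need a line consisting of exactly h3 align ='center'>. Dropping the anchors does not help either: the range then starts on the line, never finds a later end line, and the whole line is printed.

```shell
# Hypothetical single-line sample taken from the question's HTML.
line="<h1 align='center'>MD5 encrypt this string</h1><h3 align='center'>PeKPATbxnupBGgWTIg5B</h3>"

# Command from the question: ^...$ requires the entire line to equal
# "h3 align ='center'>", so the range never starts.
original=$(printf '%s\n' "$line" | sed -n "/^h3 align ='center'>$/,/^<h3$/p")

# Unanchored range: it starts on this line, and the end pattern is only
# looked for on following lines, so the full line is printed as is.
unanchored=$(printf '%s\n' "$line" | sed -n "/<h3 align='center'>/,/<\/h3>/p")

printf 'original: [%s]\n' "$original"
printf 'unanchored: [%s]\n' "$unanchored"
```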

Would you try the following instead:
(Needless to say I don't intend to parse XML files with sed. This is just a case study of substring extraction with sed.)

sed -n "s/.*h3 align='center'>\([^<]*\)<\/h3.*/\1/p" thefile

The pattern .*h3 align='center'>\([^<]*\)<\/h3.* matches:

  • A substring which ends with h3 align='center'> and includes any preceding characters back to the start of the line.
  • Followed by a series of any characters other than <.
  • Followed by a substring which starts with </h3 and includes any trailing characters up to the end of the line.

Since the pattern covers the whole line, the s (substitute) command replaces the entire line with \1, which is the second substring above. The p flag together with -n then prints only that line, so the command extracts the second substring from the matched line.

Let me go into detail about the second pattern \([^<]*\).

  • The character class [^<] matches any single character other than <.
  • Excluding < is what stops the match just before the following </h3. With .* instead, the greedy match could run past it to a later </h3 on the same line.
  • The asterisk * is a quantifier which repeats the previous atom zero or more times. Here it matches a substring of any length (possibly empty) made of characters other than <.
  • The surrounding parens \( and \) create a capture group, and the captured substring can be referred to as \n in the replacement (where n is the number of the group in order of appearance).
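
To see why [^<]* matters, here is a small sketch using a made-up line (not from the question) that contains a second </h3 later on. The names line, greedy and bounded are only for illustration.

```shell
# Hypothetical line with a second </h3> further along.
line="<h3 align='center'>abc</h3><p>x</p><h3>other</h3>"

# \(.*\) is greedy: it runs on to the LAST </h3 on the line.
greedy=$(printf '%s\n' "$line" | sed -n "s/.*h3 align='center'>\(.*\)<\/h3.*/\1/p")

# \([^<]*\) cannot cross a '<', so it stops at the first closing tag.
bounded=$(printf '%s\n' "$line" | sed -n "s/.*h3 align='center'>\([^<]*\)<\/h3.*/\1/p")

printf '%s\n' "$greedy"    # abc</h3><p>x</p><h3>other
printf '%s\n' "$bounded"   # abc
```

On the line from the question both forms give the same result, since it has only one </h3; the difference shows up only when another closing tag follows.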

Hope this helps.

tshiono