Cutting certain string of variable

Question

I'd like to cut off some special strings of a variable. The variable contains the following, including a lot of blank space before <div... and a class attribute:

           <div data-href="/www.somewebspace.com" class="class1 class2">

I would like to extract the contents of the data-href attribute i.e have this output /www.somewebspace.com

I tried the following code, but the output contains the value of the data-href attribute followed by the class attribute as well.

echo $Test | grep -oP '(?<=<div data-href=").*(?=")'

How can I get rid of the class attribute?

Kind regards and grateful for every reply, X3nion

P.S. Another question arose. I've got these strings that I'd like to extract from a text file:

                <div class="aditem-addon">
                   Today, 23:23</div>

What would be the correct command to extract only "Today, 23:23", without any whitespace before or after it? Maybe I would have to delete the blank spaces first?

While this one is probably solvable, but obligatory: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — JNevill, Aug 21 '20 at 23:56
[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Aug 22 '20 at 05:18
@X3nion if any of the answers solve your problem mark them as the selected answer. — Abdessabour Mtk, Aug 25 '20 at 02:03

Abdessabour Mtk · Answer 1 · 2020-08-22T14:54:19.453

0

Your regex is correct; you only need to adjust the greediness of the * quantifier:

* is a greedy quantifier: it matches as many characters as possible while still producing a match
*? is a reluctant quantifier: it matches as few characters as possible while still producing a match

# Correct
Test='<div data-href="/www.somewebspace.com" class="fdgks"></div>'
echo $Test | grep -oP '(?<=<div data-href=").*?(?=")'
#> /www.somewebspace.com
# the desired output

# WRONG
echo $Test | grep -oP '(?<=<div data-href=").*(?=")'
#> /www.somewebspace.com" class="fdgks
# didn't stop until it matched the last quote `"`
echo $Test$Test | grep -oP '(?<=<div data-href=").*(?=")'
#> /www.somewebspace.com" class="fdgks"></div><div data-href="/www.somewebspace.com" class="fdgks
# same as the last one

for a more detailed explanation about the difference between greedy, reluctant and possessive quantifiers (see)
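
As an alternative to the reluctant quantifier, a negated character class avoids the greediness issue altogether. The following is a minimal sketch, assuming GNU grep built with PCRE support (-P); the sample value of Test mirrors the line from the question, and the variable name href is only illustrative.

```shell
# Assumed sample input with leading blanks, as in the question.
Test='           <div data-href="/www.somewebspace.com" class="class1 class2">'

# [^"]* can never run past the closing quote, so greediness is harmless here.
# \K drops everything matched so far from the reported match,
# which replaces the lookbehind.
href=$(printf '%s\n' "$Test" | grep -oP 'data-href="\K[^"]*')
printf '%s\n' "$href"
```

Because the pattern no longer anchors on `<div `, it also matches when other attributes come before data-href.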

EDIT

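echo "$Test$Test" | grep -Poz '(?<=<div class="aditem-addon">\n ).*?(?=<\/div>)' | tr '\0' '\n'
#> Today, 23:23
#> Today, 23:23

`\n ` in the lookbehind matches a newline followed by a single leading space. The variable must be quoted ("$Test$Test"), otherwise the shell turns the newline into a space before grep sees it.

If the string you're looking for spans a newline character \n, you need to add the z option to grep, i.e. the call becomes grep -ozP. With -z, grep also terminates each match with a NUL byte instead of a newline, hence the tr.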
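
When the amount of indentation before the text is not known in advance, a fixed-width lookbehind is awkward. Below is a sketch under the same assumption (GNU grep with -P) that uses \s*\K instead; the variable names Html and stamp are illustrative, and the sample input is copied from the question.

```shell
# Assumed sample input: the tag on one line, the text indented on the next.
Html='                <div class="aditem-addon">
                   Today, 23:23</div>'

# -z: treat the whole input as one record so the pattern can cross the newline.
# \s*\K: consume the newline and any indentation, then drop them from the match.
# [^<]*?(?=\s*</div>): lazily take the text up to the closing tag,
# leaving out any blanks directly before it.
# tr: -z makes grep end each match with a NUL byte; turn it into a newline.
stamp=$(printf '%s\n' "$Html" \
  | grep -ozP '<div class="aditem-addon">\s*\K[^<]*?(?=\s*</div>)' \
  | tr '\0' '\n')
printf '%s\n' "$stamp"
```

The same command should work on a file by replacing the printf with a file argument to grep.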

edited Aug 22 '20 at 14:54

answered Aug 22 '20 at 00:01

Abdessabour Mtk


Hey Abdessabour Mtk, Thanks for your reply! That worked out very well. How did you see the missing adjustment regarding the greediness? – X3nion Aug 22 '20 at 00:08
If you don't tell the regex engine to be non-greedy, it will gobble up as many characters as possible, i.e. it won't stop until the last `"`. For example, if your div had another attribute after the class, `grep` would return it too. – Abdessabour Mtk Aug 22 '20 at 00:12
@X3nion I added an explanation and some references – Abdessabour Mtk Aug 22 '20 at 00:23
@ Abdessabour Mtk Thanks for your detailed reply! I added something as P.S. in my message. Could you maybe answer to it? – X3nion Aug 22 '20 at 00:31
Thanks for your reply! It works well with -oP although the Today is in a new line. Shall I better use -ozP to be on the safe side? And there is one single whitespace in front of Today. Do you have an idea how to fix this? – X3nion Aug 22 '20 at 12:19
@X3nion check the new answer – Abdessabour Mtk Aug 22 '20 at 14:54

score 0 · Answer 2 · answered Aug 24 '20 at 04:53

Unless the input is very simple, consider using xmllint or another HTML parsing tool. For very simple cases, you can use a pure shell solution based on parameter expansion:

#! /bin/sh
s='           <div data-href="/www.somewebspace.com" class="class1 class2"> '

s1=${s##*data-href=\"}
s1=${s1%%\"*}
echo "$s1"

Which will print

/www.somewebspace.com
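
The same parameter-expansion approach can probably be stretched to the second snippet from the question, including trimming the surrounding whitespace. This is only a sketch for that exact two-line layout; the variable t is illustrative, and it assumes a shell whose patterns support the [:space:] character class.

```shell
#!/bin/sh
s='                <div class="aditem-addon">
                   Today, 23:23</div>'

t=${s#*aditem-addon\">}   # drop everything up to and including the opening tag
t=${t%%</div>*}           # drop the closing tag and anything after it

# ${t%%[![:space:]]*} is the leading run of whitespace; strip it from the front.
t=${t#"${t%%[![:space:]]*}"}
# ${t##*[![:space:]]} is the trailing run of whitespace; strip it from the end.
t=${t%"${t##*[![:space:]]}"}

echo "$t"
```

Like the answer's original snippet, this breaks as soon as the markup varies, which is where a real HTML parser becomes the safer choice.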
