
I'd like to extract part of a string stored in a variable. The variable contains the following, with a lot of blank space before <div ... and a class attribute after the part I want:

           <div data-href="/www.somewebspace.com" class="class1 class2"> 

I would like to extract the contents of the data-href attribute, i.e. get this output: /www.somewebspace.com

I tried the following code, but the output contains the contents of the data-href attribute followed by the class attribute:

echo $Test | grep -oP '(?<=<div data-href=").*(?=")'

How can I get rid of the class attribute?

Kind regards and grateful for every reply, X3nion

P.S. Another question arose. I have these strings that I'd like to extract from a text file:

                <div class="aditem-addon">
                   Today, 23:23</div>

What would be the correct command to extract only "Today, 23:23", without any whitespace before or after it? Maybe I would have to delete the blank spaces first?


2 Answers


Your regex is correct; you only need to adjust the greediness of the * quantifier:

  • * is a greedy quantifier: it matches as much as possible while still allowing the overall match to succeed
  • *? is a reluctant quantifier: it matches as few characters as needed for the overall match to succeed

# Correct
Test='<div data-href="/www.somewebspace.com" class="fdgks"></div>'
echo "$Test" | grep -oP '(?<=<div data-href=").*?(?=")'
#> /www.somewebspace.com
# the desired output

# WRONG
echo "$Test" | grep -oP '(?<=<div data-href=").*(?=")'
#> /www.somewebspace.com" class="fdgks
# greedy .* did not stop until it reached the last quote `"`
echo "$Test$Test" | grep -oP '(?<=<div data-href=").*(?=")'
#> /www.somewebspace.com" class="fdgks"></div><div data-href="/www.somewebspace.com" class="fdgks
# same problem: the match runs to the last quote on the line

for a more detailed explanation about the difference between greedy, reluctant and possessive quantifiers (see)
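
As a side note to the comparison above, a negated character class is another common way to stop at the first quote. The following is only a sketch, not part of the original answer: it assumes GNU grep built with -P support, reuses the sample string from the question, and uses printf with a quoted variable so the leading blanks reach grep unchanged.

```shell
#!/bin/sh
# Sample input modeled on the question, including the leading blanks.
Test='           <div data-href="/www.somewebspace.com" class="class1 class2"> '

# Greedy .* runs on to the last double quote on the line.
greedy=$(printf '%s\n' "$Test" | grep -oP '(?<=<div data-href=").*(?=")')

# Reluctant .*? stops at the first double quote.
lazy=$(printf '%s\n' "$Test" | grep -oP '(?<=<div data-href=").*?(?=")')

# [^"]* cannot cross a double quote, so no lookahead is needed.
negated=$(printf '%s\n' "$Test" | grep -oP '(?<=<div data-href=")[^"]*')

printf 'greedy:  %s\n' "$greedy"
printf 'lazy:    %s\n' "$lazy"
printf 'negated: %s\n' "$negated"
```

The reluctant and negated variants should both print only /www.somewebspace.com, while the greedy one also carries the class attribute along.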


EDIT

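Test='<div class="aditem-addon">
 Today, 23:23</div>'
echo "$Test$Test" | grep -Poz '(?<=<div class="aditem-addon">\n ).*?(?=</div>)' | tr '\0' '\n'
#> Today, 23:23
#> Today, 23:23

  • \n followed by a space in the lookbehind matches the newline and a single leading space.

If the string you're looking for spans a newline character \n, you need to add the z option to grep, i.e. the call becomes grep -ozP. With -z, each match is written with a trailing NUL byte instead of a newline, so tr '\0' '\n' is used here to put the matches on separate lines. The variable must also be quoted; an unquoted echo $Test collapses the newline into a space before grep ever sees it.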
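
The lookbehind in the EDIT expects exactly one space after the newline, because PCRE lookbehinds must have a fixed length. If the amount of whitespace varies, as in the indented snippet from the question, one possible workaround is \K, which discards everything matched so far from the reported result. This is a sketch under assumptions: GNU grep with -P support, and a sample variable named html that is not in the original post.

```shell
#!/bin/sh
# Sample input modeled on the P.S. in the question (variable name is illustrative).
html='                <div class="aditem-addon">
                   Today, 23:23</div>'

# -z treats the whole input as one record, so \s* can cross the newline.
# \s*\K consumes the blanks and then drops them from the reported match.
# The lookahead (?=\s*</div>) also keeps trailing blanks out of the match.
# tr turns the NUL that -z appends to each match back into a newline.
stamp=$(printf '%s\n' "$html" |
    grep -ozP '<div class="aditem-addon">\s*\K.*?(?=\s*</div>)' |
    tr '\0' '\n')

printf '[%s]\n' "$stamp"
```

The brackets in the final printf are only there to make any leftover whitespace visible.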

Abdessabour Mtk
  • Hey Abdessabour Mtk, Thanks for your reply! That worked out very well. How did you see the missing adjustment regarding the greediness? – X3nion Aug 22 '20 at 00:08
  • If you don't tell the regex engine to be non-greedy, it will gobble up as many characters as it can, i.e. it won't stop until the last `"`. For example, if your div had another attribute after the class, `grep` would return it too. – Abdessabour Mtk Aug 22 '20 at 00:12
  • @X3nion I added an explanation and some references – Abdessabour Mtk Aug 22 '20 at 00:23
  • @Abdessabour Mtk Thanks for your detailed reply! I added something as a P.S. to my question. Could you maybe answer it? – X3nion Aug 22 '20 at 00:31
  • Thanks for your reply! It works well with -oP even though "Today" is on a new line. Should I use -ozP to be on the safe side? Also, there is a single whitespace character in front of "Today". Do you have an idea how to fix this? – X3nion Aug 22 '20 at 12:19
  • @X3nion check the new answer – Abdessabour Mtk Aug 22 '20 at 14:54

Unless the input is very simple, consider using xmllint or another HTML parsing tool. For very simple cases, you can use a plain shell solution based on parameter expansion (it works in bash as well as in sh):

#! /bin/sh
s='           <div data-href="/www.somewebspace.com" class="class1 class2"> '

s1=${s##*data-href=\"}
s1=${s1%%\"*}
echo "$s1"

Which will print

/www.somewebspace.com
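
The two expansions can be wrapped in a small function so that the attribute name becomes a parameter. This is a rough sketch, not part of the original answer: extract_attr is a hypothetical helper name, it takes the first occurrence of the attribute (using # where the snippet above uses ##), and it does plain substring matching, so asking for href would also hit data-href.

```shell
#!/bin/sh
# extract_attr is a hypothetical helper. Usage: extract_attr ATTRIBUTE STRING
extract_attr() {
    _attr=$1
    _s=$2
    case $_s in
        *"$_attr=\""*) ;;        # attribute present, carry on
        *) return 1 ;;           # attribute missing: report failure
    esac
    _s=${_s#*"$_attr=\""}        # drop everything up to and including attr="
    printf '%s\n' "${_s%%\"*}"   # drop everything from the next double quote on
}

line='           <div data-href="/www.somewebspace.com" class="class1 class2"> '
href=$(extract_attr data-href "$line")
class=$(extract_attr class "$line")
echo "$href"
echo "$class"
```

Because it relies only on POSIX parameter expansion, this starts no external process per call, which can matter when looping over many lines.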
dash-o