Get span content value with regex (bash)

Question

Hello, I'm trying to get the content of my class top. All I need is the link (without any tags) and the value of the span class title, in bash. I tried something like this as a test, but it does not give any output. What am I doing wrong?

curl -s  https://www.website.com/q?search=violet | grep -e "^<span class=\"top\">(.*?)</span>"


                        <div class="video-item-list">
                            <span class="age0" title="0"></span>
                            <span class="hsa" title="tex"></span>
                            <span class="Encour" title="test"></span>
                            <a href="https://www.website.com/a/1973">
                                <img class="image lazy" width="100" height="40"
                                    data-original="https://img.com/i?jpg=123">
                            </a>
                            <span class="top">
                                <a href="https://www.website.com/a/1973">
                                    <span class="title">Violet test</span>
                                </a>
                                <span class="episode"> 250
                                </span>
                                <a class="team"></a>
                            </span>
                            <span class="info"> 2017</span>
                        </div>
                        <div id="n" class="video-item-list-days">
                            <h5>Letter n</h5>
                        </div>

Please [Don't Parse XML/HTML With Regex](https://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, xmllint, ...). - Cyrus, Mar 27 '22 at 01:56
@Cyrus I suggest that you *not* post the link to that answer, because chances are that OP won't understand it. You and I may laugh at it because we understand what it's saying, but rookies looking for help won't get it. Instead, point to something that actually explains the problem. For example, I created http://htmlparsing.com/regexes.html to give examples of why HTML+regexes are painful. - Andy Lester, Mar 27 '22 at 04:17

score 0 · Answer 1 · answered Mar 27 '22 at 04:39

As mentioned in the comments, regular expressions are the wrong tool for working with HTML. One approach uses an XSLT stylesheet and xsltproc:

example.xslt:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:template match="/">
    <!-- One output line per <span class="top"> -->
    <xsl:for-each select="//span[@class='top']">
      <!-- Link target, then a tab -->
      <xsl:value-of select="a[@href]/@href" />
      <xsl:text>&#09;</xsl:text>
      <!-- Text of the title span, then a newline -->
      <xsl:value-of select="a[@href]/span[@class='title']" />
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Usage:

$ curl -s 'https://www.website.com/q?search=violet' | xsltproc --html example.xslt -
https://www.website.com/a/1973  Violet test
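
For a quick check without a stylesheet, the same two values can probably be pulled out with XPath through xmllint (one of the parsers suggested in the comments). This is only a sketch under some assumptions: the markup is a trimmed copy of the sample from the question written to a temporary file, the variable names are made up for illustration, and string(...) returns only the first match, so it will not list every result on a real page.

```shell
#!/usr/bin/env bash
# Trimmed copy of the sample markup from the question, stored in a temporary
# file so the example does not depend on the network.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="video-item-list">
  <span class="top">
    <a href="https://www.website.com/a/1973">
      <span class="title">Violet test</span>
    </a>
    <span class="episode"> 250
    </span>
    <a class="team"></a>
  </span>
  <span class="info"> 2017</span>
</div>
EOF

# --html selects the lenient HTML parser; string(...) yields the text value
# of the first node that matches the XPath expression.
link=$(xmllint --html --xpath 'string(//span[@class="top"]/a/@href)' "$sample" 2>/dev/null)
title=$(xmllint --html --xpath 'string(//span[@class="top"]/a/span[@class="title"])' "$sample" 2>/dev/null)

# Same layout as the xsltproc output: link, tab, title.
printf '%s\t%s\n' "$link" "$title"

rm -f "$sample"
```

With xmllint (from libxml2) installed, this should print the link and the title separated by a tab, the same line the xsltproc run produces.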

score 0 · Answer 2 · Dudi Boy · 2022-03-27T12:35:08.873

Here is a regex pattern that matches only the class value of each span (class has to be the first attribute):

grep -oP '(?<=<span class=")[^"]+'

Tested for your sample:

age0
hsa
Encour
top
title
episode
info

Not sure if that was your intention.

If you need only the classes of spans that are closed on the same line:

grep -oP '(?<=<span class=")[^"]+(?=".*</span>)' input.1.txt

Tested for your sample:

age0
hsa
Encour
title
info
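
To reproduce both outputs without fetching the page, the two patterns can be run against a copy of the sample held in a variable. This is a sketch that assumes GNU grep built with PCRE support (the -P option); the variable names are made up for illustration, and the sample is reduced to the lines that contain a span tag.

```shell
#!/usr/bin/env bash
# Lines from the question's sample that contain a span tag.
sample='<span class="age0" title="0"></span>
<span class="hsa" title="tex"></span>
<span class="Encour" title="test"></span>
<span class="top">
<span class="title">Violet test</span>
<span class="episode"> 250
</span>
<span class="info"> 2017</span>'

# Pattern 1: class value of every <span class="..."> opening tag.
all_classes=$(printf '%s\n' "$sample" \
  | grep -oP '(?<=<span class=")[^"]+' | paste -sd ' ' -)

# Pattern 2: same, but only when </span> follows on the same line.
closed_classes=$(printf '%s\n' "$sample" \
  | grep -oP '(?<=<span class=")[^"]+(?=".*</span>)' | paste -sd ' ' -)

printf 'all:    %s\n' "$all_classes"
printf 'closed: %s\n' "$closed_classes"
```

The second list drops top and episode because their closing tags sit on a later line, which is one reason line-based regex matching is fragile for pretty-printed HTML.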

edited Mar 27 '22 at 12:35

answered Mar 27 '22 at 10:19


score 0 · Answer 3 · answered Mar 27 '22 at 19:11

Thanks everyone. I did this and it works. Maybe it's a bad idea, but I will see later:

(?<=<span class="top">).*?(?=<\/span>)
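
For anyone reusing this, here is a small sketch of how the pattern behaves, assuming it is run through GNU grep with -oP (the answer does not say which tool was used, but the lookbehind and lazy quantifier need a PCRE-style engine). Both inputs are made up from the sample in the question: one with the whole span on a single line, as on a minified page, and one pretty-printed as in the question.

```shell
#!/usr/bin/env bash
pattern='(?<=<span class="top">).*?(?=<\/span>)'

# Case 1: the whole span sits on one line.
one_line='<span class="top"><a href="https://www.website.com/a/1973"><span class="title">Violet test</span></a></span>'
single=$(printf '%s\n' "$one_line" | grep -oP "$pattern")
printf 'single-line match: %s\n' "$single"

# Case 2: the same markup spread over several lines. grep works line by
# line, so no single line holds both the opening tag and a closing tag.
multi_count=$(printf '%s\n' \
  '<span class="top">' \
  '  <a href="https://www.website.com/a/1973">' \
  '    <span class="title">Violet test</span>' \
  '  </a>' \
  '</span>' | grep -cP "$pattern")
printf 'multi-line matching lines: %s\n' "$multi_count"
```

In the first case the lazy match stops at the first closing span tag, so the result still contains the a and span tags rather than the bare link and title. In the second case nothing matches at all. Either way some post-processing, or a parser as in the other answer, is still needed.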

meteor314
