
Hello, I'm trying to get the content of my span with class "top". All I need, in bash, is the link (without any tags) and the value of the span with class "title". As a test I tried the following, but it does not print anything. What am I doing wrong?

curl -s  https://www.website.com/q?search=violet | grep -e "^<span class=\"top\">(.*?)</span>"

                        <div class="video-item-list">
                            <span class="age0" title="0"></span>
                            <span class="hsa" title="tex"></span>
                            <span class="Encour" title="test"></span>
                            <a href="https://www.website.com/a/1973">
                                <img class="image lazy" width="100" height="40"
                                    data-original="https://img.com/i?jpg=123">
                            </a>
                            <span class="top">
                                <a href="https://www.website.com/a/1973">
                                    <span class="title">Violet test</span>
                                </a>
                                <span class="episode"> 250
                                </span>
                                <a class="team"></a>
                            </span>
                            <span class="info"> 2017</span>
                        </div>
                        <div id="n" class="video-item-list-days">
                            <h5>Letter n</h5>
                        </div>
Wiktor Stribiżew
meteor314
    Please [Don't Parse XML/HTML With Regex](https://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, xmllint, ...). – Cyrus Mar 27 '22 at 01:56
    @Cyrus I suggest that you *not* post the link to that answer, because chances are that the OP won't understand it. You and I may laugh at it because we understand what it's saying, but rookies looking for help won't get it. Instead, point to something that actually explains the problem. For example, I created http://htmlparsing.com/regexes.html to give examples of why HTML+regexes are painful. – Andy Lester Mar 27 '22 at 04:17

3 Answers


As mentioned in the comments, regular expressions are the wrong tool for working with HTML. One approach uses an XSLT stylesheet with xsltproc:

example.xslt:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:template match="/">
    <xsl:for-each select="//span[@class='top']">
      <xsl:value-of select="a[@href]/@href" />
      <xsl:text>&#09;</xsl:text>
      <xsl:value-of select="a[@href]/span[@class='title']" />
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Usage:

$ curl -s 'https://www.website.com/q?search=violet' | xsltproc --html example.xslt -
https://www.website.com/a/1973  Violet test
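
For context, here is a small sketch of why the grep command in the question prints nothing. It assumes the relevant part of the question's HTML has been saved to a temporary file (the `sample` variable is only for illustration). Two things get in the way: `^` requires the tag to start at column 0 although the line is indented, and in grep's default basic syntax `(`, `)` and `?` are literal characters. Even without those problems, grep works line by line and the span covers several lines.

```shell
# Hypothetical sample: the "top" span from the question, saved to a temp file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
                            <span class="top">
                                <a href="https://www.website.com/a/1973">
                                    <span class="title">Violet test</span>
                                </a>
                                <span class="episode"> 250
                                </span>
                                <a class="team"></a>
                            </span>
EOF

# Pattern from the question: no line starts with "<span" at column 0, and
# "(", ")" and "?" are taken literally, so the match count is 0.
grep -c -e "^<span class=\"top\">(.*?)</span>" "$sample" || true   # prints 0

# Without the anchor and the group, grep finds the opening tag,
# but only that single line, never the content on the following lines.
grep -c '<span class="top">' "$sample"                              # prints 1
```

This is why a parser-based approach such as the stylesheet above is more reliable here than adjusting the pattern.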
Shawn

Here is a regex pattern that prints the class value of every span tag that starts with <span class="...", not only the "top" span:

grep -oP '(?<=<span class=")[^"]+'

Tested for your sample:

age0
hsa
Encour
top
title
episode
info

Not sure if that was your intention.

If you only need the spans whose closing </span> is on the same line as the opening tag:

grep -oP '(?<=<span class=")[^"]+(?=".*</span>)' input.1.txt

Tested for your sample:

age0
hsa
Encour
title
info
Dudi Boy

Thanks, everyone. I used the following pattern and it works. It may be a bad idea, but I will revisit that later.

(?<=<span class="top">).*?(?=<\/span>)
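
One caveat may be worth checking before relying on this pattern. The sketch below assumes GNU grep built with PCRE support (`-P`) and a temporary file holding a shortened copy of the question's HTML; the `sample` variable is only for illustration. On multi-line input like the sample, the pattern matches nothing line by line. Reading the whole file as one record with `-z` and letting `.` cross newlines with `(?s)` does produce a match, but the lazy `.*?` stops at the first `</span>`, which closes the inner "title" span, so the result still contains raw tags.

```shell
# Hypothetical sample: shortened "top" span from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<span class="top">
    <a href="https://www.website.com/a/1973">
        <span class="title">Violet test</span>
    </a>
</span>
EOF

# Line by line, no single line has both the opening tag before the match
# and a closing </span> after it, so grep finds nothing.
grep -oP '(?<=<span class="top">).*?(?=</span>)' "$sample" || echo "no match"

# -z reads the file as one record; (?s) lets "." match newlines.
# The match ends at the FIRST </span>, so it includes the <a ...> and
# <span class="title"> tags rather than just the link and the title.
grep -zoP '(?s)(?<=<span class="top">).*?(?=</span>)' "$sample" | tr '\0' '\n'
```

If the real page puts the whole span on one line, the original pattern may behave differently, so it is worth testing against the actual curl output.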
meteor314