
I curl an HTML page and store the output in a variable, then try to extract the word between two values, but it fails.

 </tr> <tr> <td><a <a href="https://test/one/AAA">AAA</a></td>
 <td>Thu Aug 30 09:59:36 UTC 2018</td> <td align="right"> 2247366 </td>
 <td></td> </tr> <tr> <td><a
 href="https://test/one/1.1.22">1.1.22</a></td> <td>Thu Aug 30 09:59:36
 UTC 2018</td> <td align="right"> 5 </td> <td></td> </tr> </table>
 </body> </html>

 content=$(curl -s https://test/one/)
 echo $content | sed -E 's_.*one/([^"]+).*_\1_'

I am trying to capture the value after one/ and before the closing ", so I want to extract AAA, 1.1.22, and so on.

locklockM

2 Answers

$ ... | sed -E 's_.*one/([^"]+).*_\1_'

AAA
BBB

Since your content contains slashes, it is better to choose a different delimiter; here I used _.

UPDATE: Since you changed the input format dramatically, here is the updated script (the variable must be quoted so that the line breaks survive):

$ echo "$content" | sed -nE '/one/s_.*one/([^"]+).*_\1_p'
AAA
1.1.22
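
The sed command above prints at most one value per input line, because the leading .* is greedy and only the last one/ on a line is captured. If the page happens to put several links on one line, a variant along these lines might help. It is only a sketch: the sample markup is made up for the demo, and grep -o is assumed to be available (it is in GNU and BSD grep, but it is not required by POSIX).

```shell
#!/bin/sh
# Made-up sample standing in for the curl output; both links share one line.
content='<tr><td><a href="https://test/one/AAA">AAA</a></td><td><a href="https://test/one/1.1.22">1.1.22</a></td></tr>'

# grep -o prints every match on its own line, even several per input line;
# sed then strips the leading "one/" (using _ as the delimiter, as above).
names=$(printf '%s\n' "$content" | grep -oE 'one/[^"]+' | sed 's_^one/__')

printf '%s\n' "$names"
```

Quoting "$content" keeps the original line breaks; an unquoted $content is word-split by the shell and collapses into a single line.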
karakfa

Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful query.

Theory:

According to compiler theory, XML/HTML cannot be parsed with regular expressions, which are based on finite state machines. Because XML/HTML is hierarchical, you need a pushdown automaton and an LALR grammar, handled with a tool like YACC.

realLife©®™ everyday tools in a shell:

You can use one of the following:

xmllint, often installed by default with libxml2, xpath1 (check my wrapper to have newline-delimited output)

xmlstarlet can edit, select, transform... Not installed by default, xpath1

xpath installed via perl's module XML::XPath, xpath1

xidel xpath3

saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3

Or you can use a high-level language with proper libraries, such as:

Python's lxml (from lxml import etree)

Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

, check this example

PHP's DOMXPath, check this example


Check: Using regular expressions with HTML tags


Example using XPath:

//a[contains(@href, "https://test/sites/two/one")]
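
As a rough sketch of how such a query might be run from the shell, the following assumes xmllint (from libxml2) is installed. The sample markup is made up to stand in for the curl output, extract_names is a hypothetical helper name, and the XPath is narrowed to "/one/" so that it matches the URLs shown in the question.

```shell
#!/bin/sh
# Made-up sample standing in for: content=$(curl -s https://test/one/)
content='<html><body><table>
<tr><td><a href="https://test/one/AAA">AAA</a></td></tr>
<tr><td><a href="https://test/one/1.1.22">1.1.22</a></td></tr>
</table></body></html>'

# Hypothetical helper: --html selects libxml2's lenient HTML parser,
# --xpath evaluates the query, and "-" reads the document from stdin.
# Parser warnings go to stderr and are discarded here.
extract_names() {
    printf '%s\n' "$1" |
        xmllint --html --xpath '//a[contains(@href, "/one/")]/text()' - 2>/dev/null
}

if command -v xmllint >/dev/null 2>&1; then
    extract_names "$content"
else
    echo "xmllint not found" >&2
fi
```

Older libxml2 releases may print the matched text nodes without a separator between them, which is the problem the wrapper mentioned in the xmllint entry addresses.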
Gilles Quénot