
I would like to extract a string value from a web page returned by curl in a bash script, but I am unsure how to go about this.

The value I am interested in is always returned by curl in a page that looks like this:

    <head>
    <title>UKIPVPN.COM FREE VPN Service</title>
    <style type='text/css'>
      #button {
        width:180px;
        height:60px;
        font-family:verdana,arial,helvetica,sans-serif;
        font-size:20px;
        font-weight: bold;
      }
    </style>
  </head>
  <br>
  <br>
     <font color=blue><center>  <h1>Welcome to Free UK IP VPN Service</h1>               </center></font>

     <form method='post' action='http://www.ukipvpn.com'>
  <center><input type='hidden' name='sessionid' value='4b5q43mhhgl95nsa9v9lg8kac7'></center><br>
  <center><input id='button' type='submit' value='  I AGREE  ' /><br><br>     <h2> Your TOS Let me use the Free VPN Service</h2></center>
     </form>



       <br><center><font size='2'>No illegal activities allowed. In case of abuse, users' VPN access log is subjected to expose to related authorities.</font></center>
       </html>

The string I would like to extract into a Bash variable is the one inside value='...', i.e. value='this is the value I am interested in'.

Thanks for any help.

Andy

andy
  • `grep -oP "\bvalue\s*=\s*'\K[^']*" file` – Avinash Raj Feb 25 '15 at 07:15
  • Sorry, I am new to bash scripting. Should I allocate the entire curl return to a variable and then run your grep command on that variable? Could you expand slightly please? – andy Feb 25 '15 at 07:16
  • Your question is not how to parse curl output, but how to parse HTML. As explained by @that other guy, regex (grep) is generally not suited to parsing arbitrary HTML (though it is sometimes appropriate for a limited, known set of HTML). Depending on the context, you should consider requesting another URL that returns a more structured format such as XML or JSON. – mcoolive Feb 25 '15 at 10:37

2 Answers


You could try the below.

    $ val=$(curl somelink | grep -oP "name='sessionid'[^<>]*\bvalue\s*=\s*'\K[^']*")
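
To see what that pipeline does without hitting the network, here is a minimal sketch that feeds the relevant lines of the page from the question through the same grep. The function name `sample_page` is hypothetical and only stands in for `curl somelink`; the `-P` option assumes GNU grep built with PCRE support.

```shell
#!/bin/sh
# Hypothetical stand-in for "curl somelink": prints part of the page from the question.
sample_page() {
  cat <<'EOF'
<form method='post' action='http://www.ukipvpn.com'>
<center><input type='hidden' name='sessionid' value='4b5q43mhhgl95nsa9v9lg8kac7'></center><br>
<center><input id='button' type='submit' value='  I AGREE  ' /><br><br></center>
</form>
EOF
}

# -o prints only the matched text; -P enables Perl-style regex (GNU grep).
# \K drops everything matched so far, so only the text between the quotes is printed.
val=$(sample_page | grep -oP "name='sessionid'[^<>]*\bvalue\s*=\s*'\K[^']*")

echo "$val"
```

There is no need to store the whole page in a variable first; the output of curl can be piped straight into grep, and only the extracted string ends up in `val`. The submit button also has a `value` attribute, which is why the pattern is anchored on `name='sessionid'`.
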
Avinash Raj

There are some arguments against using regex to parse HTML.

Here's a more robust XPath based version using tidy and xmlstarlet:

    var=$(curl someurl |
      tidy -asxml 2> /dev/null |
      xmlstarlet sel -t -v '//_:input[@name="sessionid"]/@value' 2> /dev/null)
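
Because the pipeline above sends stderr to /dev/null, a failure in either tool simply leaves the variable empty. A minimal sketch of a guard follows; the helper name `is_valid_session_id` is hypothetical, and the check assumes session IDs consist only of letters and digits, as in the sample page.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the argument looks like a session ID.
# Assumption: IDs are non-empty and purely alphanumeric, as in the sample page.
is_valid_session_id() {
  case $1 in
    '' | *[![:alnum:]]*) return 1 ;;  # empty, or contains a non-alphanumeric character
    *) return 0 ;;
  esac
}

var='4b5q43mhhgl95nsa9v9lg8kac7'  # stand-in for the value produced by the pipeline

if is_valid_session_id "$var"; then
  echo "sessionid: $var"
else
  echo "could not extract sessionid" >&2
fi
```
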
that other guy