Need help understanding Python snippet with regex and cURL

Question

EDIT: I have added the entire cURL function for reference, but what I need help with is the if statements that use regex.

I'm looking for help understanding the if statements in this cURL function. I've read through some Python documentation and I understand each of the pieces, namely that this is searching with regex and replacing. I'm just hoping someone can give a bigger-picture explanation. In particular, I don't really understand the .groups() call.

To give a little more background: this script accesses another site via cURL and stores a cookie. When run, it checks whether the cookie is still valid; if not, it grabs a new one after posting the username/password. The site recently changed and I'm trying to figure out what I need to change to get this working again.

#get auth cookie for sso
def getAuthCookie( self ):
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.SSL_VERIFYPEER, False)
    c.setopt(c.FOLLOWLOCATION, True)
    c.setopt(c.TIMEOUT, 60)
    c.setopt(c.USERPWD, self.user+":"+cred.getpasswd( self.encPasswd ) )
    c.setopt(c.URL, 'https://sso.sample.com')
    c.setopt(c.COOKIEJAR, self.cookieDir)
    c.setopt(c.COOKIEFILE, self.cookieDir )
    c.setopt(c.WRITEFUNCTION, buffer.write)
    c.perform()
    c.unsetopt(c.USERPWD)
    c.setopt(c.URL, 'https://sample.com')
    c.perform()
    html = str(buffer.getvalue())    

----------------------------------------------------------
if "RelayState" in html:
    rex = re.compile( "input type=\"hidden\" name=\"RelayState\" value=\"(.*)\"" )
    RELAY = rex.search( html ).groups()[0]
if "SAMLResponse" in html:
    rex = re.compile( "input type=\"hidden\" name=\"SAMLResponse\" value=\"(.*)\"" )
    SAML =  rex.search( html ).groups()[0]
    datastuff = {'SAMLResponse':SAML,'RelayState':RELAY,'redirect':'Redirect','show_button':'true'}
if "form method=\"POST\" action=" in html:
    rex = re.compile( "form method=\"POST\" action=\"(.*)\" " )
    postUrl = rex.search( html ).groups()[0]
---------------------------------------------------------- 

#post our saml obtained, get to our final dest
    c.setopt(c.URL, postUrl )
    c.setopt(c.POST, True)
    c.setopt(c.POSTFIELDS, urlencode( datastuff ))
    c.perform()
    c.close()

I would not be too worried about not understanding it; I would be more worried about parsing HTML with a regex. A proper parser will be of more benefit: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. If you share the link, I would bet bs4 and requests can achieve what you want quite easily — Padraic Cunningham, May 24 '16 at 00:26
`RELAY = rex.search( html ).groups()[0]` - `RELAY` now contains the first captured group of the regex match. Also, check what @PadraicCunningham said... You may want to use BeautifulSoup -> https://www.crummy.com/software/BeautifulSoup/bs4/doc/ — Pedro Lobito, May 24 '16 at 00:31
Thanks for your response @pedroLobito. I just added more details. It looks like that HTML parser is the way to go, but I'm still trying to understand how this script was managing to authenticate with the site in the first place — Ryan Litwiller, May 24 '16 at 00:53
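
For reference, the parser-based approach suggested in the comments can be sketched roughly as follows. This is only an illustration: it swaps BeautifulSoup for the standard library's html.parser to avoid a third-party dependency, and the class name SamlFormParser and the sample page are hypothetical, not taken from the real site.

```python
from html.parser import HTMLParser


class SamlFormParser(HTMLParser):
    """Hypothetical helper: collect the POST form action and hidden inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and (attrs.get("method") or "").upper() == "POST":
            # Remember where the form wants to be submitted.
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            # Store each hidden field by name, e.g. RelayState, SAMLResponse.
            self.hidden[attrs["name"]] = attrs.get("value") or ""


# Made-up page for illustration; the real SSO page will differ.
sample_html = (
    '<form method="POST" action="https://sample.com/acs" >\n'
    '<input type="hidden" name="RelayState" value="abc123"/>\n'
    '<input type="hidden" name="SAMLResponse" value="PHNhbWw+"/>\n'
    '</form>'
)

parser = SamlFormParser()
parser.feed(sample_html)
post_url = parser.action
fields = parser.hidden
```

Unlike the regexes, this does not depend on the exact attribute order or spacing in the page, which is usually what breaks when a site changes.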

score 1 · Accepted Answer · answered Jun 11 '16 at 13:26

See the comments I have injected in the code:

#get auth cookie for sso
def getAuthCookie( self ):
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.SSL_VERIFYPEER, False)
    c.setopt(c.FOLLOWLOCATION, True)
    c.setopt(c.TIMEOUT, 60)
    c.setopt(c.USERPWD, self.user+":"+cred.getpasswd( self.encPasswd ) )
    # curl sso.sample.com, which I assume prompts for a login; curl answers it with the USERPWD credentials set above
    c.setopt(c.URL, 'https://sso.sample.com')
    # save the cookie to cookieDir
    c.setopt(c.COOKIEJAR, self.cookieDir)
    c.setopt(c.COOKIEFILE, self.cookieDir )
    c.setopt(c.WRITEFUNCTION, buffer.write)
    # perform the request using all the options set above
    c.perform()
    c.unsetopt(c.USERPWD)
    # curl new site sample.com
    c.setopt(c.URL, 'https://sample.com')
    c.perform()
    # save output as html var
    html = str(buffer.getvalue())    

----------------------------------------------------------
# The following three if statements all follow the same pattern:
# if "some string" is found in the html variable, run the indented lines that follow
if "RelayState" in html:
    # set up a regex that looks for: input type="hidden" name="RelayState" value="..."
    # the parentheses capture whatever is inside the value quotes; that becomes the RELAY var
    rex = re.compile( "input type=\"hidden\" name=\"RelayState\" value=\"(.*)\"" )
    # run the regex against the html var; .groups() returns a tuple of the captured groups and [0] takes the first one
    RELAY = rex.search( html ).groups()[0]
if "SAMLResponse" in html:
    rex = re.compile( "input type=\"hidden\" name=\"SAMLResponse\" value=\"(.*)\"" )
    # same thing here: capture the value as SAML
    SAML =  rex.search( html ).groups()[0]
    # construct a dict from fixed strings and the values captured above
    datastuff = {'SAMLResponse':SAML,'RelayState':RELAY,'redirect':'Redirect','show_button':'true'}
if "form method=\"POST\" action=" in html:
    rex = re.compile( "form method=\"POST\" action=\"(.*)\" " )
    # again, capture the value inside action="..." as postUrl
    postUrl = rex.search( html ).groups()[0]
---------------------------------------------------------- 

#post our saml obtained, get to our final dest
    c.setopt(c.URL, postUrl ) # setup curl with url found above
    c.setopt(c.POST, True) # use post method
    c.setopt(c.POSTFIELDS, urlencode( datastuff )) # post the fields built above from the captured values
    c.perform()
    c.close()
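
To see what .groups() returns without touching the real site, the same pattern can be run against a small made-up page. This is a minimal sketch: the sample HTML, the first_group helper, and the values are assumptions for illustration only.

```python
import re

# Hypothetical sample page; the real site's HTML may differ.
sample_html = (
    '<form method="POST" action="https://sample.com/acs" >\n'
    '<input type="hidden" name="RelayState" value="abc123"/>\n'
    '<input type="hidden" name="SAMLResponse" value="PHNhbWw+"/>\n'
    '</form>'
)

rex = re.compile('input type="hidden" name="RelayState" value="(.*)"')
match = rex.search(sample_html)   # a match object, or None if nothing matched

groups = match.groups()           # tuple with one entry per (...) in the pattern
relay = groups[0]                 # same value as match.group(1)


def first_group(pattern, text):
    """Hypothetical helper: return the first captured group, or None if no match."""
    found = re.search(pattern, text)
    return found.group(1) if found else None


# (.*) is greedy: with a second attribute on the same line it runs to the last quote.
tricky = '<input type="hidden" name="RelayState" value="abc123" id="x"/>'
greedy = first_group(r'name="RelayState" value="(.*)"', tricky)
strict = first_group(r'name="RelayState" value="([^"]*)"', tricky)
```

Here groups is a one-element tuple because the pattern has a single pair of parentheses. If the site added an attribute after value, the greedy (.*) would capture too much, while ([^"]*) stops at the first closing quote.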

If something changed and you are now getting an error, I would try adding print html right after html = str(buffer.getvalue()) to see whether you are still hitting the page where the regexes expect to find their matches.
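
A slightly more targeted version of that check is to report which of the three markers are present in the page. If a marker is missing, its if block is skipped and a later line fails with a NameError (for example on RELAY or postUrl); if the marker is present but the regex no longer matches, rex.search returns None and .groups() raises an AttributeError. The sketch below assumes Python 3, and check_markers and the stand-in bytes are hypothetical.

```python
# Markers copied from the three if statements in the script.
MARKERS = ("RelayState", "SAMLResponse", 'form method="POST" action=')


def check_markers(html):
    """Hypothetical helper: map each marker to True/False for this page."""
    return {marker: marker in html for marker in MARKERS}


# Stand-in bytes for buffer.getvalue(); the real page will differ.
raw = b'<form method="POST" action="https://sample.com/acs" >\n<p>Sign in</p>'

# Decode rather than str(): on Python 3, str() of bytes gives the "b'...'" repr.
html = raw.decode("utf-8", errors="replace")

status = check_markers(html)
for marker, present in status.items():
    print(marker, "found" if present else "MISSING")
```

Any marker reported as MISSING points at the part of the login flow that changed on the site.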

thanks for breaking this down, the extra comments really helped — Ryan Litwiller, Jun 11 '16 at 14:05
