python Get a link from string

Question

I need to use a Python script to take an email, find a link in it, and then use that link to send a request to a server containing that verification link, so that it verifies an account. How would I use Python to take the

https://www.boomlings.com/database/accounts/activate.php?uid=8722046actcode=xLCReGjLdkWmINt1GY9e

out of

{'Sender': 'Geometry Dash', 'Subject': 'Please activate your account.', 'body': b'<style type="text/css">\n#google_translate_element{\n  float: right;\n  padding:0 0 10px 10px;\n}\n/* twitter do\xc4\x9frulama linki fix */\n.bulletproof-btn-1 a {\n  font-size: 20px!important;\n  color: #fff!important;\n  padding: 20px!important;\n  line-height: 33px!important;\n  text-decoration: none!important;\n}\n</style>\n<div id="google_translate_element"></div><script type="text/javascript">\nfunction googleTranslateElementInit() {\n  new google.translate.TranslateElement({pageLanguage: \'en\', layout: google.translate.TranslateElement.InlineLayout.SIMPLE, autoDisplay: false, multilanguagePage: true}, \'google_translate_element\');\n}\n</script><script type="text/javascript" src="//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit"></script>\n\r\n\r\n<html>\r\n<head>\r\n\t<title></title>\r\n</head>\r\n<body>\r\n<p>Thank you for registering a Geometry Dash account</p>\r\n\r\n<p>Your account information:<br />\r\nUsername:&nbsp; SUKAFUTCUCK</p>\r\n\r\n<p>Please click the link below to activate your account:<br />\r\n<a href="http://www.boomlings.com/database/accounts/activate.php?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e" target="_blank">Click\r\nHere</a></p>\r\n\r\n<p>Please contact support@robtopgames.com if you have any questions or\r\nneed assistance.</p>\r\n\r\n<p>If you did not send an account request using this email, then you\r\ncan safely disregard this message and nothing will happen.</p>\r\n\r\n<p>Regards,<br />\r\nRobTop Games</p>\r\n</body>\r\n</html>\r\n\r\n\r\n'}

The link will be different in different emails so I need something that can do this.

https://www.boomlings.com/database/accounts/activate.php?uid=*actcode=*

Where the * means that a string of any length can go there, because each email will have a different activate.php code.

Mauricio Cortazar · Answer 1 · 2018-03-04T05:17:39.647

2

You can use regex for that with something like:

import re
# Capture the href value of the first <a> tag; group(1) is the URL without the surrounding markup
c = re.search(r'<a href="(.*?)"', yourDict["body"].decode("utf-8"))
print(c.group(1))

but it is much better to use a package like parsel, because then you extract from the HTML with XPath instead of regex.

EDIT

I used the regular expression because it is the shortest and fastest way, with no need to download a package, but if the response changes drastically I recommend parsel. Example:

from parsel import Selector
sel = Selector(text=yourDict["body"].decode("utf-8"))
url = sel.xpath('//a[@target="_blank"]/@href').extract_first()
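
If installing a package is not an option, the standard library's html.parser can also pull the href out without regex. This is only a minimal sketch: HrefCollector and first_link are hypothetical names, and sample_body is a shortened stand-in for the email body in the question.

```python
from html.parser import HTMLParser
from typing import Optional


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def first_link(body: bytes) -> Optional[str]:
    """Return the first href found in the HTML body, or None if there is none."""
    parser = HrefCollector()
    parser.feed(body.decode("utf-8"))
    return parser.hrefs[0] if parser.hrefs else None


# Shortened stand-in for the email body from the question
sample_body = (
    b'<p>Please click the link below:<br />\r\n'
    b'<a href="http://www.boomlings.com/database/accounts/activate.php'
    b'?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e" target="_blank">Click\r\nHere</a></p>'
)
print(first_link(sample_body))
```

With the real message, first_link(yourDict["body"]) would be the equivalent call.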

edited Mar 04 '18 at 05:17

answered Mar 04 '18 at 04:27

Sorry to ask for more but I guess that wasn't the problem I had. I kept getting an error saying inconsistent use of tabs and spaces in indentation. Here is my little part. while 1: result = m.mailBox() if result: c = re.search(" – LoopTurn Mar 04 '18 at 04:47
@LoopTurn well, as it says, your indentation is wrong; check that you are not mixing tabs and spaces. Some editors, like Sublime, insert spaces when you press the Tab key, so watch out for that – Mauricio Cortazar Mar 04 '18 at 04:50
Assuming that you're using the Python IDE, press **Alt+6** and then, from the pop-up window, untabify the whole region, replacing tabs with spaces. – Ubdus Samad Mar 04 '18 at 04:57
@MauricioCortazar Isn't the regex a bit too broad if the only thing OP thinks will change in the url is where these asterisks are, `uid=*actcode=*` ? – G_M Mar 04 '18 at 05:17
@DeliriousLettuce you're right, but he asked for the whole URL in the question; anyway, OP will use the entire URL, not just the parameters – Mauricio Cortazar Mar 04 '18 at 05:19
@MauricioCortazar You could make the regex more specific and return the whole thing in a group so I'm not exactly sure what you mean? – G_M Mar 04 '18 at 05:21

G_M · Answer 2 · 2018-03-04T14:55:21.630

1

Assuming that dict from your description is now in a variable named d (it was just a bit long to put in here):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(d['body'], 'lxml')
>>> link = soup.find('a', target='_blank')
>>> link['href']
'http://www.boomlings.com/database/accounts/activate.php?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e'

BeautifulSoup docs
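
The question also asks how to visit the link once it has been extracted. A minimal sketch using only the standard library, assuming the server activates the account on a plain GET request (nothing in this thread confirms that); activate is a hypothetical helper name, and the opener parameter exists only so the function can be tried without network access.

```python
import urllib.request


def activate(url, opener=urllib.request.urlopen, timeout=10):
    """Send a GET request to the activation URL and return (status, body text).

    opener defaults to urllib.request.urlopen; it is injectable so the
    function can be exercised without touching the network.
    """
    with opener(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


# Usage (performs a real request, so it is left commented out):
# status, text = activate(link['href'])
# print(status, text[:200])
```

Checking the returned status and body is worthwhile, since the server's actual response to a scripted request is not known here.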

edited Mar 04 '18 at 14:55

answered Mar 04 '18 at 04:55

Please add the [bs4 documentation link](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) in your answer. It would help others. – Keyur Potdar Mar 04 '18 at 06:36

sonus21 · Answer 3 · 2018-03-04T06:13:39.667

0

The email could be in HTML or plain-text format. If it's in HTML format, use a library like bs4 or pyquery.

If it's plain text, search for the URL using the following regex:

regex = ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

Refer: http://www.ietf.org/rfc/rfc3986.txt

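Use the re module to search the string:

import re
# RFC 3986 Appendix B pattern: splits one URI string into scheme, authority, path, query and fragment
regex = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
text = "http://www.boomlings.com/database/accounts/activate.php?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e"  # example input: a single URL
urls = re.findall(regex, text)
print(urls)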
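
For a string that is already a single URL, the standard library's urllib.parse gives the same kind of decomposition without a hand-written pattern. A short sketch, using the activation URL from the question as example input:

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.boomlings.com/database/accounts/activate.php?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e"

parts = urlsplit(url)           # scheme, netloc, path, query, fragment
params = parse_qs(parts.query)  # maps each query key to a list of values

print(parts.scheme, parts.netloc, parts.path)
print(params["uid"][0], params["actcode"][0])
```

This only parses a URL that has already been isolated; it does not locate URLs inside a larger body of text.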

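Use the pyquery module:

from pyquery import PyQuery as pq
q = pq(text)  # text: the decoded HTML body of the email
a_list = q("a")
# Iterating a PyQuery object yields lxml elements, so read the attribute with .get()
urls = [a.get("href") for a in a_list]
print(urls)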

EDIT:

Instead of a generic URL pattern, we can use one specific to this URL, for example https?:\/\/www\.boomlings\.com\/database\/accounts\/activate\.php\?uid=.*&actcode=.*

https://ideone.com/NFj90L
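
Because `.*` is greedy, the specific pattern above would, on the sample HTML body, also swallow the rest of the line after the URL. A tighter sketch is shown here, under the assumption (taken only from the sample email) that uid is all digits and actcode is alphanumeric; ACTIVATION_RE and find_activation_url are hypothetical names.

```python
import re

# Assumptions: uid is all digits and actcode is alphanumeric, as in the sample email.
# (?:amp;)? also accepts the HTML-escaped form "&amp;actcode=".
ACTIVATION_RE = re.compile(
    r"https?://www\.boomlings\.com/database/accounts/activate\.php"
    r"\?uid=\d+&(?:amp;)?actcode=\w+"
)


def find_activation_url(body):
    """Return the first activation URL in the raw email body (bytes), or None."""
    match = ACTIVATION_RE.search(body.decode("utf-8"))
    if match is None:
        return None
    # Normalise an escaped ampersand back to a plain one
    return match.group(0).replace("&amp;", "&")


# Shortened stand-in for the email body from the question
sample_body = (
    b'<a href="http://www.boomlings.com/database/accounts/activate.php'
    b'?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e" target="_blank">Click\r\nHere</a>'
)
print(find_activation_url(sample_body))
```

If the activation codes can contain other characters, the `\w+` part would need to be widened accordingly.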

edited Mar 04 '18 at 06:13

answered Mar 04 '18 at 04:59

@DeliriousLettuce Not sure why it's ridiculous; any regex can be used here. This is a generic solution. – sonus21 Mar 04 '18 at 05:12
your regex is right but this isn't the case to use that one when you have a shorter way to do it – Mauricio Cortazar Mar 04 '18 at 05:12
@SonuKumar I'm not sure if you read the question but the only part of the url that OP seems to think will change is where the asterisks are `uid=*actcode=*`. This regex would match a ton of urls that OP doesn't seem to be looking for at all. – G_M Mar 04 '18 at 05:13
