Web Scraping - How to get a specific part of a weblink

Question

I have the following link: https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk

I have multiple links in a dataset, and each link follows the same pattern. I want to extract a specific part of each link: the text starting at the second http and ending just before the first + sign. For the link above, that would be https://cooking.nytimes.com/learn-to-cook.

I don't know how to do this using regex. I am working in Python. Kindly help me out.

Fernando Irarrázaval G · Accepted Answer · 2017-04-15T17:40:40.140

If each link follows the same pattern, you do not need regex. You can use str.find() and string slicing:

link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"

# Find the second occurrence of "https://" by starting the search just after the first one
second_https = link.find("https://", link.find("https://") + 1)
# Index of the first "+" after that point, which marks the end of the target link
end_of_link = link.find("+", second_https)

new_link = link[second_https:end_of_link]

print(new_link)

This prints https://cooking.nytimes.com/learn-to-cook. It works as long as every link follows the pattern described: the target is the second https:// in the link and ends at a + sign.
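
Since the question mentions a whole dataset of links, the same idea can be wrapped in a function and applied to each one. The following is only a minimal sketch: the function name extract_cached_url, its parameters, and the second sample link (example.com) are made up for illustration. It returns None when a link does not contain a second https://, because str.find() returns -1 in that case and slicing with -1 would silently give a wrong result.

```python
def extract_cached_url(link, scheme="https://", terminator="+"):
    """Return the text from the second `scheme` up to the next `terminator`, or None."""
    first = link.find(scheme)
    if first == -1:
        return None  # no scheme at all
    second = link.find(scheme, first + 1)
    if second == -1:
        return None  # only one scheme, so the pattern does not apply
    end = link.find(terminator, second)
    # If there is no terminator, keep everything up to the end of the string.
    return link[second:] if end == -1 else link[second:end]


links = [
    "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk",
    "https://example.com/no-second-link",  # hypothetical link that breaks the pattern
]
results = [extract_cached_url(link) for link in links]
print(results)
```

For the two sample links this should give the cooking.nytimes.com URL for the first and None for the second, so links that do not match can be filtered out afterwards.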

score 0 · Answer 2 · answered Apr 15 '17 at 18:18

I'd go with urlparse (Python 2) or urllib.parse (Python 3) and a little bit of regex:

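import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
# Split the URL into components; the cached link lives in the query string
parsed = urlparse(url_example)
# Take everything from the first "http" in the query, then cut at the first "+"
result = re.findall(r'https?://.*', parsed.query)[0].split('+')[0]
print(result)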

Output:

https://cooking.nytimes.com/learn-to-cook
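
If you would rather do it with a single regex, as the question asks, something like the following might work. This is a sketch under the assumption that every link has the shape shown in the question. The names PATTERN and second_url are just illustrative. The pattern lazily skips past the first http(s):// and captures from the second one up to, but not including, the first + sign.

```python
import re

# Skip the first scheme lazily, then capture the second URL up to the first "+".
PATTERN = re.compile(r"https?://.*?(https?://[^+]+)")


def second_url(link):
    """Return the second URL embedded in `link`, or None if there is none."""
    match = PATTERN.search(link)
    return match.group(1) if match else None


link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
print(second_url(link))
```

Returning None for non-matching links makes it easy to spot rows in the dataset that do not follow the pattern.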
