regex python simple findall start and end points known

Question

import bs4
from urllib.request import urlopen
import re
import os
html=urlopen('https://www.flickr.com/search/?text=dog')
soup=bs4.BeautifulSoup(html,'html.parser')
print(soup.title)
x=soup.text
y=[]
for i in re.findall('c1.staticflickr.com\.jpg',x):
    print(i)

I know the image URLs start with c1.staticflickr.com and end with .jpg. How can I print each image link? (I am a bit rusty on regex; I tried adding a few things, but they didn't work.)

You are essentially trying to parse HTML with regex, which is a big no-no. You already know you are searching for images, so why not use `BeautifulSoup` (which you are already using) to find all `img` tags? — DeepSpace, Dec 03 '18 at 11:21
@DeepSpace The tag has a lot of other stuff as well (style, height, and more). — wishmaster, Dec 03 '18 at 11:22
You should escape the dots in `c1.staticflickr.com\.jpg` with backslashes. — vrintle, Dec 03 '18 at 11:23
@timmy After `BeautifulSoup` finds all the relevant tags you can then use only the stuff you care about — DeepSpace, Dec 03 '18 at 11:23
Did you mean `re.findall('c1\.staticflickr\.com/.*\.jpg',x)`? Better, use something like `[^"]*` instead of `.*`. Even better, don't use regex for this at all. — tobias_k, Dec 03 '18 at 11:26
This answer may be useful: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 — Jean-François Corbett, Dec 04 '18 at 14:38
@Jean-FrançoisCorbett With all due respect, that answer may have been true at the time, but nowadays almost all sites use JavaScript (jQuery, Google), and with JavaScript, at some point during parsing you must use regex to get exactly what you are looking for. (Of course you can use JSON in the end to get it, but that slows down the crawling and makes it a bit harder.) — wishmaster, Dec 04 '18 at 17:51

KC. · Accepted Answer · 2018-12-04T14:34:12.860

You have two ways to get what you want. Regex seems the better choice here because the URLs have a canonical format. Extracting them with bs4 is a bit more involved, since the URLs sit inside each element's `style` attribute.

import bs4
import requests
import re

resp = requests.get('https://www.flickr.com/search/?text=dog')
html = resp.text

# Approach 1: run the regex over the raw HTML.
# The lazy .*? stops at the first ".jpg" after each host name.
result = re.findall(r'c1\.staticflickr\.com/.*?\.jpg', html)
print(len(result))
print(result[:5])

# Approach 2: let bs4 find the photo <div>s, then pull the URL out of each style attribute.
soup = bs4.BeautifulSoup(html, 'html.parser')
result2 = [re.findall(r'c1\.staticflickr\.com/.*?\.jpg', ele.get("style"))[0]
           for ele in soup.find_all("div", class_="view photo-list-photo-view requiredToShowOnServer awake")]
print(len(result2))
print(result2[:5])
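
To see why the lazy `.*?` (or a negated character class, as suggested in the comments) matters, here is a minimal offline sketch. The sample string is invented for illustration and only imitates inline `style` attributes; its path segments are hypothetical, not real Flickr markup.

```python
import re

# Hypothetical markup imitating two inline style attributes (not real Flickr HTML).
sample = (
    '<div style="background-image: url(//c1.staticflickr.com/1/2/aaa_m.jpg)"></div>'
    '<div style="background-image: url(//c1.staticflickr.com/3/4/bbb_n.jpg)"></div>'
)

# Greedy: .* runs on to the LAST ".jpg", so both URLs end up in a single match.
greedy = re.findall(r'c1\.staticflickr\.com/.*\.jpg', sample)

# Lazy: .*? stops at the FIRST ".jpg" after each host name.
lazy = re.findall(r'c1\.staticflickr\.com/.*?\.jpg', sample)

# Negated class: cannot cross a quote, parenthesis, or whitespace,
# so a match never leaks out of the attribute it started in.
bounded = re.findall(r'c1\.staticflickr\.com/[^"()\s]*\.jpg', sample)

print(len(greedy))
print(lazy)
print(bounded)
```

On this sample the lazy and bounded patterns find the same two URLs, while the greedy one merges them into one long match. The bounded form is the more defensive choice when the surrounding text is not under your control.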

Edit: you can get extra information through this special URL instead of using Selenium. I did not check whether it also returns the results shown on page one.

import requests

url = "https://api.flickr.com/services/rest?sort=relevance&parse_tags=1&content_type=7&extras=can_comment,count_comments,count_faves,description,isfavorite,license,media,needs_interstitial,owner_name,path_alias,realname,rotation,url_c,url_l,url_m,url_n,url_q,url_s,url_sq,url_t,url_z&per_page={per_page}&page={page}&lang=en-US&text=dog&viewerNSID=&method=flickr.photos.search&csrf=&api_key=352afce50294ba9bab904b586b1b4bbd&format=json&hermes=1&hermesClient=1&reqId=c1148a88&nojsoncallback=1"

with requests.Session() as s:
    #resp = s.get(url.format(per_page=100,page=1))
    resp2 = s.get(url.format(per_page=100,page=2))

    for each in resp2.json().get("photos").get("photo")[:5]:
        print(each.get("url_n_cdn"))
        print(each.get("url_m"))  # the JSON contains more URL types: url_q, url_s, url_sq, url_t, url_z
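
The long query string above can also be assembled from a dictionary, which makes paging easier to read. This is only a sketch: `build_search_url` and `photo_urls` are hypothetical helper names, `YOUR_API_KEY` is a placeholder, and the parameter subset is copied from the URL above without verifying that it is sufficient on its own. The sample payload is invented and merely mirrors the `photos` / `photo` nesting that the code above reads.

```python
from urllib.parse import urlencode

BASE = "https://api.flickr.com/services/rest"

def build_search_url(text, page, per_page=100, api_key="YOUR_API_KEY"):
    """Hypothetical helper: build the request URL for one page of results."""
    params = {
        "method": "flickr.photos.search",
        "text": text,
        "extras": "url_m,url_n",   # subset of the extras in the URL above
        "per_page": per_page,
        "page": page,
        "api_key": api_key,        # placeholder, supply your own key
        "format": "json",
        "nojsoncallback": 1,
    }
    return BASE + "?" + urlencode(params)

def photo_urls(payload, key="url_m"):
    """Collect one URL type from a decoded response, skipping photos that lack it."""
    photos = payload.get("photos", {}).get("photo", [])
    return [p[key] for p in photos if p.get(key)]

# Invented payload that mimics the nesting of the real response.
sample_payload = {"photos": {"photo": [
    {"id": "1", "url_m": "https://example.com/1_m.jpg"},
    {"id": "2"},  # no url_m, so it is skipped
    {"id": "3", "url_m": "https://example.com/3_m.jpg"},
]}}

print(build_search_url("dog", page=2))
print(photo_urls(sample_payload))
```

To walk through several pages, call `build_search_url` in a loop with increasing `page` values and pass each decoded response to `photo_urls`.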

Thanks, this works. I'm not sure why you limit the number of images to 5, though. Also, if you notice, the site uses JavaScript (Google Analytics), so more images appear when you scroll down. Is there a way to handle this and keep getting the images? — wishmaster, Dec 04 '18 at 13:56
Because the full list is too long; I only print 5 to verify that the two results are the same. As for scrolling down, it is easier than you might think (I will add it). — KC., Dec 04 '18 at 14:09
@timmy Check my edit; I found there are several kinds of URL. I am not sure which one is best (the JSON is long and I have not had time to go through all of it). — KC., Dec 04 '18 at 14:36
