
I'm writing a short Python script that finds all the URLs pointing to pictures hosted on Photobucket in a phpBB forum database dump and passes them to a download manager (in my case Free Download Manager), so that I can save the images on my local computer and then move them to another host (Photobucket now asks for a yearly subscription to embed pictures hosted on its servers in other sites). I've managed to find all the pictures using a regex with lookarounds. When I tested the regex in two text editors with regex search support, I found what I wanted, but in my script it gives me trouble.

import re
import os

main_path = input("Enter a path to the input file:")
with open(main_path, 'r', encoding="utf8") as file:
    file_cont = file.read()
pattern = re.compile(r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))')
findings = pattern.findall(file_cont)
for finding in findings:
    print(finding)
os.system("pause")

I tried to debug it by removing the download part and printing all the matches, and I get a long list of ('', '"">') instead of URLs similar to this one: http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg. Where am I going wrong?

shA.t
Emiliano S.
  • Python's regex engine is probably different from theirs. I'd recommend testing it with [regex101](http://www.regex101.com), which you can switch to Python mode – TemporalWolf Aug 27 '17 at 10:28
  • You're right: it worked in the other testing systems, but regex101 in Python mode failed to match the strings. I will use it in the future. – Emiliano S. Aug 27 '17 at 13:53

2 Answers


Your regex pattern is not right.

I'm not sure what you were trying to do, and I would advise you to use BeautifulSoup instead of playing with regex if you need to parse HTML (regex cannot really parse HTML).
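To give a rough idea of the parser-based approach: BeautifulSoup is a third-party package, so the sketch below swaps in the standard library's html.parser instead, which needs no install. The class name PhotobucketImgCollector and the sample markup are made up for illustration, and the sample uses plain HTML quoting rather than the doubled quotes of the dump.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PhotobucketImgCollector(HTMLParser):
    """Hypothetical helper: collects src URLs of <img> tags hosted on photobucket.com."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <IMG> arrives as "img".
        # Self-closing tags (<img ... />) are routed here as well by default.
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        host = urlparse(src).hostname or ""
        if host.endswith(".photobucket.com"):
            self.urls.append(src)


# Made-up sample markup, not taken from the real dump.
sample = (
    '<p>Some post text</p>'
    '<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>'
    '<img src="http://example.com/other.png">'
)

collector = PhotobucketImgCollector()
collector.feed(sample)
collector.close()
print(collector.urls)
```

The host check replaces the i[0-9]{3} part of the regex with a looser "any photobucket.com subdomain" test; tighten it if only the iNNN hosts should be collected.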


But anyway, with regex, this should work:

r'<IMG src=\"(https?:\/\/i[0-9]{3}\.photobucket\.com\/albums[^\"]+)\"[^>]+\/>'

The https?:\/\/i[0-9]{3}\.photobucket\.com\/albums part filters out non-Photobucket images; [^\"]+ is more generic and simply captures everything up to the closing " of the attribute.

Example:

<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>

This gives, at .group(1):

http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg
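
As a quick check, the pattern can be run against the example tag in Python. This is only a sketch: the variable names are illustrative, and the sample is the single tag shown above rather than real dump content.

```python
import re

# Pattern from this answer, unchanged.
pattern = re.compile(
    r'<IMG src=\"(https?:\/\/i[0-9]{3}\.photobucket\.com\/albums[^\"]+)\"[^>]+\/>'
)

sample = '<IMG src="http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg" foo="bar"/>'

match = pattern.search(sample)
first_url = match.group(1) if match else None

# With exactly one capturing group, findall returns that group's text for each match.
all_urls = pattern.findall(sample)

print(first_url)
print(all_urls)
```

Note that [^>]+ requires at least one character between the closing quote and />, so a tag with nothing after the src attribute would not match this pattern as written.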
Arount

I think the version of your regex below should work.
Note that I use \" instead of "",
I replace img src with img.+src to also support img alt="" src,
instead of [^\/]* I use [^\/]+ so that empty path segments are not accepted,
for the last part of the URL I also exclude ",
and instead of requiring > immediately after " I allow optional other characters after " with .*.

(?!(<img.+src=\"))http:\/\/i\d{3}\.photobucket\.com\/albums\/[^\/]+\/[^\/]+\/[^\/\"]+(?=\".*/>)
                                                                                   ^^       ^^^

You can use \d\d\d, [0-9]{3}, or \d{3} instead of [0-9][0-9][0-9].

[Regex Demo]
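
A likely reason for the ('', '"">') tuples in the question is that, in Python, re.findall returns the captured groups instead of the whole match whenever the pattern contains capturing groups, and both lookarounds in the original pattern wrap their contents in (...). The sketch below shows the effect on a made-up one-line sample in the dump's doubled-quote style; the variable names and the group-free variant are assumptions for illustration, not part of either answer.

```python
import re

# Made-up sample in the doubled-quote style of the SQL dump.
sample = '<IMG src=""http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg"">'

# The pattern from the question, unchanged: it contains two capturing groups.
original = re.compile(
    r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com'
    r'\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))'
)

# findall yields one tuple of groups per match, not the matched text.
groups_only = original.findall(sample)

# finditer gives match objects; group(0) is the whole matched URL.
whole_matches = [m.group(0) for m in original.finditer(sample)]

# Variant without capturing groups, so findall returns the URLs directly.
# The leading lookahead is dropped and '"' is excluded from path segments.
group_free = re.compile(
    r'http://i[0-9]{3}\.photobucket\.com/albums/[^/"]+/[^/"]+/[^/"]+(?="">)'
)
urls = group_free.findall(sample)

print(groups_only)
print(whole_matches)
print(urls)
```

Either route (finditer with group(0), or a pattern with no capturing groups) keeps the rest of the script's loop unchanged.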

shA.t