I'm writing a short Python script that finds all the URLs pointing to pictures hosted on Photobucket in a phpBB forum database dump and passes them to a download manager (in my case Free Download Manager), so that I can save the images on my local computer and then move them to another host (Photobucket now asks for a yearly subscription to embed pictures hosted on its servers in other sites). I've managed to find all the pictures using a regex with lookarounds. When I tested the regex in two text editors with regex search support, I found what I wanted, but in my script it gives me trouble.
import re
import os

main_path = input("Enter a path to the input file:")
with open(main_path, 'r', encoding="utf8") as file:
    file_cont = file.read()

pattern = re.compile(r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))')
findings = pattern.findall(file_cont)
for finding in findings:
    print(finding)
os.system("pause")
I tried to debug it by removing the download part and printing all the matches, and I get a long list of ('', '"">') instead of URLs similar to this one: http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg
Where am I going wrong?
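A minimal sketch of what seems to be going on, assuming the dump contains lines shaped like the hypothetical `sample` below (built from the example URL above). When a pattern contains capturing groups, `re.findall` returns the groups rather than the whole match, and the parentheses inside the two lookarounds are capturing groups. The second pattern is only one possible rewrite: it drops those parentheses, uses a lookbehind `(?<=...)` in place of the negative lookahead `(?!...)` (on the assumption that a lookbehind was what was intended), and excludes `"` from the last path segment.

```python
import re

# Hypothetical sample line modelled on the URL format in the question;
# the doubled quotes mimic the escaping seen in the dump.
sample = '<IMG src=""http://i774.photobucket.com/albums/myalbum/Emi998/mypicture.jpg"">'

# Pattern from the question: the parentheses inside the lookarounds
# are capturing groups, so findall returns a tuple of the two groups.
with_groups = re.compile(
    r'(?!(<IMG src=""))http:\/\/i[0-9][0-9][0-9]\.photobucket\.com'
    r'\/albums\/[^\/]*\/[^\/]*\/[^\/]*(?=("">))'
)

# Possible rewrite (an assumption, not the only fix): no capturing
# groups, and a fixed-width lookbehind for the text before the URL.
without_groups = re.compile(
    r'(?<=<IMG src="")http://i[0-9]{3}\.photobucket\.com'
    r'/albums/[^/]*/[^/]*/[^/"]*(?="">)'
)

tuples = with_groups.findall(sample)
urls = without_groups.findall(sample)

print(tuples)  # one tuple per match: (group 1, group 2)
print(urls)    # one string per match: the URL itself
```

If the groups need to stay in the pattern, iterating over `pattern.finditer(file_cont)` and reading `match.group(0)` also yields the full matched text.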