
I'm really new to Python, but I want to make a web scraping application that looks up a popular picture-sharing website's gallery and downloads the 15 most recently uploaded pictures. I got as far as collecting the URLs that point to the jpgs and saving them into a txt file. Then I open the file, try to read it line by line, download the jpgs with requests, and save them to separate files, using uuid to generate random filenames. My final goal is to write something that will automatically categorize pictures uploaded by random people, e.g. cats, dogs, furniture.

I've tried researching the topic, but I'm really confused. I'd love some feedback.

import requests
from bs4 import BeautifulSoup
import re

link = 'link'

ip = '176.88.217.170:8080'
proxies = {
  'http': ip,
  'https': ip,
}

r = requests.get(link, proxies=proxies)

import uuid

unique_filename = str(uuid.uuid4())

print(unique_filename)

#r = requests.get(link)
c = r.content

bs = BeautifulSoup(c, 'html.parser')

images = bs.find_all('img', {'src':re.compile('_tn.jpg')})
with open('data.txt', 'w') as f:
    for image in images:
        f.write(image['src']+'\n')
        print('done')

for mentes in images:
    with open('data.txt', 'r+')  as read:
        cnt = 0
        for line in read:
            line = line.strip()
            line = read.readline()
            cnt += 1
            print(cnt)
            print(line)

   with open(unique_filename +'.jpg' , 'wb') as kep:
            kep.write(requests.get(line , proxies=proxies).content)
            print(line)
            kep.close()
            print('saved')

I want to save the scraped images as jpgs under randomly generated names for future use.

I'm mainly asking for a direction or a suggestion for what I should look up next, because my logic and skills are lacking.

Chathuranga Chandrasekara

2 Answers


Do you need the data.txt file? Can't you just keep the 15 URLs in memory? Anyway, if I understand the question correctly, the main problem is downloading an image given its URL. In that case, this answer will probably help you.
One way to do it is this:

import urllib.request
import uuid

# Read back the saved URLs, one per line
with open('data.txt', 'r') as data_file:
    urls = data_file.read().splitlines()

for url in urls:
    # Generate a fresh random filename for every download
    unique_filename = str(uuid.uuid4()) + '.jpg'
    with open(unique_filename, 'wb') as jpeg_file:
        online_file = urllib.request.urlopen(url)
        jpeg_file.write(online_file.read())
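
If you don't need data.txt at all, you can also download straight from the list that is already in memory. A minimal sketch, assuming the images result set and the proxies dict from the question's snippet, and that the src attributes are absolute URLs (taking only the first 15, as described):

import uuid
import requests

# Download the first 15 thumbnails directly from the parsed <img> tags
for image in images[:15]:
    url = image['src']
    unique_filename = str(uuid.uuid4()) + '.jpg'  # fresh name per image
    response = requests.get(url, proxies=proxies)
    with open(unique_filename, 'wb') as jpeg_file:
        jpeg_file.write(response.content)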
Rens Oliemans
  • Thank you for the reply. Yes, you understood my problem correctly. I store the files in txt because... uhh... I wanted to learn how to manipulate files... turns out it is harder than expected. Your reply helped out a lot :) I'm going to keep on reading and trying. Thanks again – laktozmentes Jun 16 '19 at 11:13
import requests
from bs4 import BeautifulSoup
import re
import uuid

link = 'link'

ip = '176.88.217.170:8080'
proxies = {
    'http': ip,
    'https': ip,
}

r = requests.get(link, proxies=proxies)
c = r.content

bs = BeautifulSoup(c, 'html.parser')

# Collect every thumbnail <img> whose src contains _tn.jpg
images = bs.find_all('img', {'src': re.compile('_tn.jpg')})

# Save the URLs to data.txt, one per line
with open('data.txt', 'w') as f:
    for image in images:
        f.write(image['src'] + '\n')
print('done')

# Read the file back and download each URL under a fresh random name
with open('data.txt', 'r') as read:
    for cnt, line in enumerate(read, start=1):
        line = line.strip()
        print(cnt, line)
        unique_filename = str(uuid.uuid4()) + '.jpg'
        with open(unique_filename, 'wb') as kep:
            kep.write(requests.get(line, proxies=proxies).content)
        print('saved')
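
The two details that matter here: data.txt is opened for reading only once rather than being reopened for every image, and a new UUID is generated inside the download loop, so each image is written to its own .jpg file instead of overwriting the same one.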
DaveL17
  • As it's currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Community Jul 31 '23 at 00:38