I'm scraping a site for data using beautifulsoup4, and I'm not sure how to get only the data I want without calling an unwanted object. My script prints an extra dictionary after the job title, and I've failed to get rid of it.

import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows 10; Win64; x64; rv:101.0.1) Gecko/20100101 Firefox/101.0.1'}
url = "https://elitejobstoday.com/job-category/education-jobs-in-uganda/"

r = requests.get(url, headers = headers)
c = r.content
soup = BeautifulSoup(c, "html.parser")

table = soup.find("div", attrs={"article": "loadmore-item"})

def jobScan(link):
    
    the_job = {}

    job = requests.get(url, headers = headers)
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "html.parser")
  
    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    
    the_job['title'] = title
    print('The job is: {}'.format(title))

    print(the_job)

    return the_job

jobScan(table)

This is the result it fetches:

PS C:\Users\MUHUMUZA IVAN\Desktop\JobPortal> py absa.py
The job is: 25 Credit Officers (Group lending) at ENCOT Microfinance Ltd
{'urlLink': 'https://elitejobstoday.com/job-category/education-jobs-in-uganda/', 'title': '25 Credit Officers (Group lending) at ENCOT Microfinance Ltd'}

I want to be able to retain "The job is: 25 Credit Officers (Group lending) at ENCOT Microfinance Ltd" and drop "{'urlLink': 'https://elitejobstoday.com/job-category/education-jobs-in-uganda/', 'title': '25 Credit Officers (Group lending) at ENCOT Microfinance Ltd'}"

Grismar
Muhumuza
    In your own words, where the code says `print(the_job)`, what do you think this means? Which part of the output do you think is produced here? What do you think would happen if you removed it? Do you understand why? Where the code says `return the_job`, what do you think is the purpose of this? Where does the `jobScan` function get called? What will it do with the value that is returned? Please read [ask] and https://ericlippert.com/2014/03/05/how-to-debug-small-programs, and try to reason out basic logic errors yourself by carefully studying the code. – Karl Knechtel Jun 30 '22 at 22:25
    Also, please only use tags to describe the *problem you are asking about*, not the overall task or context. A Django expert clearly has no special advantage in answering a question like this, for example. It actually also *doesn't matter* for this problem that the data comes from using BeautifulSoup. You would, hopefully, have realized this, if you had attempted to create a [mre]. Finally, please try to use terminology correctly - in the long run, you will find that it improves your thinking. "calling an unwanted object" does not make any sense here. – Karl Knechtel Jun 30 '22 at 22:26
    You simply need to remove `print(the_job)` - however, it also looks like you may be confused about the difference between printing something and returning something. Your function does both, it prints to the screen, but then it also returns the value of `the_job`, it just doesn't do anything with that value where you're calling the function. Typically you wouldn't want to print in the function, but just return the relevant result and then use it wherever you called the function (for printing, for example) – Grismar Jun 30 '22 at 22:32
  • I'm only printing to see the results in the terminal. I left those in there so that you could see exactly what I was working with. – Muhumuza Jun 30 '22 at 23:02

1 Answer

If you just want the desired output to be printed, you don't need the dictionary or any return value. Just print the title and remove the second print, `print(the_job)`, which is what produces the unwanted dictionary line.

import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows 10; Win64; x64; rv:101.0.1) Gecko/20100101 Firefox/101.0.1'}
url = "https://elitejobstoday.com/job-category/education-jobs-in-uganda/"


def jobScan(link):
    # Fetch the page passed in as `link` (the question's version ignored
    # this argument and always requested the global `url`).
    job = requests.get(link, headers=headers)
    jobSoup = BeautifulSoup(job.content, "html.parser")

    # The first <h3 class="loop-item-title"> holds the job title in an <a> tag.
    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text

    # Print only the formatted title; no dictionary is built or printed.
    print('The job is: {}'.format(title))


jobScan(url)
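
If the dictionary is still needed elsewhere, the comments on the question point to another option: have the function return the data and leave the printing to the caller. The sketch below shows only that print-versus-return split, with no network access. The names `extract_job` and `format_job` are hypothetical, and the title and URL are hard-coded from the output shown in the question rather than scraped.

```python
def extract_job(title, url):
    """Build and return the job data; nothing is printed here."""
    return {"urlLink": url, "title": title.strip()}


def format_job(job):
    """Turn a job dictionary into the one line the caller wants to show."""
    return "The job is: {}".format(job["title"])


# Hard-coded sample values taken from the question's output (an assumption
# standing in for the scraped page).
job = extract_job(
    "25 Credit Officers (Group lending) at ENCOT Microfinance Ltd",
    "https://elitejobstoday.com/job-category/education-jobs-in-uganda/",
)

# The caller decides what to print; the dictionary itself is never printed.
print(format_job(job))
# prints: The job is: 25 Credit Officers (Group lending) at ENCOT Microfinance Ltd
```

With this split, the returned dictionary can still be saved or passed on, while the terminal shows only the formatted title.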
Qdr