
I am scraping a series of URLs with this code:

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList%5Bprofession_name.fr.Tech%5D%5B%5D=Data%20Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
html = driver.get(url)
time.sleep(3)
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")

for elem in elems:
    urls = elem.get_attribute("href")
    print(urls)

This prints the results I expect. The problem comes when I try to put these "urls" into my empty DataFrame "df1" with the following code:

df_test = df1.append({'URLS' : urls}, ignore_index = True)
df_test.head()

It does not show the URLs I want. It doesn't raise an error, but the result doesn't really make sense.

I am a beginner in Python, so there is probably a simple answer to my question. I hope I was clear.

Hichhich

1 Answer


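The problem with your code is that each loop iteration overwrites the urls variable, and the df1.append call only runs once, after the loop has finished. So only the last scraped URL ends up in the DataFrame. Move the df1.append statement inside the for block and assign its result back to df1: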

import time

import pandas as pd
from selenium import webdriver

df1 = pd.DataFrame()
url = 'https://www.welcometothejungle.com/fr/jobs?page=1&refinementList%5Bprofession_name.fr.Tech%5D%5B%5D=Data%20Science'
path = '/Users/jdkj/desktop/chromedriver 3'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome((path), chrome_options=options)
driver.get(url)  # get() returns None, so there is nothing to assign
time.sleep(3)  # give the page time to render
elems = driver.find_elements_by_xpath("//article/div[2]/header/a['href']")

for elem in elems:
    job_url = elem.get_attribute("href")  # <--- get the URL from the <a> tag
    df1 = df1.append({'URLS': job_url}, ignore_index=True)  # <--- add the URL as a new row in the URLS column
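A note for anyone on a newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the loop above only works on older versions. A minimal sketch of an alternative, assuming you first collect the href strings into a plain list and then build the DataFrame once. The helper name hrefs_to_frame and the example.com URLs are made up for illustration; in the real script the list would come from the Selenium elements.

```python
import pandas as pd


def hrefs_to_frame(hrefs):
    """Build a one-column DataFrame from an iterable of URL strings."""
    # Skip None/empty values (get_attribute can return None when the attribute is missing).
    rows = [{"URLS": href} for href in hrefs if href]
    # Passing columns= keeps the URLS column even when nothing was scraped.
    return pd.DataFrame(rows, columns=["URLS"])


# Hypothetical stand-in for: [elem.get_attribute("href") for elem in elems]
sample_hrefs = [
    "https://example.com/jobs/data-analyst",
    "https://example.com/jobs/data-scientist",
]

df1 = hrefs_to_frame(sample_hrefs)
print(df1)
```

Building the frame once from a list also avoids copying the whole DataFrame on every iteration, which is what repeated appends do.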
  • Almost there! It does return the output row by row, but strangely it doesn't show the whole URL. For example, the original URL is https://www.welcometothejungle.com/fr/companies/cdiscount/jobs/stage-data-analyst-f-h_bordeaux-33_CDISC_85y0eQQ , while the output URL is https://www.welcometothejungle.com/fr/companie... – Hichhich Oct 26 '21 at 14:15
  • Try putting `pd.set_option('display.max_colwidth', None)` after your `import pandas`, see https://stackoverflow.com/a/25352191. –  Oct 26 '21 at 14:18
  • @Hichhich the problem could be the way you are scraping your elements. Doesn't the a tag have a class or ID that you could use to scrape it? And please edit your question to include the original URLs and the output URLs – dsenese Oct 26 '21 at 14:26
  • Yes of course, sorry, newbie on this forum :D – Hichhich Oct 26 '21 at 15:00
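On the truncated URL from the comments: as far as I can tell this is only a display limit, not lost data. pandas shortens long strings when it prints a DataFrame (display.max_colwidth defaults to 50 characters), but the cell still holds the full string. A small sketch to check this, assuming pandas 1.0 or later (where None is accepted for this option) and using the URL quoted in the comment above:

```python
import pandas as pd

full_url = (
    "https://www.welcometothejungle.com/fr/companies/cdiscount/jobs/"
    "stage-data-analyst-f-h_bordeaux-33_CDISC_85y0eQQ"
)
df = pd.DataFrame({"URLS": [full_url]})

# The stored value is the complete string, regardless of how it prints.
stored_value = df.loc[0, "URLS"]

# Printed form under an explicit 50-character column limit (the pandas default).
with pd.option_context("display.max_colwidth", 50):
    truncated_repr = repr(df)

# Printed form with the column width limit lifted, scoped to this block only.
with pd.option_context("display.max_colwidth", None):
    wide_repr = repr(df)

print(truncated_repr)
print(wide_repr)
```

pd.set_option('display.max_colwidth', None) does the same thing for the whole session; option_context is used here only so the setting is limited to the block.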