
I am trying to scrape a web page using Selenium and Beautiful Soup, but I cannot get Selenium to find the element I need and return its text.

Here is the HTML:

<span class="t-14 t-normal">
            <span aria-hidden="true"><!---->Crédit Agricole CIB · Full-time<!----></span><span class="visually-hidden"><!---->Crédit Agricole CIB · Full-time<!----></span>
          </span>

Do you know how to get the text 'Crédit Agricole CIB Full-time' from this HTML?

I am trying to do something like this:

src = driver.page_source
soup = BeautifulSoup(src, 'lxml')                                    # Now using beautiful soup
intro = soup.find('div', {'class': 'pv-text-details__left-panel'})

text_loc = intro.find( ???? )                                        # Extracting the text
text = text_loc.get_text().strip()                                   # Removing extra blank space

I do not know what to put in place of the ???? placeholder.

LaC

1 Answer


I can't confirm this without seeing the full HTML, since there might be other, very similarly nested elements before the snippet shared in the question. If there aren't, you can use soup.select_one with the CSS selectors below:

spanTxt1 = soup.select_one('span.t-14.t-normal span[aria-hidden="true"]')
if spanTxt1 is not None: spanTxt1 = spanTxt1.get_text(strip=True)

spanTxt2 = soup.select_one('span.t-14.t-normal span.visually-hidden')
if spanTxt2 is not None: spanTxt2 = spanTxt2.get_text(strip=True)

print(f' Text1: "{spanTxt1}" \n Text2: "{spanTxt2}" ')

This should give the output:

 Text1: "Crédit Agricole CIB · Full-time" 
 Text2: "Crédit Agricole CIB · Full-time" 
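
To try the selectors without a live page, here is a minimal sketch that runs them on the snippet from the question. The helper name first_text is made up for this example, and html.parser is used only so that nothing besides bs4 has to be installed; it is an assumption that the full page behaves the same way.

```python
from bs4 import BeautifulSoup

# The snippet shared in the question.
html = '''
<span class="t-14 t-normal">
    <span aria-hidden="true"><!---->Crédit Agricole CIB · Full-time<!----></span><span class="visually-hidden"><!---->Crédit Agricole CIB · Full-time<!----></span>
</span>
'''


def first_text(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches.

    Hypothetical helper: it only wraps the repeated 'is not None' check.
    """
    tag = soup.select_one(selector)
    return None if tag is None else tag.get_text(strip=True)


# html.parser ships with Python; 'lxml' (as in the question) should work too.
soup = BeautifulSoup(html, 'html.parser')

text1 = first_text(soup, 'span.t-14.t-normal span[aria-hidden="true"]')
text2 = first_text(soup, 'span.t-14.t-normal span.visually-hidden')
missing = first_text(soup, 'span.no-such-class')  # selector with no match

print(f' Text1: "{text1}" \n Text2: "{text2}" \n Missing: {missing}')
```

Note that select_one returns the first match in document order, so on a full page an earlier element with the same classes would be picked up instead.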


EDIT:

I think the ember... section ids (such as ember941) are dynamically generated and might be different every time the page loads. A more reliable selector for the jobs listed in the experience section might be

expSel = 'div#experience ~ div.pvs-list__outer-container ul.pvs-list li'

(It's going for the list next to the [empty] div id="experience" anchor)

You can even choose a specific experience from the list by changing the end to li:nth-child(2) for the second experience, li:last-child for the last experience, li:nth-last-child(2) for the second-to-last experience, etc...
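
As a rough illustration of those positional variants, the sketch below runs them on a simplified stand-in for the experience list. The HTML is something I wrote from the class names in the selector, not LinkedIn's real markup, and listSel and company_at are hypothetical names used only here.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the experience section (an assumption, not real markup).
html = '''
<section>
  <div id="experience"></div>
  <div class="pvs-list__outer-container">
    <ul class="pvs-list">
      <li><span class="t-14 t-normal"><span aria-hidden="true">Crédit Agricole CIB · Full-time</span></span></li>
      <li><span class="t-14 t-normal"><span aria-hidden="true">ODDO BHF · Internship</span></span></li>
      <li><span class="t-14 t-normal"><span aria-hidden="true">HSBC · Internship</span></span></li>
      <li><span class="t-14 t-normal"><span aria-hidden="true">Capgemini · Internship</span></span></li>
    </ul>
  </div>
</section>
'''

soup = BeautifulSoup(html, 'html.parser')

# Same as expSel, but without the trailing "li" so it can be varied.
listSel = 'div#experience ~ div.pvs-list__outer-container ul.pvs-list'


def company_at(li_selector):
    """Return the company text for the list item picked by li_selector, or None."""
    tag = soup.select_one(f'{listSel} {li_selector} span.t-14.t-normal span')
    return None if tag is None else tag.get_text(strip=True)


second = company_at('li:nth-child(2)')               # second experience
last = company_at('li:last-child')                   # last experience
second_to_last = company_at('li:nth-last-child(2)')  # second-to-last experience

print(second, last, second_to_last, sep=' | ')
```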

You could directly add on to the selector to get the first company:

c1span =  soup.select_one(expSel+' span.t-14.t-normal span')
if c1span is not None:
    print(c1span.get_text(strip=True))

and that should print Crédit Agricole CIB · Full-time


You could also use expSel to get all of the listed experiences:

expSelRef = {
    'Position': 'span.mr1.t-bold',  
    'Company+Type': 'span.t-14.t-normal',
    'Dates': 'span.t-14.t-normal.t-black--light', 
    'Location': 'span.t-14.t-normal.t-black--light + span'
}
for e in soup.select(expSel):
    for r in expSelRef:
        eDet = e.select_one(expSelRef[r]+' span[aria-hidden="true"]')
        if eDet is not None: 
            print(f' [ {r}: "{eDet.get_text(strip=True)}" ] ', end='')
    print()

output:

 [ Position: "Structured Products & Equity Derivatives Sales" ]  [ Company+Type: "Crédit Agricole CIB · Full-time" ]  [ Dates: "Jan 2020 - Present · 2 yrs 10 mos" ]  [ Location: "Paris, Île-de-France, France" ] 
 [ Position: "Equity Sales Trader Assistant" ]  [ Company+Type: "ODDO BHF · Internship" ]  [ Dates: "Jun 2019 - Jan 2020 · 8 mos" ]  [ Location: "Paris, Île-de-France, France" ] 
 [ Position: "Wealth Management Analyst" ]  [ Company+Type: "HSBC · Internship" ]  [ Dates: "Mar 2018 - Sep 2018 · 7 mos" ]  [ Location: "Paris, Île-de-France, France" ] 
 [ Position: "Business Developper" ]  [ Company+Type: "Capgemini · Internship" ]  [ Dates: "Jan 2017 - Aug 2017 · 8 mos" ] 
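
If a table is more useful than printed lines, the same loop can collect one dict per experience instead. This is only a sketch under assumptions: the HTML is a simplified stand-in built from the class names in expSelRef (not LinkedIn's real markup), and experience_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for two experience entries (an assumption, not real markup).
# The second entry has no location, like the last row of the output above.
html = '''
<section>
  <div id="experience"></div>
  <div class="pvs-list__outer-container">
    <ul class="pvs-list">
      <li>
        <span class="mr1 t-bold"><span aria-hidden="true">Structured Products &amp; Equity Derivatives Sales</span></span>
        <span class="t-14 t-normal"><span aria-hidden="true">Crédit Agricole CIB · Full-time</span></span>
        <span class="t-14 t-normal t-black--light"><span aria-hidden="true">Jan 2020 - Present · 2 yrs 10 mos</span></span>
        <span class="t-14 t-normal t-black--light"><span aria-hidden="true">Paris, Île-de-France, France</span></span>
      </li>
      <li>
        <span class="mr1 t-bold"><span aria-hidden="true">Business Developper</span></span>
        <span class="t-14 t-normal"><span aria-hidden="true">Capgemini · Internship</span></span>
        <span class="t-14 t-normal t-black--light"><span aria-hidden="true">Jan 2017 - Aug 2017 · 8 mos</span></span>
      </li>
    </ul>
  </div>
</section>
'''

expSel = 'div#experience ~ div.pvs-list__outer-container ul.pvs-list li'
expSelRef = {
    'Position': 'span.mr1.t-bold',
    'Company+Type': 'span.t-14.t-normal',
    'Dates': 'span.t-14.t-normal.t-black--light',
    'Location': 'span.t-14.t-normal.t-black--light + span',
}


def experience_rows(soup):
    """Return one dict per list item; fields that are not found are set to None."""
    rows = []
    for e in soup.select(expSel):
        row = {}
        for name, sel in expSelRef.items():
            eDet = e.select_one(sel + ' span[aria-hidden="true"]')
            row[name] = None if eDet is None else eDet.get_text(strip=True)
        rows.append(row)
    return rows


rows = experience_rows(BeautifulSoup(html, 'html.parser'))
for row in rows:
    print(row)
```

A list of dicts like this can be passed to pandas.DataFrame(rows) if pandas is installed, with the expSelRef keys becoming the column names.
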
Driftr95
  • Hi, it works but it returns another 'span text' above, as you expected it to. Do you know how to select the second one? The text I want is exactly the same but the second one. – LaC Oct 26 '22 at 06:21
  • The span class above appears to be different from the first one: Structured Products & Equity Derivatives Sales – LaC Oct 26 '22 at 06:24
  • @HugoChikli To print "Structured Products & Equity Derivatives Sales" from the second span, use `print(soup.select_one('.mr1.t-bold').get_text(strip=True))` – Driftr95 Oct 26 '22 at 07:03
  • @HugoChikli or did you mean that you're getting "Structured Products & Equity Derivatives Sales" but you don't want to? Because then you might try my answer with `spanTxt2 = soup.select_one('span.t-14.t-normal > span.visually-hidden')` [the added `>` specifies direct descendants (children) only] - can't really tell for sure without seeing full html though – Driftr95 Oct 26 '22 at 07:04
  • When I use your code, I get a text from a section a little before on the website that has precisely the same division HTML. The website is a LinkedIn profile page where I wanted to get the information about the experience. – LaC Oct 26 '22 at 07:23
  • I want the text from this section
    and not ember940
    – LaC Oct 26 '22 at 07:42
  • @HugoChikli try `spanTxt2 = soup.select_one('section#ember941 span.t-14.t-normal > span.visually-hidden')` ? – Driftr95 Oct 26 '22 at 07:49
  • I am getting a 'None' – LaC Oct 26 '22 at 08:25
  • spanTxt1 = soup.select_one('span.t-14.t-normal span[aria-hidden="true"]') if spanTxt1 is not None: spanTxt1 = spanTxt1.get_text(strip=True) spanTxt2 = soup.select_one('section#ember941 span.t-14.t-normal span.visually-hidden') if spanTxt2 is not None: spanTxt2 = spanTxt2.get_text(strip=True) print(f' Text1: "{spanTxt1}" \n Text2: "{spanTxt2}" ') – LaC Oct 26 '22 at 08:25
  • Text1: "Thibault’s recent posts and comments will be displayed here." Text2: "None" – LaC Oct 26 '22 at 08:25
  • @HugoChikli I can't debug without the actual link/ full html – Driftr95 Oct 26 '22 at 13:58
  • here is the link: https://www.linkedin.com/in/thibault-arrighi-a43296130/ – LaC Oct 26 '22 at 14:09
  • So I want to get the first experience and after ideally the company and start date ! Let me know if you can help – LaC Oct 26 '22 at 14:10
  • @HugoChikli please see my edits and let me know if it works. I added a few more details than asked for; feel free to ignore the second part of my edit – Driftr95 Oct 26 '22 at 18:27
  • @HugoChikli ....I do wish you had shared the link earlier - I actually once posted [another answer](https://stackoverflow.com/a/73957068/12652373) about the experience section of LinkedIn which might have helped you – Driftr95 Oct 26 '22 at 18:27
  • Yes, it is working now thank you!! I am just not sure what refers to what? So if I want to export this in pandas data frame should I just use expSelRef? – LaC Oct 26 '22 at 21:35
  • Hi, don't bother, I figured it out!! Big thanks @Driftr95!! This truly helps – LaC Oct 26 '22 at 22:06