Here is the HTML:

<div class="ajaxcourseindentfix">
    <h3>CPSC 353 - Introduction to Computer Security (3) </h3>
    <hr>Security goals, security systems, access controls, networks and security, integrity, cryptography fundamentals, authentication. Attacks: software, network, website; management considerations, security standards in government and industry; security issues in requirements, architecture, design, implementation, testing, operation, maintenance, acquisition, and services.
    <br>
    <br>Prerequisite: <a href="preview_course_nopop.php?catoid=16&amp;coid=96570" onclick="acalogPopup()">CPSC 253U</a>
    <span style="display: none !important">&nbsp;</span>&nbsp;or <a href="#" onclick="acalogPopup()" target="_blank">CPSC 254</a>
    <span style="display: none !important">&nbsp;</span>&nbsp;and <a href="#" onclick="acalogPopup()" target="_blank">CPSC 351</a>
    <span style="display: none !important">&nbsp;</span>
    , declared major/minor in CPSC, CPEN, or CPEI
    <br>
</div>

I need to fetch the following text from this HTML:

From Line 6 - or
From Line 7 - and
, declared major/minor in CPSC, CPEN, or CPEI

I am able to get the link elements (the course numbers, e.g. CPSC 254) with the following XPath:

 # This XPath gives me all the elements after the h3 tag; I then iterate through them in my script.
//div[@class='ajaxcourseindentfix']/h3/following-sibling::text()[2]/following-sibling::*

Update

I then get the text with the following XPath:

# This xpath gives me all the text after the h3 tag.  
//div[@class='ajaxcourseindentfix']/h3/following-sibling::text()[2]/following-sibling::text()

I need the course names and prerequisites in the same form as they appear at URL 1.

With this approach I get all the link elements first and then all the text. Is there a better way to achieve this? I don't want to run two XPath queries, one for the links and one for the text, and then stitch the results together to form the prerequisite string.

1 http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=99648&show

miserable
  • Try this one https://stackoverflow.com/questions/3442394/using-text-to-retrieve-only-text-not-nested-in-child-tags – Vijay Sankhat Mar 17 '18 at 07:37
  • So your desired output is just `"or", "and", ", declared major/minor in CPSC, CPEN, or CPEI"`, right? – Andersson Mar 17 '18 at 07:54
  • My desired output is: CPSC 253U or CPSC 254 and CPSC 351, declared major/minor in CPSC, CPEN, or CPEI. Just in the form of text. – miserable Mar 17 '18 at 07:57

1 Answer

Try the code below to get the required output:

# soup is the BeautifulSoup object built from the page source
div = soup.select("div.ajaxcourseindentfix")[0]
" ".join(div.stripped_strings).split("Prerequisite: ")[-1]

The output is

'CPSC 253U or CPSC 254 and CPSC 351 , declared major/minor in CPSC, CPEN, or CPEI'
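To see why joining the stripped text nodes works, here is a minimal sketch of the same idea using only the standard library's `html.parser` instead of BeautifulSoup. The names `TextCollector`, `prerequisite_text` and `SAMPLE` are made up for illustration, and `SAMPLE` is an abbreviated copy of the HTML from the question (the course description is shortened).

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text node in document order."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as "\xa0"
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()  # str.strip() also removes "\xa0"
        if text:
            self.chunks.append(text)


def prerequisite_text(html):
    """Join all text nodes and keep what follows 'Prerequisite: '."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split("Prerequisite: ")[-1]


# Abbreviated copy of the question's HTML (description shortened).
SAMPLE = """
<div class="ajaxcourseindentfix">
    <h3>CPSC 353 - Introduction to Computer Security (3) </h3>
    <hr>Security goals, security systems, access controls.
    <br>
    <br>Prerequisite: <a href="preview_course_nopop.php?catoid=16&amp;coid=96570" onclick="acalogPopup()">CPSC 253U</a>
    <span style="display: none !important">&nbsp;</span>&nbsp;or <a href="#" onclick="acalogPopup()" target="_blank">CPSC 254</a>
    <span style="display: none !important">&nbsp;</span>&nbsp;and <a href="#" onclick="acalogPopup()" target="_blank">CPSC 351</a>
    <span style="display: none !important">&nbsp;</span>
    , declared major/minor in CPSC, CPEN, or CPEI
    <br>
</div>
"""

print(prerequisite_text(SAMPLE))
# CPSC 253U or CPSC 254 and CPSC 351 , declared major/minor in CPSC, CPEN, or CPEI
```

Because link text and bare text are both just text nodes, a single pass in document order yields them already interleaved, so there is no need to fetch the links and the text separately.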
Andersson
  • I tried it on http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=99347&show, but this is what I get in div.text: '    [Add to Portfolio] [Print Course] ' – miserable Mar 17 '18 at 09:19
  • ? This is not what I've suggested. Did you try my code? – Andersson Mar 17 '18 at 09:22
  • Yeah. Here is how I tried: `course_preview_page = requests.get(course_preview_URL) soup = BeautifulSoup(course_preview_page.content, 'lxml') div = soup.select("div.ajaxcourseindentfix")[0] " ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1]` – miserable Mar 17 '18 at 09:32
  • Oh, yep. There are two `div` nodes with the same class name. Try `div = soup.select("div.ajaxcourseindentfix")[1]` – Andersson Mar 17 '18 at 09:39
  • Yup. Got it. Thanks so much, Andersson. – miserable Mar 17 '18 at 09:50
  • Is there a way to pass multiple strings to split()? In some places it's corerequisite and in others prerequisite. Some have corerequisite`s` and prerequisite`s` – miserable Mar 21 '18 at 09:18
  • Hm... AFAIK, no, but you can create a new question regarding this issue. You can also try `" ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1].split("Corerequisite: ")[-1]`. – Andersson Mar 21 '18 at 09:35
  • Yeah, but this way it will find the Corerequisite inside the Prerequisite, correct? Created a new question - https://stackoverflow.com/questions/49402888/how-to-get-text-which-has-no-html-tag-add-multiple-delimiters-in-split – miserable Mar 21 '18 at 09:46
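
On the multiple-delimiter question from the last comments: `str.split()` takes a single separator, but `re.split()` accepts a pattern, so one possible approach is to split on any of the labels at once. The sketch below is only an illustration: `LABEL`, `requirement_sections` and the sample string are made up here, and the pattern assumes the labels look like "Prerequisite(s):" or "Corequisite(s):" (it also accepts the "Corerequisite" spelling used in the comments above).

```python
import re

# One capturing group, so re.split() keeps each matched label in its result.
# The optional "s" and the colon sit outside the group and are discarded.
LABEL = re.compile(r"\b(Prerequisite|Corequisite|Corerequisite)s?:\s*")


def requirement_sections(text):
    """Return {label: body} for every requisite label found in text."""
    parts = LABEL.split(text)
    # parts looks like [before, label1, body1, label2, body2, ...]
    return {
        label: body.strip()
        for label, body in zip(parts[1::2], parts[2::2])
    }


# Hypothetical joined text, in the shape produced by the answer's join().
sample = (
    "CPSC 353 - Introduction to Computer Security (3) "
    "Prerequisite: CPSC 254 and CPSC 351 Corequisites: CPSC 362"
)
print(requirement_sections(sample))
# {'Prerequisite': 'CPSC 254 and CPSC 351', 'Corequisite': 'CPSC 362'}
```

Keeping each label next to its own body avoids the problem raised above, where chaining two `split()` calls would leave only the text after the last label.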