Find all occurrences of 'Php' on a page, ignoring case, with BeautifulSoup

Question

I'm looking to find all the occurrences of Php on a page (ignoring case) with BeautifulSoup in Python 3.

Php (regardless of case) could occur anywhere on the page, so I just want to match the string wherever it appears, not within a specific div or class.

I currently have:

from bs4 import BeautifulSoup  # bs4 is the package name on Python 3
import requests

school_urls = ['somesite1.com', 'somesite2.com']
posting_keywords = ['PHP', 'Php', 'php']

for school in school_urls:
    ...  # loop body not included in the post

Here school contains the HTML markup returned from requesting a URL, with words like php in it.

How does this look to you? Is there a way in Beautiful Soup to find all variations of php, ignoring case, instead of having to loop through posting_keywords?

Thanks

Have you tried running this? Does it work? Does it fail? What's your question? — Sean McSomething, Jan 25 '17 at 20:06
I did test it. The problem is that if 'Php' is in, say, a link, it finds it. I want it only if it's text, e.g. `Php Rocks`, not a link, e.g. `<a href="somesite.com/php-rocks">some text</a>` — Jshee, Jan 25 '17 at 21:00

score 0 · Answer 1 · answered Jan 25 '17 at 19:13

Does posting_keywords.lower() work for you?
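
One caveat: `lower()` is a string method, so it cannot be called on the `posting_keywords` list itself. A minimal sketch of the same idea, assuming the page HTML is already held in a string (the `page_html` value below is a hypothetical stand-in, not taken from the question), lowercases the page and searches for a single lowercase keyword:

```python
# Hypothetical stand-in for the HTML text of one fetched page.
page_html = "<p>Php Rocks</p><p>I like PHP and php.</p>"

keyword = "php"

# Lowercase the page text once, then count the lowercase keyword.
# This replaces the loop over ['PHP', 'Php', 'php'].
count = page_html.lower().count(keyword)
print(count)
```

This searches the raw markup, so it would also count matches inside tag attributes such as link URLs.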

thinkvitamin


That would look for only `php`. I want to find `Php` or `PHP` in html output if it exists too – Jshee Jan 25 '17 at 19:14
Could this `lower` method be applied to `res` above ? – Jshee Jan 25 '17 at 19:15

score 0 · Answer 2 · answered Jan 26 '17 at 02:14

import re
import bs4

# Sample document containing "php" in several capitalizations
text = """
<html><head><title>The Dormouse's story php</title></head>
<body>
<p class="title"><b>The Dormouse's story PHP</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">php</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Php</a> and
<a href="http://example.com/tillie" class="sister" id="link3">php Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = bs4.BeautifulSoup(text, 'lxml')
# Match every text node containing "php" in any case.
# (string= was named text= before Beautiful Soup 4.4.0.)
soup.find_all(string=re.compile(r'php', re.IGNORECASE))

out:

["The Dormouse's story php",
 "The Dormouse's story PHP",
 'php',
 'Php',
 'php Tillie']
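
Because the `string` argument is matched against text nodes only, attribute values such as `href` are not searched, which is the behaviour asked for in the question's comments. A small sketch of that, using hypothetical markup modelled on the comment's example and the built-in `html.parser` backend (an assumption made to avoid the lxml dependency):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: "php" appears once in an href and once in visible text.
html = '<a href="somesite.com/php-rocks">some text</a><p>Php Rocks</p>'

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"php", re.IGNORECASE)

# string= is tested against text nodes only, so the href value is never examined.
text_hits = soup.find_all(string=pattern)

# Pull the individual occurrences out of the matched text nodes.
occurrences = [m for s in text_hits for m in pattern.findall(s)]

print(text_hits)
print(occurrences)
```

Here `text_hits` holds the whole matching strings, while `occurrences` holds each matched word, which is useful when one text node contains the keyword more than once.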
