Solution
Long story short:
import re
import requests

url = "https://www.happyboxstudio.com/shop"
pattern = "opening soon"

# Count the case-insensitive occurrences of the pattern in the page source
len(re.findall(
    pattern=pattern,
    string=requests.get(url).text,
    flags=re.IGNORECASE
))
Problem in your approach:
Your approach returns multiple results for the same occurrence of a keyword/search pattern, because nested tags that wrap the same text are each returned as a separate match. Try the following:
url = "https://www.happyboxstudio.com/shop"
r = requests.get(url=url)
soup = BeautifulSoup(r.content, 'html.parser')
results = soup.find_all(True,text=re.compile(r'Opening Soon', re.I))
print(f'Total matches: {len(results)}') # Total matches: 3
# This shows that there is duplication
# for the same single occurrence
# the 3rd result is present in the 2nd result
# the 2nd result is present in the 1st result
(results[1] in results[0], results[2] in results[1]) # (True, True)
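The nesting effect can be reproduced offline. The following is a minimal sketch, not the code from the question: it uses `xml.etree.ElementTree` from the standard library as a stand-in for BeautifulSoup (the matching rules are not identical), and the HTML snippet is a made-up example. It shows that one phrase is "found" once per enclosing tag, while a search over the plain text finds it once.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet: one phrase nested inside three tags
snippet = "<div><p><span>Opening Soon</span></p></div>"
root = ET.fromstring(snippet)
pattern = re.compile(r"opening soon", re.IGNORECASE)

# Tag-level search: every enclosing element contains the same text
tag_hits = [el.tag for el in root.iter() if pattern.search("".join(el.itertext()))]

# Text-level search: the phrase occurs exactly once in the document
text_hits = pattern.findall("".join(root.itertext()))

print(tag_hits)        # ['div', 'p', 'span']
print(len(text_hits))  # 1
```

Counting on the text rather than on the tags is what the suggested solution below relies on.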
A. Suggested Solution
So I suggest using requests.get(url).text as the string to search. Here is one way to do it, using the data from the Dummy Data section (C) below.
A.1.
This shows how to count occurrences with the convenience function count_presence().
url = "https://www.happyboxstudio.com/shop"
r = requests.get(url=url)
(
    count_presence(text=r.text, pattern=r"opening soon"), # 1
    count_presence(text=r.text, pattern=r"coming soon"), # 0
    count_presence(text=r.text, pattern=r"under construction") # 0
) # (1, 0, 0)
A.2.
I have created another convenience function, get_counts(), that returns the counts for all three search strings.
patterns = ["opening soon", "coming soon", "under construction"]
columns = [re.sub(r"\s+", "_", str(x)) for x in patterns]
## If you want to use your own column names
## uncomment the following line. ⚡
# columns = ["openingSoon", "comingSoon", "underConstruction"]
# Using the url
get_counts(text=r.text, patterns=patterns, columns=columns)
# {'coming_soon': 0, 'opening_soon': 1, 'under_construction': 0}
# Using the text (C.2) in the dummy section
get_counts(text=text, patterns=patterns, columns=columns)
# {'coming_soon': 3, 'opening_soon': 2, 'under_construction': 2}
A.3.
⚡ Since you are using a dataframe with one URL in each row, you could use .apply() on the dataframe and concatenate the result with the original dataframe as follows.
dfcounts = df.apply(
    lambda row: pd.Series(get_counts(
        text=requests.get(url=row.url).text.lower(),
        patterns=patterns,
        columns=columns
    )),
    axis=1
)
df2 = pd.concat([df, dfcounts], axis=1)
# print(df2.to_markdown())
Output:
Possible improvements ⚡⚡
You could consider parallelizing this, since the calls to each website are independent of one another. I will leave that as a fun exercise for you.
See: Make Pandas DataFrame apply() use all cores?
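As a possible starting point, here is a minimal sketch using a thread pool, which suits I/O-bound work such as HTTP requests. The names `fetch_text` and `count_many` are hypothetical helpers, not part of the code in section B. To stay dependency-free, the sketch downloads pages with `urllib.request` from the standard library instead of requests; pass your own `fetch` function (for example one that returns `requests.get(url).text`) to swap that out.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it (assumption: UTF-8 is acceptable)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def count_many(urls, patterns, fetch=fetch_text, max_workers=8):
    """Fetch all URLs concurrently and count each pattern in each page."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the same order as `urls`
        pages = list(pool.map(fetch, urls))
    return [
        {p: len(re.findall(p, page, flags=re.IGNORECASE)) for p in patterns}
        for page in pages
    ]
```

The result is a list of dicts, one per URL, so `pd.DataFrame(count_many(df["url"], patterns))` can be concatenated with the original dataframe in the same way as `dfcounts` above.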
A.4.
How to convert the counts to + or - signs
The columns xA, xB, xC correspond to the counts in columns A, B, C.
# Make some dummy data (dfx) which only has the counts
dfx = pd.DataFrame(
    np.random.randint(low=0, high=3, size=(6,3)),
    columns=list('ABC')
)
# Convert the counts to + or - signs and then concatenate to dfx
dfy = pd.concat([
        dfx,
        pd.DataFrame(
            np.where(dfx.values > 0, '+', '-'),
            columns=[f'x{x}' for x in dfx.columns])
    ],
    axis=1
)
print(dfy.to_markdown())
Output:
|    |   A |   B |   C | xA   | xB   | xC   |
|---:|----:|----:|----:|:-----|:-----|:-----|
|  0 |   0 |   1 |   1 | -    | +    | +    |
|  1 |   0 |   1 |   0 | -    | +    | -    |
|  2 |   1 |   1 |   0 | +    | +    | -    |
|  3 |   1 |   2 |   2 | +    | +    | +    |
|  4 |   0 |   2 |   1 | -    | +    | +    |
|  5 |   2 |   2 |   2 | +    | +    | +    |
B. Code ⚡⚡
import re
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests
from typing import Dict, List, Optional, Union


def count_presence(text: str, pattern: str = "opening soon", ignore_case: bool = True) -> int:
    """Count the non-overlapping matches of a regex pattern in text."""
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern=pattern, string=text, flags=flags))


def get_counts(
    text: str,
    patterns: Union[str, List[str]] = ["opening soon", "coming soon", "under construction"],
    columns: Optional[List[str]] = None,
    ignore_case: bool = True
) -> Dict[str, int]:
    """Return a dict mapping each column name to the count of its pattern."""
    results = dict()
    if isinstance(patterns, str):
        patterns = [patterns]
    if isinstance(columns, str):
        columns = [columns]
    if columns:
        # If columns is provided, its length must match
        # the length of patterns; otherwise it is ignored
        if len(columns) != len(patterns):
            columns = None
    if columns is None:
        # Derive column names from the patterns
        columns = [re.sub(r"\s+", "_", str(x)) for x in patterns]
    for pattern, column in zip(patterns, columns):
        results.update({column: count_presence(
            text=text, pattern=pattern, ignore_case=ignore_case)})
    return results
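One thing to keep in mind: the patterns are handed to re.findall(), so they are interpreted as regular expressions. A phrase containing characters such as `(` or `.` may therefore not match literally, and a phrase split by a line break or several spaces in the page source will be missed. The following is a minimal sketch of one way to handle both; `literal_pattern` is a hypothetical helper, not part of the code above.

```python
import re


def literal_pattern(phrase: str) -> str:
    """Escape each word and allow any run of whitespace between words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())


sample = "Opening   soon!\nPrice (USD): 5. price (usd): 7"

print(len(re.findall(literal_pattern("opening soon"), sample, flags=re.IGNORECASE)))  # 1
print(len(re.findall(literal_pattern("price (usd)"), sample, flags=re.IGNORECASE)))   # 2
```

The returned string can be passed as the `pattern` argument of count_presence() unchanged.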
C. Dummy Data
C.1.
DataFrame with URLs
url = "https://www.happyboxstudio.com/shop"
df = pd.DataFrame(index=np.arange(5))
df['url'] = url
print(df)
# url
# 0 https://www.happyboxstudio.com/shop
# 1 https://www.happyboxstudio.com/shop
# 2 https://www.happyboxstudio.com/shop
# 3 https://www.happyboxstudio.com/shop
# 4 https://www.happyboxstudio.com/shop
C.2.
Sample Text with Keywords/Patterns of Interest
The following text has:
- "coming soon": 3
- "opening soon": 2
- "under construction": 2
text = """
abcd coming soon.
...
123 "414-898-7667"
under construction. Bill Gates. Some other text. MSFT, AAPL, AMZN. Coming soon.
Opening later. Or may be opening soon.
Are you sure? Opening soon?
I bet they are still under construction. The builder said that the truck is coming Soon.
"""
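The stated counts can be checked with the standard library alone. This sketch repeats the sample text so that it runs on its own, and also shows that for fixed phrases (no regex features needed) a plain substring count on the lower-cased text gives the same numbers.

```python
import re

# Same sample text as in C.2
text = """
abcd coming soon.
...
123 "414-898-7667"
under construction. Bill Gates. Some other text. MSFT, AAPL, AMZN. Coming soon.
Opening later. Or may be opening soon.
Are you sure? Opening soon?
I bet they are still under construction. The builder said that the truck is coming Soon.
"""

phrases = ["coming soon", "opening soon", "under construction"]

# Regex count, case-insensitive
regex_counts = {p: len(re.findall(p, text, flags=re.IGNORECASE)) for p in phrases}

# Plain substring count on the lower-cased text; enough for fixed phrases
plain_counts = {p: text.lower().count(p) for p in phrases}

print(regex_counts)  # {'coming soon': 3, 'opening soon': 2, 'under construction': 2}
print(plain_counts == regex_counts)  # True
```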
