How to crawl website body contents in order to confirm if string exists

Question

I have a CSV file with a bunch of websites, and I would like to check whether each one has strings like 'Opening Soon', 'Coming Soon', or 'Under Construction' in its body content. The check should flag each of those strings regardless of letter case.

My code so far:

df = pd.read_csv('/path/to/myScanResults.csv') 

openingSoon = []
comingSoon = []
underConstruction = []

for url in df['Urls']:
    r = requests.get(url, verify=False)
    soup = BeautifulSoup(r.content, 'html.parser')
    if url in (soup.find_all(True,text=re.compile(r'Opening Soon', re.I))):
        openingSoon.append("+")
    else:
        openingSoon.append("-")
    if url in (soup.find_all(True,text=re.compile(r'coming soon', re.I))):
        comingSoon.append("+")
    else:
        comingSoon.append("-")
    if url in (soup.find_all(True,text=re.compile(r'under construction', re.I))):
        underConstruction.append("+")
    else:
        underConstruction.append("-")

df["openingSoon"] = openingSoon
df["comingSoon"] = comingSoon 
df["underConstruction"] = underConstruction

However, every appended list ends up containing only '-', even though I'm scanning a page that contains the 'opening soon' string: https://www.happyboxstudio.com/shop.

My output:

>>> underConstruction
['-', '-', '-', '-', '-', '-', '-', '-']

Why are you checking if `url` is in the list of elements returned by `soup.find_all(...)`? Perhaps you need, e.g., `if soup.find_all(True,text=re.compile(r'under construction', re.I)):` instead. — , Dec 10 '20 at 18:22

CypherX · Answer 1 · 2020-12-10T23:57:36.907

Solution

Long story short:

import re
import requests

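url = "https://www.happyboxstudio.com/shop"
pattern = "opening soon"
# Count how many times the pattern occurs in the page at `url`
# (re.IGNORECASE makes the match case-insensitive)
len(re.findall(
   pattern=pattern, 
   string=requests.get(url).text, 
   flags=re.IGNORECASE
))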

Problem in your approach:

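The test `url in soup.find_all(...)` asks whether the URL string is one of the matched elements, which is never the case, so every list gets '-'. In addition, `find_all()` as you call it returns multiple results for the same occurrence of a given keyword/search pattern, because a parent tag that wraps nothing but the matching element also matches. Try the following: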

url = "https://www.happyboxstudio.com/shop"
r = requests.get(url=url)
soup = BeautifulSoup(r.content, 'html.parser')
results = soup.find_all(True,text=re.compile(r'Opening Soon', re.I))
print(f'Total matches: {len(results)}') # Total matches: 3

# This shows that there is duplication 
# for the same single occurrence
#    the 3rd result is present in the 2nd result
#    the 2nd result is present in the 1st result
(results[1] in results[0], results[2] in results[1]) # (True, True)
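
As a minimal sketch of a direct fix for the loop in the question, the check can test whether each phrase occurs in the page's text rather than testing `url in ...`. The helper name `flag_phrases` and the inline HTML are hypothetical, chosen so the example runs without network access; in the real loop you would pass `r.content` or `r.text` instead.

```python
import re
from bs4 import BeautifulSoup

PHRASES = ["opening soon", "coming soon", "under construction"]

def flag_phrases(html, phrases=PHRASES):
    """Return {phrase: '+' or '-'} depending on whether the phrase occurs in the page text."""
    # get_text() joins all text nodes into one string, so nested tags
    # cannot report the same occurrence more than once.
    page_text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return {
        phrase: "+" if re.search(re.escape(phrase), page_text, re.IGNORECASE) else "-"
        for phrase in phrases
    }

# Hypothetical page: the phrase sits inside three nested tags.
sample_html = "<html><body><div><p><span>Opening Soon</span></p></div></body></html>"
flags = flag_phrases(sample_html)
print(flags)
```

Each value can then be appended to the matching list (`openingSoon`, `comingSoon`, `underConstruction`) inside the loop.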

A. Suggested Solution

I suggest using requests.get(url).text as the string to search. Here is one way to do it, using the data from the Dummy Data section (C) below.

`A.1.`

This shows how to count occurrences with the convenience function count_presence(), defined in section B.

url = "https://www.happyboxstudio.com/shop"
r = requests.get(url=url)
(
    count_presence(text=r.text, pattern=r"opening soon"), # 1
    count_presence(text=r.text, pattern=r"coming soon"), # 0
    count_presence(text=r.text, pattern=r"under construction") # 0
) # (1, 0, 0)
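
One caveat with a literal pattern such as `opening soon`: in real HTML the two words may be separated by a line break, several spaces, or a `&nbsp;` entity, and a single-space pattern would miss those. The sketch below is an addition, not part of the original answer; it builds a whitespace-tolerant regex, and `phrase_pattern` and `count_phrase` are hypothetical helper names.

```python
import html
import re

def phrase_pattern(phrase):
    """Compile a case-insensitive regex that allows any whitespace run between words."""
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\s+".join(words), re.IGNORECASE)

def count_phrase(text, phrase):
    """Count occurrences of phrase in text, tolerating varied whitespace."""
    # html.unescape turns entities into real characters (&nbsp; becomes U+00A0),
    # and \s matches U+00A0 when the pattern is a str.
    return len(phrase_pattern(phrase).findall(html.unescape(text)))

sample = "<h1>Opening\n   Soon</h1><p>opening&nbsp;soon</p><p>coming soon</p>"
print(count_phrase(sample, "opening soon"))        # 2
print(count_phrase(sample, "coming soon"))         # 1
print(count_phrase(sample, "under construction"))  # 0
```

Escaping each word with `re.escape` also keeps characters such as `.` or `?` in a phrase from being read as regex syntax.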

`A.2.`

I have created another convenience function get_counts() that you can use to get counts for all three search-strings.

patterns = ["opening soon", "coming soon", "under construction"]
columns = [re.sub(r"\s+", "_", str(x)) for x in patterns]

## If you want to use your own column names
##    uncomment the following line. ⚡
# columns = ["openingSoon", "comingSoon", "underConstruction"]

# Using the url
get_counts(text=r.text, patterns=patterns, columns=columns)
# {'coming_soon': 0, 'opening_soon': 1, 'under_construction': 0}

# Using the text (C.2) in the dummy section
get_counts(text=text, patterns=patterns, columns=columns)
# {'coming_soon': 3, 'opening_soon': 2, 'under_construction': 2}

`A.3.`

⚡ Since you are using a dataframe with one URL in each row, you could use .apply() on the dataframe and concatenate the result with the original dataframe as follows.

dfcounts = df.apply(lambda row: pd.Series(get_counts(
        text=requests.get(url=row.url).text.lower(), 
        patterns=patterns, 
        columns=columns)
    ), 
    axis=1
)
df2 = pd.concat([df, dfcounts], axis=1)
# print(df2.to_markdown())

Output:

	url	opening_soon
0	https://www.happyboxstudio.com/shop	1
1	https://www.happyboxstudio.com/shop	1
2	https://www.happyboxstudio.com/shop	1
3	https://www.happyboxstudio.com/shop	1
4	https://www.happyboxstudio.com/shop	1

Possible improvements ⚡⚡

You could consider parallelizing this, since the calls to each website are independent of one another. But I will leave that as a fun endeavor for you. See: Make Pandas DataFrame apply() use all cores?
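
A rough sketch of one way to parallelize the scan, assuming a thread pool is acceptable because the work is dominated by waiting on the network. The names `scan_urls`, `count_patterns`, and the `fetch` parameter are hypothetical, and the fake pages stand in for real downloads so the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor

PATTERNS = ["opening soon", "coming soon", "under construction"]

def count_patterns(text, patterns=PATTERNS):
    """Case-insensitive occurrence count for each pattern in text."""
    return {p: len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns}

def scan_urls(urls, fetch, max_workers=8):
    """Fetch every URL in a thread pool and count the patterns in each page.

    `fetch` is any callable mapping a URL to page text. A failed fetch is
    recorded as None instead of aborting the whole scan.
    """
    def scan_one(url):
        try:
            return count_patterns(fetch(url))
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls
        return list(pool.map(scan_one, urls))

# Stand-in for a real downloader (assumption: not real site content).
fake_pages = {
    "https://example.com/a": "Opening soon! opening SOON.",
    "https://example.com/b": "Site under construction",
}
results = scan_urls(list(fake_pages), fetch=fake_pages.__getitem__)
print(results)
```

With real pages, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`; entries that come back as `None` would need to be handled before the counts are joined to the dataframe.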

`A.4.` How to convert the counts to `+` or `-` signs

The columns xA, xB, xC correspond to the counts in columns A, B, C.

# Make some dummy data (dfx) which only has the counts
dfx = pd.DataFrame(
    np.random.randint(low=0, high=3, size=(6,3)), 
    columns=list('ABC')
)
# Convert the counts to + or - signs and then concatenate to dfx
dfy = pd.concat([
        dfx, 
        pd.DataFrame(
            np.where(dfx.values > 0, '+', '-'), 
            columns=[f'x{x}' for x in dfx.columns])
    ], 
    axis=1
)
print(dfy.to_markdown())

Output:

	A	B	C	xA	xB	xC
0	0	1	1	-	+	+
1	0	1	0	-	+	-
2	1	1	0	+	+	-
3	1	2	2	+	+	+
4	0	2	1	-	+	+
5	2	2	2	+	+	+

B. Code ⚡⚡

import re
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests
from typing import Union, List

def count_presence(text: str, pattern: str = "opening soon", ignore_case: bool=True):
    """Return how many times the regex `pattern` occurs in `text`."""
    # re.IGNORECASE already handles case, so the text need not be lowercased
    flags = re.IGNORECASE if ignore_case else 0
    result = re.findall(pattern=pattern, string=text, flags=flags)

    return len(result)

def get_counts(
    text: str, 
    patterns: Union[str, List[str]]=["opening soon", "coming soon", "under construction"], 
    columns: List[str]=None, 
    ignore_case: bool=True
    ):
    results = dict()
    if isinstance(patterns, str):
        patterns = [patterns]
    if isinstance(columns, str):
        columns = [columns]
    if columns:
        # If columns is provided, the length 
        # must match the length of patterns
        if len(columns) != len(patterns):
            columns = None
    if columns is None:
        columns = [re.sub(r"\s+", "_", str(x)) for x in patterns]
    for pattern, column in zip(patterns, columns):
        # Pass ignore_case through so the caller's setting takes effect
        results.update({column: count_presence(text=text, pattern=pattern, ignore_case=ignore_case)})

    return results

C. Dummy Data

`C.1.` DataFrame with URLs

url = "https://www.happyboxstudio.com/shop"
df = pd.DataFrame(index=np.arange(5))
df['url'] = url
print(df)

#                                    url
# 0  https://www.happyboxstudio.com/shop
# 1  https://www.happyboxstudio.com/shop
# 2  https://www.happyboxstudio.com/shop
# 3  https://www.happyboxstudio.com/shop
# 4  https://www.happyboxstudio.com/shop

`C.2.` Sample Text with Keywords/Patterns of Interest

The following text has:

"coming soon": 3
"opening soon": 2
"under construction": 2

text = """
abcd coming soon.
...
123 "414-898-7667"
under construction. Bill Gates. Some other text. MSFT, AAPL, AMZN. Coming soon.
Opening later. Or may be opening soon.

Are you sure? Opening soon?

I bet they are still under construction. The builder said that the truck is coming Soon.
"""

Hi CypherX, I've changed `x` to `url` in line for `x in df['Urls']:`. However, it then throws error `File "", line 4, in NameError: name 'x' is not defined`. So I updated other `x` values to `url` in `if url in (soup.find_all(True,text=re.compile(r'Opening Soon', re.I))):` instances, but it continues to generate results only with `'-' `instead of `'+' ` when the key word matches with string on the website. Would you be able to tell what else am I doing wrong? Updated the question. — Baobab1988, Dec 10 '20 at 18:06
I see. The problem is you did not share any dummy data or website or dummy html to present your case. I suggest that you share a dummy html as text (in a block of triple quotes) so people can provide their answers based on the sample data. Currently, it is not possible to test a solution for the problem you are facing. — CypherX, Dec 10 '20 at 18:30
One more thing: have you tested individually each line of your code first? — CypherX, Dec 10 '20 at 18:30
So, what is your logic? If "opening soon" is present `>=1` times, then you count 1 or do you count how many times "opening soon" appears in the page? — CypherX, Dec 10 '20 at 18:47
@Baobab1988 I have updated the answer to explain what was the problem in your code and what you could possibly do. Take a look at it and let me know if you have any questions. — CypherX, Dec 10 '20 at 22:27
