My Python script:
- visits URLs from an Excel file
- extracts version information present on the webpage
- compares the extracted version with the version mentioned in the Excel file.
It creates a new file with an additional column, 'latest version'. If the versions are the same, it writes 'Same' in that column; otherwise it writes the extracted version. However, it is returning '8' in every row of 'latest version'.
Here is my function:
import re

import requests
from bs4 import BeautifulSoup

def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract version information using regular expressions
    version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')
    # search() returns the first match anywhere in the page's HTML
    match = version_pattern.search(str(soup))
    if match:
        extracted_version = match.group()
        if str(extracted_version) == str(current_version):
            return 'Same'
        else:
            return extracted_version
    else:
        return ''
Here are a few URLs with their versions as stated in my Excel file:
modyolo.com/lords-mobile.html, 2.97
apkmody.io/games/zombie-frontier-3-mod-apk, 2.52
modyolo.com/car-mechanic-simulator-21.html, 2.1.63
modyolo.com/roblox-2.html, 2.564.424c
I tried:
- writing \d differently, for example as [0-9]
- replacing + with {1,}
- adding a ^ at the beginning of my regex
but each time it either still returned 8 or, with the third attempt, left my 'latest version' column empty.
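To narrow it down, here is a minimal reproduction on a made-up HTML snippet. The `html` string is hypothetical and not copied from any of the sites above; I am only assuming that the real pages also contain something like charset="utf-8" near the top, which I have not confirmed.

```python
import re

# Same pattern as in extract_version above.
version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')

# Hypothetical stand-in for str(soup); the real pages will differ.
html = (
    '<html><head><meta charset="utf-8">'
    '<title>Lords Mobile</title></head>'
    '<body><span>Version 2.97</span></body></html>'
)

# search() returns the first match anywhere in the markup.
first = version_pattern.search(html).group()

# findall() lists every candidate in document order.
candidates = version_pattern.findall(html)

# With a leading ^ the match must start at index 0, where the string has '<'.
anchored = re.search(r'^\d+(?:\.\d+)*[a-zA-Z]*', html)

print(first)       # 8
print(candidates)  # ['8', '2.97']
print(anchored)    # None
```

On this snippet the first match is the 8 from utf-8, the version only shows up as a later candidate, and the ^ variant matches nothing. So the pattern itself seems able to find the version, but only if it is applied to a narrower piece of text than the whole document.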
How can I scrape the version information from these sites?