My Python script:

  • visits URLs from an Excel file
  • extracts version information present on the webpage
  • compares the extracted version with the version mentioned in the Excel file.

It creates a new file with an additional column 'latest version'. If the versions are the same, it returns 'Same' in that column; otherwise it returns the extracted version. But it is returning '8' in every row of 'latest version'.

Here is my function:

import re

import requests
from bs4 import BeautifulSoup

def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract version information using regular expressions
    version_pattern = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')
    match = version_pattern.search(str(soup))
    if match:
        extracted_version = match.group()
        if str(extracted_version) == str(current_version):
            return 'Same'
        else:
            return extracted_version
    else:
        return ''

Here are a few URLs with their versions as stated in my Excel file:

modyolo.com/lords-mobile.html, 2.97 
apkmody.io/games/zombie-frontier-3-mod-apk, 2.52 
modyolo.com/car-mechanic-simulator-21.html, 2.1.63 
modyolo.com/roblox-2.html, 2.564.424c

I tried:

  • writing \d differently, for example as [0-9]
  • replacing + with {1,}
  • adding a ^ at the beginning of my regex

but it always gave the same output of 8 or it returned nothing in my latest version column (my third attempt).

How can I scrape the version information from these sites?

tripleee
No One
  • Do you maybe have an example of a few URLs it visits? And what debugging steps have you taken so far? You mention changing the regex and that it kept returning incorrect outputs; could you edit your question to include what regexes you've tried and what their incorrect output was? – Saaru Lindestøkke Mar 08 '23 at 11:13
  • These are some of the URLs and versions in my input file: https://modyolo.com/lords-mobile.html 2.97, https://apkmody.io/games/zombie-frontier-3-mod-apk 2.52, https://modyolo.com/car-mechanic-simulator-21.html 2.1.63, https://modyolo.com/roblox-2.html 2.564.424c – No One Mar 08 '23 at 11:18
  • I tried writing \d differently, for example as [0-9], and replaced + with {1,}, but it gave the same output as mentioned in the question. I also added a ^ at the beginning of my regex, and that returned nothing in my latest version column. – No One Mar 08 '23 at 11:22
  • Maybe relevant: https://stackoverflow.com/a/1732454/1256347 – Saaru Lindestøkke Mar 08 '23 at 15:45

1 Answer

In the example URLs you've posted, the webpage contains an element <script type="application/ld+json">. That element contains a neat JSON of all the info you need, e.g. on https://modyolo.com/roblox-2.html:

<script type="application/ld+json">
    {
        "@context": "https://schema.org/",
        "@type": "SoftwareApplication",
        "name": "Roblox",
        "applicationCategory": "GameApplication",
        "operatingSystem": "Android",
        "softwareVersion": "2.564.444",
        "offers": {
            "@type": "Offer",
            "price": "0",
            "priceCurrency": "USD"
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "bestRating": 5,
            "worstRating": 1,
            "ratingCount": 856,
            "ratingValue": 4.1
        }
    }
</script>

So, my approach would be to first filter out that element from the soup, and then extract the version info from there:

import json

import requests
from bs4 import BeautifulSoup

def extract_version(url, current_version):
    # Make HTTP request to URL
    response = requests.get(url)
    # Parse HTML content of webpage
    soup = BeautifulSoup(response.content, 'html.parser')
    # Only get <script> tags of that specific type
    results = soup.find_all("script", {"type": "application/ld+json"})
    # Keep only the tags that have that attribute and no others
    result = [x for x in results if x.attrs == {'type': 'application/ld+json'}]
    if not result:
        # No structured data on this page
        return ''
    # Translate the scraped data to a dictionary
    data = json.loads(result[0].get_text())
    # Extract version information by getting the right key
    extracted_version = data['softwareVersion']
    # Compare with the Excel value, as in the function from the question
    if str(extracted_version) == str(current_version):
        return 'Same'
    return extracted_version

You might need to try different keys to get the software version. It's softwareVersion in this example, but it might be something slightly different on other websites.
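Trying several keys in turn could look roughly like the sketch below. Only `softwareVersion` comes from the example page above; the other names in `CANDIDATE_KEYS` and the helper `find_version` are hypothetical and would need to be checked against each site's actual markup.

```python
import json

# Only "softwareVersion" is confirmed by the example page above;
# the remaining keys are guesses and must be verified per site.
CANDIDATE_KEYS = ("softwareVersion", "version", "appVersion")


def find_version(ld_json_text):
    """Return the value of the first candidate key found, as a string, or ''."""
    try:
        data = json.loads(ld_json_text)
    except json.JSONDecodeError:
        return ''
    # A JSON-LD block may hold a single object or a list of objects
    items = data if isinstance(data, list) else [data]
    for item in items:
        if not isinstance(item, dict):
            continue
        for key in CANDIDATE_KEYS:
            if key in item:
                return str(item[key])
    return ''


sample = '{"@type": "SoftwareApplication", "name": "Roblox", "softwareVersion": "2.564.444"}'
print(find_version(sample))
```

The text passed in would be the content of the `application/ld+json` script tag. An empty string signals that none of the keys was present, so the caller can decide what to do for that site.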

Saaru Lindestøkke
  • Nah, using robust methods to parse and extract data from known structured formats is so boring. Better to hack together shorter regular expressions, and build a code base on a house of cards stuck together with tape, hopes and prayers. – MatBailie Mar 08 '23 at 12:03
  • The input file has many URLs from different websites, so this would not work. – No One Mar 08 '23 at 12:35
  • @NoOne How many different websites? 10? 20? 500? – Saaru Lindestøkke Mar 08 '23 at 12:48
  • Around 30 to 40. It is for a friend; he says he will keep changing the data from time to time. – No One Mar 08 '23 at 13:34
  • To be honest, it sounds doable to check whether 30-40 websites have this (or another) nice underlying structure, and to fall back to a regex only if they don't. You would need to create a custom solution per site type anyway, as a single regex would not be able to parse the version number of all these different sites. – Saaru Lindestøkke Mar 08 '23 at 14:04
  • I don't see why my regex is not working in the first place. It matches almost all of the version formats, or at least the ones that I am using for testing. – No One Mar 08 '23 at 15:35
  • Sure, the regex matches all of the version formats, but also many, many other things. Have you looked at how the data in your "soup" variable looks? [Here I've placed the contents of the "soup" variable](https://regex101.com/r/1Lboul/1) of one URL together with your regex. It has more than 7.2k matches. It immediately shows why you often get `8` as a result: the regex matches the `8` from the line `` – Saaru Lindestøkke Mar 08 '23 at 15:43
  • What's your end goal anyway? Why do you need to scrape software version numbers? – Saaru Lindestøkke Mar 08 '23 at 15:47
  • I see. I'm doing it for a mate. I never asked him what he's up to. – No One Mar 08 '23 at 15:57
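Two points from the comments above can be sketched together: `search` returns the first match anywhere in the page, and a lone digit already satisfies the question's pattern; and a per-page strategy can try the structured data first and fall back to a regex only when it is missing. The HTML in `SAMPLE_HTML` is invented for illustration and is not the content of any of the listed pages. The fallback pattern `VERSION_NEAR_LABEL` and the names `LdJsonCollector` and `version_from_html` are assumptions. The standard library's `HTMLParser` stands in for BeautifulSoup here so that the sketch runs without third-party packages.

```python
import json
import re
from html.parser import HTMLParser

# Invented page for illustration; not copied from any of the listed sites.
SAMPLE_HTML = """<html><head>
<meta charset="UTF-8">
<script type="application/ld+json">
{"@type": "SoftwareApplication", "softwareVersion": "2.564.444"}
</script>
</head><body><p>Version 2.564.444</p></body></html>"""

# The pattern from the question: a lone digit is already a full match.
QUESTION_PATTERN = re.compile(r'\d+(?:\.\d+)*[a-zA-Z]*')
# Assumed fallback: only accept a dotted number that follows the word "version".
VERSION_NEAR_LABEL = re.compile(
    r'version\W{0,5}(\d+(?:\.\d+)+[a-zA-Z]*)', re.IGNORECASE
)


class LdJsonCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data


def version_from_html(html):
    """Try structured data first, then fall back to the labelled regex."""
    collector = LdJsonCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and "softwareVersion" in data:
            return str(data["softwareVersion"])
    match = VERSION_NEAR_LABEL.search(html)
    return match.group(1) if match else ''


# The question's pattern stops at the first digit run in the page
print(QUESTION_PATTERN.search(SAMPLE_HTML).group())
# Structured data first, labelled regex as a fallback
print(version_from_html(SAMPLE_HTML))
```

On this sample the question's pattern stops at the `8` in `UTF-8`, long before any version string appears. The fallback regex is only one possible heuristic; sites that label the version differently would need their own pattern.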