Using Python I'd like to scrape some information from a webpage and save the info to a .txt file named using the title of the page scraped.

Unfortunately, many page titles contain special characters that can't be used in file names, so ideally I want the cleaned-up title a browser suggests when you use File > Save.

Is it possible to achieve this with BeautifulSoup or Selenium?

I can get the page title with soup, and then clean it, but if there is a more efficient way of getting the browser-cleaned title I'd love to know how.

EDIT:

So far I have achieved a workable result with the following code. I used YouTube as an example but really would prefer an all-purpose page-title retrieval in browser save format if possible. Probably doesn't exist, but there's always hope.

import re
import mechanize

br = mechanize.Browser()
br.open("https://www.youtube.com/watch?v=RvCBzhhydNk")

title = re.sub('[^A-Za-z0-9]+', ' ', br.title().replace("YouTube", "")).strip()

print(title)
pglove
  • Selenium does not have that feature. You have to get the title, then write simple logic to drop the special characters or change them to something else. – AbiSaran Oct 02 '22 at 14:09
  • Having an example website and target would be very useful. – Rivered Oct 02 '22 at 15:41
  • ...you could use something like [uipath](https://dev.tutorialspoint.com/uipath/uipath_studio_recording.htm#) to have it nearly save a page and get the name, but that would be an insanely convoluted and inefficient process just to get a name – Driftr95 Oct 03 '22 at 16:25

1 Answer

I'm afraid I don't know of any "all-purpose page-title retrieval in browser save format", but what you're doing so far is not too bad (though I personally prefer the method suggested in this answer and its comments from @hardmooth and @AlexKrycek).

If you'll need it often, you can wrap it in a small function. If you'll be scraping sites other than YouTube as well, you can use something like urlparse (urllib.parse in Python 3) or tldextract to get the domain.

So something like:

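import re
import tldextract

def cleanPageTitle(origTitle, pageUrl):
    # e.g. "youtube" for "https://www.youtube.com/..."
    domain = tldextract.extract(pageUrl).domain
    # keep only letters, digits and a few filename-safe characters
    cleaned = "".join(
        x for x in origTitle if (x.isalnum() or x in "._- ")
    )
    # drop the site name regardless of case ("youtube" vs "YouTube")
    cleaned = re.sub(re.escape(domain), "", cleaned, flags=re.IGNORECASE)
    return cleaned.strip()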

which you can then call as:

title = cleanPageTitle(br.title(), "https://www.youtube.com/watch?v=RvCBzhhydNk")
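
If you'd rather not install tldextract, here is a minimal standard-library sketch of the same idea. The names `title_to_filename`, `FORBIDDEN` and `max_len` are made up for illustration, and the forbidden-character set is my assumption based on the characters Windows rejects in file names, not what any particular browser actually does. The site name is guessed from the second-to-last label of the host, which is a rough heuristic and will be wrong for hosts like bbc.co.uk.

```python
import re
from urllib.parse import urlparse

# Assumption: characters Windows rejects in file names, plus ASCII control
# characters. Linux and macOS are more permissive than this.
FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def title_to_filename(title, page_url=None, max_len=100):
    """Turn a page title into a conservative .txt file name (hypothetical helper)."""
    name = title
    if page_url:
        # Rough site name: second-to-last label of the host name.
        labels = (urlparse(page_url).hostname or "").split(".")
        site = labels[-2] if len(labels) >= 2 else ""
        if site:
            # Case-insensitive, so "youtube" also matches "YouTube".
            name = re.sub(re.escape(site), "", name, flags=re.IGNORECASE)
    # Replace forbidden characters with spaces instead of dropping them,
    # so neighbouring words don't get glued together.
    name = FORBIDDEN.sub(" ", name)
    # Collapse runs of whitespace, then trim spaces, dots and dashes at the ends.
    name = re.sub(r"\s+", " ", name).strip(" .-")
    # Fall back to a fixed name if nothing usable is left.
    return (name[:max_len].rstrip(" .") or "untitled") + ".txt"


if __name__ == "__main__":
    fname = title_to_filename(
        "Some Video: Part 1/2 - YouTube",
        "https://www.youtube.com/watch?v=RvCBzhhydNk",
    )
    print(fname)
```

Unlike the `isalnum()` filter above, this only replaces characters that are known to be a problem, so other punctuation such as parentheses survives. Which of the two is better depends on how strict you want the file names to be. You can pass the result straight to `open(fname, "w", encoding="utf-8")` to save the scraped text.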
Driftr95