I am scraping files from a website, and want to rename those files based on existing directory names on my computer (or if simpler, a list containing those directory names). This is to maintain a consistent naming convention.
For example, I already have directories named:
Barone Capital Management, Gabagool Alternative Investments, Aprile Asset Management, Webistics Investments
The scraped data consists of some exact matches, some "fuzzy" matches, and some new values:
Barone, Gabagool LLC, Aprile Asset Management, New Name, Webistics Investments
I want the scraped files to adopt the naming convention of the existing directories. For example, "Barone" would become "Barone Capital Management", and "Gabagool LLC" would be renamed "Gabagool Alternative Investments".
So what's the best way to accomplish this? I looked at fuzzywuzzy and some other libraries, but I'm not sure which is the right path.
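One lightweight option, before reaching for fuzzywuzzy, is the standard library's `difflib.get_close_matches`, which scores candidates with `SequenceMatcher` ratios. A minimal sketch using the example names above (the 0.3 cutoff is an assumption to tune against real data):

```python
from difflib import get_close_matches

existing = [
    'Barone Capital Management',
    'Gabagool Alternative Investments',
    'Aprile Asset Management',
    'Webistics Investments',
]

def canonical_name(scraped, choices, cutoff=0.3):
    """Return the closest existing name, or the scraped name unchanged
    if nothing clears the cutoff (i.e. a genuinely new fund)."""
    matches = get_close_matches(scraped, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else scraped

for name in ['Barone', 'Gabagool LLC', 'Aprile Asset Management', 'New Name']:
    print(name, '->', canonical_name(name, existing))
```

One caveat: a short scraped name like "Barone" scores low against a long canonical name (the ratio is 2×matched characters over the combined length), so the cutoff has to be fairly permissive; fuzzywuzzy's `partial_ratio` handles that short-vs-long case better if the stdlib scoring proves too coarse.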
This is my existing code, which just names each file after its anchor text:
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.error

url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

for anchor in table.find_all('a'):
    fund_name = anchor.text
    letter_link = anchor.get('href')
    if not letter_link:
        continue
    try:
        urllib.request.urlretrieve(letter_link, '2018 Q4 ' + fund_name + '.pdf')
    except urllib.error.URLError:
        pass  # skip links that fail to download
Note that the directories are already created, and look something like this:
- /Users/user/Dropbox/Letters/Barone Capital Management
- /Users/user/Dropbox/Letters/Aprile Asset Management
- /Users/user/Dropbox/Letters/Webistics Investments
- /Users/user/Dropbox/Letters/Gabagool Alternative Investments
- /Users/user/Dropbox/Letters/Ro Capital
- /Users/user/Dropbox/Letters/Vitoon Capital