I am scraping files from a website, and want to rename those files based on existing directory names on my computer (or if simpler, a list containing those directory names). This is to maintain a consistent naming convention.

For example, I already have directories named:

Barone Capital Management, Gabagool Alternative Investments, Aprile Asset Management, Webistics Investments

The scraped data consists of some exact matches, some "fuzzy" matches, and some new values:

Barone, Gabagool LLC, Aprile Asset Management, New Name, Webistics Investments

I want the scraped files to adopt the naming convention of the existing directories. For example, Barone would become Barone Capital Management, and Gabagool LLC would be renamed Gabagool Alternative Investments.

So what's the best way to accomplish this? I looked at fuzzywuzzy and some other libraries, but I'm not sure which is the right path.

This is my existing code which just names the file based on the anchor:

import praw
import requests
from bs4 import BeautifulSoup
import urllib.request

url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

#letter_urls = []
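for anchor in table.find_all('a'):
    try:
        if not anchor:
            continue
        fund_name = anchor.text
        letter_link = anchor['href']
        # Save each letter as e.g. "2018 Q4 Barone.pdf"
        urllib.request.urlretrieve(letter_link, '2018 Q4 ' + fund_name + '.pdf')
    except Exception as exc:
        # Report failed downloads instead of silently ignoring them
        print('Skipping', anchor, exc)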

Note that the directories are already created, and look something like this:

 - /Users/user/Dropbox/Letters/Barone Capital Management
 - /Users/user/Dropbox/Letters/Aprile Asset Management
 - /Users/user/Dropbox/Letters/Webistics Investments
 - /Users/user/Dropbox/Letters/Gabagool Alternative Investments
 - /Users/user/Dropbox/Letters/Ro Capital
 - /Users/user/Dropbox/Letters/Vitoon Capital
user53526356
  • fuzzywuzzy looks interesting, thanks! Have a very similar problem. – Kai Aeberli Mar 29 '19 at 20:32
  • Windows, Linux, or Mac? Secondly, how are you determining which folder you want them to go into? I'm not seeing the list of download directories in your code. Thirdly, do you already have the download directories made? – FailSafe Mar 29 '19 at 20:45
  • 1. Mac 2. I haven't determined that yet, as I figured it would be more suitable for a separate post. But obviously I would love for the scraped files to find the matched directory, get renamed, and then be moved to that directory. 3. Yes, they are already made. – user53526356 Mar 29 '19 at 21:26

2 Answers

1

As discussed in "Python: find closest string (from a list) to another string", you can use difflib.get_close_matches (https://docs.python.org/3/library/difflib.html#difflib.get_close_matches) to find the most similar string within a list. Your list of candidates would be the folder names taken from the absolute paths you already have:

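from difflib import get_close_matches

# candidates: list of the existing directory names
best_options = get_close_matches(fund_name, candidates, n=1)

if best_options:
    directory = best_options[0]
else:
    # No sufficiently close match: treat it as a new name
    directory = 'New Name'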
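
A minimal sketch of the same idea applied to the names from the question. The helper name `match_directory` and the `cutoff=0.35` default are assumptions chosen for this sample data, not recommended values. The sketch compares against the folder names only, because the shared path prefix in an absolute path lowers every similarity score.

```python
import os
from difflib import get_close_matches


def match_directory(fund_name, directories, cutoff=0.35):
    """Return the existing directory that best matches fund_name, or None.

    cutoff=0.35 is an assumed value tuned to the sample names below.
    """
    # Map bare folder name -> full path, so matching ignores the path prefix.
    by_name = {os.path.basename(os.path.normpath(d)): d for d in directories}
    best = get_close_matches(fund_name, list(by_name), n=1, cutoff=cutoff)
    return by_name[best[0]] if best else None


directories = [
    "/Users/user/Dropbox/Letters/Barone Capital Management",
    "/Users/user/Dropbox/Letters/Aprile Asset Management",
    "/Users/user/Dropbox/Letters/Webistics Investments",
    "/Users/user/Dropbox/Letters/Gabagool Alternative Investments",
    "/Users/user/Dropbox/Letters/Ro Capital",
    "/Users/user/Dropbox/Letters/Vitoon Capital",
]

scraped_names = ["Barone", "Gabagool LLC", "Aprile Asset Management",
                 "New Name", "Webistics Investments"]

for scraped in scraped_names:
    print(scraped, "->", match_directory(scraped, directories))
```

With these sample names, Barone and Gabagool LLC both score below the default cutoff of 0.6, which is why they only match once the cutoff is lowered. New Name stays unmatched and comes back as None, so it can be handled as a new fund.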
Bravado
  • So ```candidates``` is the list of existing directories? And maybe I'm missing something obvious, but I keep getting ```name 'get_close_matches' is not defined``` even though I imported difflib. Is there another module I need to import too? Sorry if a dumb question, as I'm very new to this. – user53526356 Mar 30 '19 at 02:01
  • You have to add "from difflib import get_close_matches". – SanV Mar 30 '19 at 03:55
  • I'm not sure that this works, as it's only matching against exact names and not close ones. I set ```candidates = [ ]``` to the existing directories, and then ```best_options = get_close_matches(fund_name, candidates, n=1)```. If I ```print(best_options)```, the only ones that are printed are the ones with an exact match in ```candidates``` (so ```['/Users/user/Desktop/Test/Aprile Asset Management']``` and ```['/Users/user/Desktop/Test/Webistics Investments']```). – user53526356 Mar 30 '19 at 13:54
  • That's what I thought as well, based on the examples of difflib.get_close_matches. Hence, I proposed the approach of looking at the unique words in the scraped names and ignoring generic words such as Investments, Assets, Management, etc. – SanV Mar 30 '19 at 14:31
  • @MSD the problem is that the spaces mess up the match. Remove the spaces from both the candidates and the name of the file and compare them like that. You will have the space-less version of the best directory, but by using the index it has in the list of candidates you can easily get the original name. If you still don't get any result, use the 'cutoff' parameter to relax the precision of the algorithm, because by default the similarity must be at least 0.6: difflib.get_close_matches('asset123', ['AssetManagement', 'WebisticsInvestments'], n=1, cutoff=0.3) – Bravado Mar 31 '19 at 19:32
  • Actually, just relaxing the precision gives some results for the examples I tried. Just check for yourself. – Bravado Mar 31 '19 at 19:33
  • Relaxing the cutoff does seem to work better (still not 100% though). Maybe I would have better success with creating a space-less version like you suggested. But I'm still not quite clear on how to rename the existing file to the matched directory name. You mention index position, but is that reliable if the list of ```candidates``` and ```fund_names``` will change over time? – user53526356 Apr 01 '19 at 02:41
  • I think I'm close. I use ```glob``` and ```os.rename``` to rename all the files in the directory to something like 1.pdf, 2.pdf, 3.pdf, etc. But instead of just incrementing the file name, how do I pass in the ```best_option``` to os.rename when it can only take a string? Right now it looks like ```os.rename(file, '/Users/derajfast/Desktop/Python/Mine Safety Disclosures/downloaded_files/' + str(i) + ".pdf") i += 1```, but instead of ```str(i)``` it should be the name of the ```best_option```. – user53526356 Apr 02 '19 at 15:37
0

Got it working:

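best_options = get_close_matches(fund_name, candidates, n=1, cutoff=.5)

try:
    if best_options:
        # Path of the file as it was downloaded
        fund_name = downloads_folder + period + " " + fund_name + ".pdf"
        # Rename it after the closest existing directory name
        os.rename(fund_name, downloads_folder + period + " " + best_options[0] + ".pdf")
except OSError:
    # Skip files that cannot be renamed (e.g. the download is missing)
    pass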
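
A possible next step, sketched under assumptions: the comments mention wanting each renamed file moved into its matched directory. The function name `file_letter`, the `letters_root` parameter, and the `cutoff=0.35` default are all hypothetical. The candidate folders are read from disk on every call, so the match does not depend on list positions that could change over time.

```python
import shutil
from difflib import get_close_matches
from pathlib import Path


def file_letter(pdf_path, fund_name, letters_root, period="2018 Q4", cutoff=0.35):
    """Rename pdf_path after the closest existing folder and move it there.

    Returns the new path, or None if no folder name is similar enough.
    cutoff=0.35 is an assumed value, not a recommendation.
    """
    pdf_path = Path(pdf_path)
    # Folder name -> folder path, rebuilt from disk on each call.
    folders = {p.name: p for p in Path(letters_root).iterdir() if p.is_dir()}
    best = get_close_matches(fund_name, list(folders), n=1, cutoff=cutoff)
    if not best:
        return None  # leave unmatched downloads where they are
    target = folders[best[0]] / f"{period} {best[0]}.pdf"
    shutil.move(str(pdf_path), str(target))
    return target
```

For example, `file_letter("downloads/2018 Q4 Barone.pdf", "Barone", "/Users/user/Dropbox/Letters")` would move the file into the Barone Capital Management folder, assuming that folder exists and is the closest match. Files for which the function returns None stay in place and can be reviewed by hand.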
user53526356