I have a dictionary that maps each company name to its Better Business Bureau (BBB) link. I also have a CSV file that pairs each BBB link with the phone number(s) for that company. I need to combine the two by matching on the BBB link associated with each company name.

My ultimate end goal is to have a dataframe that contains:

Company Name, Link, Phone Number(s)

DICTIONARY:

{'A. G. Builders, Inc.': 'https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923', 'A. R. Russell': 'https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691', 'A. R. Russell Builders, Inc.': 'https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691', 'A.C.A. Enterprises, LLC': 'https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401', 'A.D. Myers Builders, LLC': 'https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405', 'ABS Construction Group': 'https://www.bbb.org/us/nc/newport/profile/general-contractor/ab-building-remodeling-llc-0593-90293532', 'Absolute Construction Group, LLC': 'https://www.bbb.org/us/nc/durham/profile/home-improvement/absolute-construction-group-llc-0593-90282628'}

CODE:

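from random import randint
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By

# url_list is assumed to hold the BBB links, i.e. the values of the dictionary above
phone_list = []
url_with_phone = []

def phone_numbers():
    driver = webdriver.Chrome()
    for url in url_list:  # loop through the list of BBB links
        print(url)  # print the URL currently being visited
        driver.get(url)
        sleep(randint(4, 6))
        # find_elements_by_class_name was deprecated in Selenium 4 and later removed;
        # find_elements(By.CLASS_NAME, ...) is the current equivalent
        phone = driver.find_elements(By.CLASS_NAME, "dtm-phone")  # find phone numbers
        sleep(randint(4, 8))
        print('looking for number')
        for p in phone:
            results = p.text
            print(results)
            sleep(randint(3, 5))
            phone_list.append(results)  # add phone number to phone_list
            sleep(randint(5, 9))
            url_with_phone.append(url)  # record the URL alongside each phone number found
    driver.quit()  # close the browser when done

phone_numbers()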

CSV OUTPUT OF LINKS & PHONE NUMBERS:

URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405,(704) 737-8409

For example, the first row in the CSV file belongs to A. G. Builders, Inc. Is there a way to add the key from the dictionary (the company name) to the CSV by matching on the value (the link)?

I want to add the company name to the CSV. What would be the best way to do this? I have read the following questions and tried to adapt their solutions, but haven't had any luck: "append multiple values for one key in a dictionary" and "list to dictionary conversion with multiple values per key?".

VLAZ
sjuice10
  • Is `aca-enterprises-llc` a company name? – Red Jun 23 '20 at 23:49
  • Yes, it is. Is there something specific about that I should be aware of? – sjuice10 Jun 24 '20 at 00:38
  • Should the numbers be included in the names? – Red Jun 24 '20 at 00:42
  • The numbers in the web link should not be included. The names should match the names in the dictionary. For example: 'A.C.A. Enterprises, LLC': 'https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401' would have two entries, [(850) 248-0597, (850) 527-1767]. Both of those numbers belong to ACA Enterprises LLC. – sjuice10 Jun 24 '20 at 00:46
  • Is my answer what you're looking for? – Red Jun 24 '20 at 00:51
  • Why are you using selenium? You can do this with requests. I think you could simply use `defaultdict` with list from the collections module. – Umar.H Jun 24 '20 at 00:56
  • @Datanovice, valid question. I think I was just following on from the code I had written to gather the links from Google. I could have done this in requests. I am not familiar with `defaultdict` from the collections module. – sjuice10 Jun 24 '20 at 01:02
  • @AnnZen, yes it appears that will work! I figured extracting the names from the links would be an option. – sjuice10 Jun 24 '20 at 01:03
  • @AnnZen, I tried your solution in my code and it returned empty lists. I think I will need to learn more about defaultdict or regex, as I didn't include all results here because the post was too long. I greatly appreciate your time and explanation! – sjuice10 Jun 24 '20 at 01:30
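
The `defaultdict` suggestion from the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `company_links`, `csv_text`, `phones_by_url` and `phones_by_company` are made up for this example, the data is a shortened copy of the sample in the question, and the CSV is read from a string rather than a file.

```python
import csv
import io
from collections import defaultdict

# Shortened copy of the dictionary from the question (name -> BBB link).
company_links = {
    'A. G. Builders, Inc.': 'https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923',
    'A.C.A. Enterprises, LLC': 'https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401',
    'A.D. Myers Builders, LLC': 'https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405',
}

# Shortened copy of the CSV output from the question.
csv_text = '''URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405,(704) 737-8409
'''

# Collect every phone number under its URL; a missing key starts as an empty list.
phones_by_url = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    url, phone = row['URL Searched'], row['Phone Numbers']
    if phone not in phones_by_url[url]:  # skip repeated rows
        phones_by_url[url].append(phone)

# Look up each company's link; companies with no scraped number get an empty list.
phones_by_company = {
    name: phones_by_url.get(url, [])
    for name, url in company_links.items()
}

print(phones_by_company['A.C.A. Enterprises, LLC'])
```

Because the lookup goes through the full link, no name parsing is needed, and a company with several numbers simply ends up with a longer list.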

1 Answer

Here is how you can use `re` to extract the business name from each URL in the CSV text (the names come out in the hyphenated form used in the links):

import re

a = '''URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405,(704) 737-8409'''

print([re.findall(r'.*c(?=-\d)', n.split('/')[-1])[0] for n in a.split('\n')[1:]])

Output:

['ag-builders-inc',
 'russell-l-judy-builder-inc',
 'russell-l-judy-builder-inc',
 'aca-enterprises-llc',
 'aca-enterprises-llc',
 'meyer-builders-llc']

For a prettier presentation:

print([re.findall(r'.*c(?=-\d)', n.split('/')[-1])[0].replace('-', ' ').title() for n in a.split('\n')[1:]])

Output:

['Ag Builders Inc',
 'Russell L Judy Builder Inc',
 'Russell L Judy Builder Inc',
 'Aca Enterprises Llc',
 'Aca Enterprises Llc',
 'Meyer Builders Llc']
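
The pattern above only matches when a `c` sits directly before the numeric suffix (as in `inc` or `llc`). A slug without that `c` makes `findall` return an empty list, which is one possible reason for empty results on a larger data set. A hedged alternative is sketched below: it assumes every link ends in `-<digits>-<digits>`, as all the sample links do. `SLUG_RE` and `slug_name` are names invented for this sketch, and the third URL is hypothetical, added only to show a name that does not end in `c`.

```python
import re

urls = [
    'https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923',
    'https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401',
    # Hypothetical URL, not from the question: a slug that does not end in "c".
    'https://example.com/profile/smith-homes-0000-1234567',
]

# Assumed shape of the last path segment: <name>-<digits>-<digits>, at the end of the URL.
SLUG_RE = re.compile(r'/([^/]+?)-\d+-\d+$')

def slug_name(url):
    """Return the name part of the last path segment, or None if the shape differs."""
    match = SLUG_RE.search(url)
    return match.group(1) if match else None

names = [slug_name(u) for u in urls]
print(names)
```

Returning `None` instead of indexing into a possibly empty list makes unexpected links easy to spot rather than raising an `IndexError`.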

UPDATE:

import re

a = '''URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405,(704) 737-8409'''

for line in a.split('\n')[1:]:  # iterate over the lines of the string, skipping the header row
    # The business name is in the last substring when the line is split on slashes
    name = line.split('/')[-1]
    # .*c greedily matches every character up to and including a "c".
    # The lookahead (?=-\d) requires that "c" to be followed by a dash and a digit,
    # without including them in the match. This means the pattern only works when
    # a "c" directly precedes the numeric suffix (as in "inc" or "llc").
    name = re.findall(r'.*c(?=-\d)', name)
    print(name[0])  # findall returns a list; [0] gets the matched string

Output:

ag-builders-inc
russell-l-judy-builder-inc
russell-l-judy-builder-inc
aca-enterprises-llc
aca-enterprises-llc
meyer-builders-llc
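
To get from here to the dataframe described in the question (Company Name, Link, Phone Number(s)), it may be simpler to skip name parsing and join on the link itself, since both the dictionary and the CSV contain it. The following is a minimal sketch using pandas, not a definitive implementation: `company_links` and `csv_text` are illustrative names, the data is a shortened copy of the sample from the question, and the CSV is read from a string rather than from a file.

```python
import io
import pandas as pd

# Shortened copy of the dictionary from the question (name -> BBB link).
company_links = {
    'A. G. Builders, Inc.': 'https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923',
    'A. R. Russell': 'https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691',
    'A. R. Russell Builders, Inc.': 'https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691',
    'A.C.A. Enterprises, LLC': 'https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401',
}

# Shortened copy of the CSV from the question, including the repeated row.
csv_text = '''URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
'''

# One row per company name.
names = pd.DataFrame(list(company_links.items()), columns=['Company Name', 'Link'])

# One row per (link, phone) pair; repeated rows are dropped before joining.
phones = pd.read_csv(io.StringIO(csv_text)).drop_duplicates()
phones = phones.rename(columns={'URL Searched': 'Link'})

# A left join keeps every company, even one with no scraped phone number.
merged = names.merge(phones, on='Link', how='left')

# Collapse multiple numbers for the same company into one comma-separated cell.
result = (
    merged.groupby(['Company Name', 'Link'], sort=False)['Phone Numbers']
    .agg(lambda s: ', '.join(s.dropna()))
    .reset_index()
)
print(result[['Company Name', 'Phone Numbers']])
```

Two names in the dictionary share the same link (both A. R. Russell entries), so each of them receives that link's number. Whether that is the desired outcome is a judgment call about the data, not something the join can decide.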
Red
  • Could you go into a bit of detail on how this was constructed and why it worked? I am trying to become more familiar with re and don't understand how the outcome was achieved. – sjuice10 Jun 24 '20 at 01:07