0

I'm trying to fill an existing spreadsheet with values from a separate CSV file with Python.

I have this long CSV file with emails and matching domains that I want to insert into a spreadsheet of business contact information. Basically, insert email into the email column where the 'website' column matches up.

The spreadsheet I'm trying to populate looks like this:

| Index | business_name   | email| website           |
| ---   | --------------- |------| ----------------- |
| 0     | Apple           |      | www.apple.com     |
| 1     | Home Depot      |      | www.home-depot.com|
| 4     | Amazon          |      | www.amazon.com    |
| 6     | Microsoft       |      | www.microsoft.com |

The CSV file I'm taking contacts from looks like this:

steve@apple.com, www.apple.com
jeff@amazon.com, www.amazon.com
marc@amazon.com, www.amazon.com
john@amazon.com, www.amazon.com
marc@salesforce.com, www.salesforce.com
dan@salesforce.com, www.salesforce.com

in Python:

index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]

col1 = ["steve@apple.com", "jeff@amazon.com", "marc@amazon.com", "john@amazon.com", "marc@salesforce.com", "Dan@salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]

# spreadsheet to insert values into
spreadsheet_df = pd.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
# csv file that is read
csv_df = pd.DataFrame({"col1":col1, "col2":col2})

Desired Output:

| Index | business_name   | email               | website           |
| ---   | --------------- |---------------------| ----------------- |
| 0     | Apple           |  steve@apple.com    | www.apple.com     |
| 1     | Home Depot      |         NaN         | www.home-depot.com|
| 4     | Amazon          |  jeff@amazon.com    | www.amazon.com    |
| 6     | Microsoft       |         NaN         | www.microsoft.com |

I want to iterate through every row in the CVS file to find where the 2nd column (in the CSV) matches the fourth column of the spreadsheet, then insert the corresponding value from the CSV file (value in the first column) into the 3rd column of the spreadsheet.

Up until now, I've had to manually insert email contacts from the CSV file into the spreadsheet which has become very tedious. Please save me from this monotony.

I've scoured stack overflow for an identical or similar thread but cannot find one. I apologize if there is a thread with this same issue, and if my post is confusing or lacking information as it is my first. There are multiple entries for a single domain, so ideally I want to append every entry in the CSV file to its matching row and column in the spreadsheet. This seems like an easy task at first but has become a massive headache for me.

grahamrob
  • 15
  • 5
  • Please go through [why not upload images](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question/285557#285557) and the sections on [minimal reproducible code](https://stackoverflow.com/help/minimal-reproducible-example). – anurag Feb 10 '21 at 16:05
  • Could you provide an example of what it is that you would like to achieve, to clarify? If I understand you correctly, you have a CSV file of websites and their corresponding e-mails and another file (which you term spreadsheet) containing more information. Using these, what you're essentially saying is that you would like to fill in the e-mail addresses that correspond to the websites in your spreadsheet? Also, what file format is your spreadsheet? – RDoc Feb 10 '21 at 16:06
  • @RDoc Exactly right. I'm trying to fill in the email column in my spreadsheet with emails from a CSV file. The spreadsheet is an excel file (.xlsx) and I've been reading it into python with pandas. – grahamrob Feb 10 '21 at 16:31
  • @anurag I ended up posting a screenshot of the table because whenever I included an inline table (with correct formatting) I received an error saying that my code wasn't formatted properly... even though it was and the output showed it being presented correctly. – grahamrob Feb 10 '21 at 16:34
  • I see. Finally, how would you like to handle multiple contacts belonging to the same company? Include only one? All of them? If so, how should that look? Separated by a comma? – RDoc Feb 10 '21 at 17:13
  • @RDoc Including only one of them would fit my purposes best. – grahamrob Feb 10 '21 at 18:57
  • @grahamrob but which one? – Paul H Feb 10 '21 at 19:14
  • @PaulH The first occurrence would work fine – grahamrob Feb 10 '21 at 19:21

3 Answers3

0

welcome to Stackoverflow, in the future please kindly follow these guidelines. In this scenario, please follow the community PANDAS guidelines as well. Following these guidelines are important in how the community can help you and how you can help the community as well.

First you need to provide and create a minimal and reproducible example for those helping you:

# Setup
index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]

col1 = ["steve@apple.com", "jeff@amazon.com", "marc@amazon.com", "john@amazon.com", "marc@salesforce.com", "Dan@salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]

# Create DataFrames
# In your code this is where you would read in the CSV and spreadsheet via pandas
spreadsheet_df = pd.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
csv_df = pd.DataFrame({"col1":col1, "col2":col2})

This will also help others who are reviewing this question in the future.

If I understand you correctly, you're looking to provide an email address for every company you have on the spread sheet:

You can accomplish it by reading in the csv and spreadsheet into a dataframe and merging them:

# Merge my two dataframes
df = spreadsheet_df.merge(csv_df, left_on="website", right_on="col2", how="left")
# Only keep the columns I want
df = df[["index", "business_name", "email", "website", "col1"]]

output:

   index business_name email             website             col1
0      0         apple             www.apple.com  steve@apple.com
1      1    home depot        www.home-depot.com              NaN
2      4        amazon            www.amazon.com  jeff@amazon.com
3      4        amazon            www.amazon.com  marc@amazon.com
4      4        amazon            www.amazon.com  john@amazon.com
5      6     microsoft         www.microsoft.com              NaN

Because you didn't provide an expected output, I don't know if this is correct.

Simon
  • 1,201
  • 9
  • 18
  • I hugely appreciate the feedback! I will rewrite my original post with what you mentioned. the solution you provided was close but not quite what I'm looking for in terms of output. sorry for the confusion, and thank you again for the help. – grahamrob Feb 10 '21 at 18:54
0

If you want to associate only the first email for a business in the CSV file with a website, you can do groupby/first on that and then merge with the business dataframe. I'm also going to drop the original email column since it serves no purpose


import pandas

index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]

col1 = ["steve@apple.com", "jeff@amazon.com", "marc@amazon.com", "john@amazon.com", "marc@salesforce.com", "Dan@salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]

# spreadsheet to insert values into
business = pandas.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
# csv file that is read
email = pandas.DataFrame({"email":col1, "website":col2})

output = (
    business
        .drop(columns=['email'])  #  this is empty and needs to be overwritten
        .merge(
            email.groupby('website', as_index=False).first(),  # just the first email
            on='website', how='left'  # left-join -> keep all rows from `business`
        )
        .loc[:, business.columns] # get your original column order back
)

And I get:

 index business_name            email             website
     0         apple  steve@apple.com       www.apple.com
     1    home depot              NaN  www.home-depot.com
     4        amazon  jeff@amazon.com      www.amazon.com
     6     microsoft              NaN   www.microsoft.com
Paul H
  • 65,268
  • 20
  • 159
  • 136
  • EXACTLY what I've been trying to get done. I knew it was going to be something simple Thank you so much! – grahamrob Feb 10 '21 at 21:31
0

Assuming that the spreadsheet is also a pandas dataframe and that it looks exactly like your image, there is a straightforward way of doing this using boolean indexing. I advise you to read about it further here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

First, I suggest that you turn your so-called CSV-file into a dictionary where the website is the key and the e-mail addresses are the values. Seeing as you don't need more than one contact, this works well. The reason I asked that question is that a dictionary cannot contain identical keys and thus some e-mail addresses would disappear. Achieving this is easily done by reading in the CSV-file as a pandas Series and doing the following:

series = pd.Series.from_csv('path_to_csv')
contacts_dict = series.to_dict()

Note that your order here would now be incorrect, in that you would have the e-mail as a key and the domain as a value. As such, you can do the following to swap them around:

dict_contacts = {value:key for key, value in dict_contacts.items()}

The reason for this step is that I believe that it it easier to work with an expanding list of clients.

Having done that, what you could simply do is then:

For i in dict_contacts.keys():
df1['e-mail'][df1['website'] == i] = dict_contacts[i] #THERE SHOULD BE AN INDENT HERE

What this does is that it filters out only the e-mail addresses for each unique key in the dictionary (i.e. the domain) and assigns it the value of that key, i.e. the e-mail.

Finally, I have deliberately attempted to provide you with a solution that is general and thus wouldn't require additional work in case you were to have 2000 different clients with unique domains and e-mails.

RDoc
  • 346
  • 1
  • 10
  • Edited as I realized that I missed a square bracket at the end of my last code segment and misspelled the name of the dictionary. – RDoc Feb 10 '21 at 21:11
  • Also a very solid solution! Thank you so much for the help. Sorry again for the confusing presentation of the problem. I learned a lot from you guys in this one post. Really appreciate it. – grahamrob Feb 10 '21 at 21:22