0

I have 2 csv files with data given in this fashion. How can i perform a basic matching and produce a result like the output. I'm matching based on the websites field. that's the key i'm using here for matching.

I tried Efficiently find matching rows (based on content) in a pandas DataFrame and https://macxima.medium.com/python-retrieve-matching-rows-from-two-dataframes-d22ad9e71879

but i'm not getting my desired output. Any assistance would be helpful

file1.csv
| id | web_1  |
|----|------|
| 1  | google.com |
| 2  | microsoft.in |
| 3  | yahoo.uk |
| 4  | adobe.us |


file2.csv
| id | web_2 |
|----|-----|
|2| microsoft.in |
| 3  | yahoo.uk |
| 4  | adobe.us |


output 
| id | web_1  | web_2  |
|----|------|--------|
| 1  | google.com | |
| 2  | microsoft.in | microsoft.in |
| 3  | yahoo.uk | yahoo.uk |
| 4  | adobe.us | adobe.us |
Byte
  • 9
  • 5
  • just merge on the `ids`? `pd.merge(df1,df2,on=['id'],how='outer')` – Umar.H Jun 25 '21 at 11:09
  • Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Anurag Dabas Jun 25 '21 at 11:42
  • no , loookin to merge based on urls, can the same logic be used? – Byte Jun 25 '21 at 12:06
  • @Umar.H your code is giving more records i guess the records from file2 is also getting added to the final output. i want my final output count to remain the same and there should be na extra col for the urls from file2 like ( web_2) in the output. the code here isn't working that way – Byte Jun 25 '21 at 12:18
  • that's because you have duplicates in one of your dataframes. – Umar.H Jun 25 '21 at 14:23
  • please upload the raw contents of your original csv files and the code you tried. i can help fix your code. – Joshua Jun 25 '21 at 14:28

1 Answers1

0

Based on your comment if you want to merge the dataframes in a way where the result only includes rows where the merge keys match you can do an inner join.

pandas.DataFrame.merge uses 'inner' as the default type of merge.

import pandas as pd

df1 = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "web_1": ["google.com", "microsoft.in", "yahoo.uk", "adobe.us"],
    }
)
df2 = pd.DataFrame(
    {
        "id": [2, 3, 4],
        "web_2": ["microsoft.in", "yahoo.uk", "adobe.us"],
    }
)

>>>  pd.merge(df1, df2)
   id         web_1         web_2
0   2  microsoft.in  microsoft.in
1   3      yahoo.uk      yahoo.uk
2   4      adobe.us      adobe.us

If you don't want to keep both web columns you can just drop one of them:

>>> pd.merge(df1, df2).drop(columns='web_2')
   id         web_1
0   2  microsoft.in
1   3      yahoo.uk
2   4      adobe.us

Drop and rename:

pd.merge(df1, df2).drop(columns='web_2').rename(columns={'web_1': 'web'})
   id           web
0   2  microsoft.in
1   3      yahoo.uk
2   4      adobe.us
5eb
  • 14,798
  • 5
  • 21
  • 65