-1

I have 4 CSV Files: CSV Files Picture

I want to combine the 4 files into one data frame. I have to Use the Invoices.Customer_ID and Customers.ID. When combining, I also have to make sure that the result set only contains customers and articles for which there are invoices and invoice items.

I have this simple code that reads the CSV file and displays the data.

from datetime import date, datetime
import os
import pandas as pd

article_csv = pd.read_csv('Input/Artikel.csv')  
Invoices_items_csv = pd.read_csv('Input/Rechnungen_Positionen.csv')
Customers_csv = pd.read_csv('Input/Kunden.csv')
Invoices_csv = pd.read_csv('Input/Rechnungen.csv')

Can someone help me here to achieve this goal? Thanks in advance

1 Answers1

0

According to this post, you can merge dataframes per column-definitions like this:

df = pd.merge(df1, df2, on=['document_id','item_id'])

So for your case, i think you would have to do this from left to right, meaning from article_csv to Invoices_items_csv to Customers_csv etc., because you can only merge 2 dataframes at a time. Keep in mind that this represents an inner-join, meaning only data with the specified columns existing on both dataframes will be merged, which is what you described. Then, you simply can use ID for the first 3 dataframes, and ID and Invoice_ID for the last merge.

import pandas as pd

# sample data
article_csv = pd.DataFrame({
    "ID": [1, 2, 3],
    "Code": ["abc", "def", "elk"],
    "Discount": [2, None, 1.5]
})

Customers_csv = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Country": ['AT']*8
})

Invoices_csv = pd.DataFrame({
    "ID": [1, 3],
    "Customer_id": [7, 8]
})

Invoices_items_csv = pd.DataFrame({
    "Invoice_ID": [1, 1, 3, 3],
    "Quantity": [5, 2, 7, 8]
})

# merge dataframes
merged_df = pd.merge(Customers_csv, article_csv, on=["ID", "ID"])
merged_df = pd.merge(merged_df, Invoices_csv, on=["ID", "ID"])
merged_df = pd.merge(merged_df, Invoices_items_csv, left_on="ID", right_on="Invoice_ID")
mnikley
  • 1,625
  • 1
  • 8
  • 21
  • Why have you used ID instead of Invoices.Customer_ID in this line ----> merged_df = pd.merge(merged_df, Invoices_csv, on=["ID", "ID"]) @mnikley – CodeRunner Feb 12 '22 at 16:34
  • My mistake - you probably want to use `merged_df = pd.merge(merged_df, Invoices_csv, left_on="ID", right_on="Customer_id")` instead, but you should get the drill. – mnikley Feb 12 '22 at 18:01