
I'm trying to merge duplicate rows using this Python script. I make one column comma-separated, sum the rest, and at the end drop the duplicates using pandas, but I need to exclude some columns from being summed. For example, I don't want poly_area and total_area to be summed. What should I do?

import arcpy
import pandas as pd

output = r'C:dummy'

fieldlist = ["FID", "total_area", "POLY_AREA", "PERCENTAGE", "C5_3", "M1_4", "M1_4_R6A", "M1_4_R6B", "M1_4_R7A", "M1_5_R10",
             "M1_5_R7_3", "M1_5_R9", "M1_6_R10", "PARK", "R6A", "R6B", "R7A"]

# Create dataframe from cursor
df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor('calculations', fieldlist), columns=fieldlist)

# Create a new dataframe of FIDs and comma-separated percentages
df1 = df.groupby("FID")["PERCENTAGE"].apply(lambda x: ", ".join(x.astype(str))).reset_index()

# Create a new dataframe of sums per FID
df2 = df.groupby("FID").sum()
df2.drop("PERCENTAGE", axis=1, inplace=True)

# Merge/join them together and export as csv
df1.merge(df2, left_on="FID", right_index=True).to_csv(path_or_buf=output, index=False)
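For reference, here is a minimal sketch of the same pipeline on a small in-memory DataFrame instead of the arcpy cursor (the column values are made up); it shows that total_area and POLY_AREA currently get summed along with everything else:

```python
import pandas as pd

# Made-up data standing in for the arcpy.da.SearchCursor output
df = pd.DataFrame({
    "FID":        [1, 1, 2],
    "total_area": [10.0, 10.0, 5.0],
    "POLY_AREA":  [4.0, 6.0, 5.0],
    "PERCENTAGE": [40.0, 60.0, 100.0],
    "R6A":        [1, 2, 3],
})

# Comma-separated percentages per FID
df1 = df.groupby("FID")["PERCENTAGE"].apply(
    lambda x: ", ".join(x.astype(str))).reset_index()

# Sums per FID -- note total_area and POLY_AREA get summed too,
# which is exactly the behaviour the question wants to avoid
df2 = df.groupby("FID").sum()
df2.drop("PERCENTAGE", axis=1, inplace=True)

merged = df1.merge(df2, left_on="FID", right_index=True)
print(merged)  # total_area for FID 1 comes out as 20.0, not 10.0
```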
Jose Vasquez

2 Answers


This will do the job; just replace what you have with this:

# Create a new dataframe of FIDs and comma-separated percentages
df1 = df.groupby(["FID", "total_area", "POLY_AREA"])["PERCENTAGE"].apply(lambda x: ", ".join(x.astype(str))).reset_index()

# Create a new dataframe of sums per FID
df2 = df.groupby("FID").sum()
df2.drop(["total_area", "POLY_AREA", "PERCENTAGE"], axis=1, inplace=True)
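A minimal sketch on made-up data of how this works: grouping by FID plus the two area columns carries them through df1 unchanged, while df2 sums only the remaining columns, and the original merge still works because df1 keeps a FID column and df2 is indexed by FID:

```python
import pandas as pd

# Made-up data; the area columns are constant within each FID
df = pd.DataFrame({
    "FID":        [1, 1, 2],
    "total_area": [10.0, 10.0, 5.0],
    "POLY_AREA":  [4.0, 4.0, 5.0],
    "PERCENTAGE": [40.0, 60.0, 100.0],
    "R6A":        [1, 2, 3],
})

# Area columns become group keys, so they pass through un-summed
df1 = df.groupby(["FID", "total_area", "POLY_AREA"])["PERCENTAGE"].apply(
    lambda x: ", ".join(x.astype(str))).reset_index()

# Sum everything per FID, then discard the columns already held in df1
df2 = df.groupby("FID").sum()
df2.drop(["total_area", "POLY_AREA", "PERCENTAGE"], axis=1, inplace=True)

merged = df1.merge(df2, left_on="FID", right_index=True)
print(merged)  # total_area for FID 1 stays 10.0; R6A is summed to 3
```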
Zesima29

You could try taking a subset of your columns when creating df2 so that you exclude the ones you don't want. Specifically, try creating df2 like this:

df2_cols = [col for col in fieldlist if col not in ['FID', 'total_area', 'POLY_AREA']]
df2 = df.groupby("FID")[df2_cols].sum()

You can also just drop the columns you don't want after you create your merged df.
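The subset approach can be sketched end-to-end on made-up data (a shortened fieldlist is used here for illustration); the area columns never enter df2, and PERCENTAGE is still dropped afterwards as in the original script:

```python
import pandas as pd

# Shortened fieldlist and made-up values for illustration
fieldlist = ["FID", "total_area", "POLY_AREA", "PERCENTAGE", "R6A"]
df = pd.DataFrame({
    "FID":        [1, 1, 2],
    "total_area": [10.0, 10.0, 5.0],
    "POLY_AREA":  [4.0, 4.0, 5.0],
    "PERCENTAGE": [40.0, 60.0, 100.0],
    "R6A":        [1, 2, 3],
})

# Keep only the columns that should actually be summed
df2_cols = [col for col in fieldlist if col not in ['FID', 'total_area', 'POLY_AREA']]
df2 = df.groupby("FID")[df2_cols].sum()
df2.drop("PERCENTAGE", axis=1, inplace=True)  # as in the original script
print(df2)  # only R6A remains, summed per FID
```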

Leo
  • I'm trying to avoid loops, but I was trying to use this for df2 but it gives me errors. `df2 = df.groupby(["FID","total_area","POLY_AREA"]).sum()` ValueError: len(left_on) must equal the number of levels in the index of "right" – Jose Vasquez Jun 22 '18 at 15:59
  • I suppose you could do something like `df2_cols = list(set(fieldlist) - set(['FID', 'total_area', 'POLY_AREA']))` to avoid the loop but it will likely reorder your columns because sets do not preserve order. Look at [this](https://stackoverflow.com/a/2104348/4058364) answer for how to take subset of list which is not in another list. – Leo Jun 22 '18 at 16:06
  • Also, this is a bit out of scope for your question but the difference in performance between looping (in my answer) and using set difference (in my comment) is about 0.08 µs on my system. Small price to pay to preserve order :) – Leo Jun 22 '18 at 16:11
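The ordering point from the comments can be shown in a couple of lines (shortened fieldlist for illustration): the set difference yields the same members but in arbitrary order, while the list comprehension preserves the original column order:

```python
# Shortened fieldlist for illustration
fieldlist = ["FID", "total_area", "POLY_AREA", "PERCENTAGE", "C5_3", "R6A"]
exclude = ["FID", "total_area", "POLY_AREA"]

# List comprehension: original order preserved
loop_cols = [col for col in fieldlist if col not in exclude]

# Set difference: same members, but sets do not preserve order
set_cols = list(set(fieldlist) - set(exclude))

print(loop_cols)  # ['PERCENTAGE', 'C5_3', 'R6A']
print(sorted(set_cols) == sorted(loop_cols))  # True: same columns either way
```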