Summing columns by specific row identifier in a 2D array

Question

I am trying to create a function that adds the values of the last 12 columns of a 2D array if the first item in each row matches another. The first element of each row represents the department ID number so I need a cumulative total of how many hours the employees of each department worked for each month. My 2D array is a list of lists and a sample of the data looks like this:

[[12606.0, 74204.0, 56.0, 64.0, 72.0, 21.6, 18.0, 0.0, 0.0], 
 [12606.0, 105492.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 45.6], 
 [12606.0, 112151.0, 2.4, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
 [12606.0, 121896.0, 0.0, 0.0, 0.0, 0.0, 0.0, 60.8, 0.0]]

My code is as follows:

   def costcentersum(A):
    """Finds the sum of the monthly hours worked by employees in each cost center.
    """
    for i in range(len(A)):
        if A[i-1][0] == A[i][0]:
            A[i][2:] += A[i-1][2:]
    print A[i]

My output though is rather strange and is not the correct answer. I got this:

[12606.0, 121896.0, 0.0, 0.0, 0.0, 0.0, 0.0, 60.8, 0.0, 0.0, 0.0, 8.0, 15.2, 18.4, 2.4, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 45.6, 32.0, 54.4, 52.0, 56.0, 43.2, 56.0, 64.0, 72.0, 21.6, 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

I want my output to not include the second column of the data (the employee ID). Just simply the dep ID and the cumulative hours for each month. Where am I going wrong in my code? Thanks for any help!

Again, you should be clear about what your data structure actually is. If you correctly format it as a list of lists (`[[12606.0, 74240.0, ...], ...]`), it makes if easier for other people to copy-paste it to test your code. — jonrsharpe, Jun 23 '14 at 21:44

jonrsharpe · Answer 1 · 2014-06-23T22:17:14.813

The line

A[i][2:] += A[i-1][2:]

does not add all of the columns individually to the row above, it adds the lists; [1, 2] + [3, 4] == [1, 2, 3, 4]. This is why your output appears so long.

Your function doesn't actually return anything useful, it just prints the last row (only the last, as the print is outside the for loop). Also, your code makes no attempt to account for cost centers not being consecutive in the file.

A much better structure would be a dictionary, where the key is the cost center ID and the values are the sums of the last twelve columns for all rows where the first column was that cost center.

def cost_center_sum(arr):
    out = {}
    for row in arr:
        cc = int(row[0])
        if cc not in out:
            out[cc] = [0] * 12
        out[cc] = list(map(sum, zip(out[cc], row[2:]))) # add by column
    return out

The highlighted line does the work of actually summing up column-by-column.

The output for your sample input:

{12606: [58.4, 68.0, 72.0, 21.6, 18.0, 60.8, 45.6]}

Thanks Jon. I keep on coming across a syntax error though for the word return specifically. Did you find that as well? — user3758443, Jun 23 '14 at 22:03
@user3758443 no, works fine for me. Do you have the right number of parentheses on the line before? — jonrsharpe, Jun 23 '14 at 22:04

score 1 · Answer 2 · edited May 23 '17 at 12:10

You can use defaultdict to group all rows by ID number, and a comprehension along with zip(*array) to sum the columns (here's a Live Ideone Example to play with):

Python:

A = [[12606.0, 74204.0, 56.0, 64.0, 72.0, 21.6, 18.0, 0.0, 0.0], 
     [12606.0, 105492.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 45.6], 
     [12606.0, 112151.0, 2.4, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
     [12606.0, 121896.0, 0.0, 0.0, 0.0, 0.0, 0.0, 60.8, 0.0],
     [12901.0, 74204.0, 25.0, 15.0, 45.0, 38.6, 18.0, 0.0, 0.0],
     [12901.0, 105492.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 45.6]]

from collections import defaultdict
d = defaultdict(list)
for row in A:                 
    d[row[0]].append(row[2:])   # append all rows with same ID number.

print {k:[sum(col) for col in zip(*rows)] for k, rows in d.items()}

Output:

{12606.0: [58.4, 68.0, 72.0, 21.6, 18.0, 60.8, 45.6],
 12901.0: [25.0, 15.0, 45.0, 38.6, 18.0, 0.0, 45.6]}

Note: The zip(*rows) is used to transpose the rows so that you can sum by column. See this SO post.

Summing columns by specific row identifier in a 2D array

2 Answers2