8

I'm having a fairly large dataset where I need to combine multiple entries into a single value. My dataset contains data on the combination of two datasets, each using their own ID's and keys.

I thought of using a Sumproduct() function like this:

=SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2);--('Raw data'!O:O={20;21;22;23;40});'Raw data'!S:S)

With Landgebruik!A2 holding an ID for the first dataset, which I need to aggregate the second dataset to.

'Raw data'!O:O contains the ID's from the second dataset. In the case above I need to sum the area (in 'Raw data'!S:S) when the value of the second ID is any of these values: {20;21;22;23;40}. (OR logic) The column only contains integer values.

Is there any other way of fixing this then duplicating --('Raw data'!O:O=20) for all values in the array?

EDIT:

I went with the work-around for now, which was: =SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2);--('Raw data'!O:O=20)+('Raw data'!O:O=20)+('Raw data'!O:O=21)+('Raw data'!O:O=22)+('Raw data'!O:O=23)+('Raw data'!O:O=40);'Raw data'!S:S). But I feel that there should be a more elegant way of doing this.

Luuklag
  • 3,897
  • 11
  • 38
  • 57
  • 1
    You could use `MATCH` in combination with a `IFERROR`. – MGP Feb 15 '19 at 13:36
  • 2
    Another option is to use `SUM(SUMIFS())` with array condition for O column. – BrakNicku Feb 15 '19 at 14:07
  • 1
    Could you add that as an answer @BrakNicku – Luuklag Feb 15 '19 at 14:14
  • 1
    Not sure about it. It is another possible solution to problem described in question, but it definitely does not answer the question in title. And - if you have a lot of rows (as your comments suggest), this solution might not perform well - 5 independent SUMIFS calculations. – BrakNicku Feb 15 '19 at 14:18
  • Assuming you can't have values like 20.5, you could reduce it to sumifs(...'Raw data'!O:O>=20,'Raw data'!O:O<=23...) and another sumifs(...'Raw data'!O:O=40...) – Tom Sharpe Feb 16 '19 at 09:02
  • @TomSharpe, yes only integers are allowed. – Luuklag Feb 16 '19 at 09:24
  • @TomSharpe, if you want you can add that as an answer. – Luuklag Feb 16 '19 at 10:14
  • 2
    If @BrakNicku were to post the SUM(SUMIFS)) as SUMPRODUCT(SUMIFS), then it would answer the question in the title :-) – Tom Sharpe Feb 16 '19 at 11:46
  • Just to make sure I understand this question properly, because I think I'm missing something important here. You want to sum values in `'Raw data'!S:S` only when in `'Raw data'!O:O` are any of these values: `{20;21;22;23;40}`? What I'm missing? Sorry, my english is not good and sometimes I do not comprend questions properly. – Foxfire And Burns And Burns Feb 19 '19 at 13:23
  • @FoxfireAndBurnsAndBurns And `'Raw data'!C:C` should match the value in `Landgebruik!A2` – Luuklag Feb 19 '19 at 14:20

7 Answers7

6

Even though this has been done hundreds of times before, hey maybe microsoft switched up the formulas or something.

I am partial to the method Jerry and Me suggested as they are simple as hell and concise, but you pay a heavy performance cost.

Tom's formula looks ugly to me but was fastest by far, about 4x faster than my initial example. We were able to incorporate the {}s with Tom's formula, but to get it to work we had to wrap the sumifs function with a sum function. This slowed down the formula considerably but made it prettier.

z32a7ul had a great solution too. I really like the use of -- and learned how to use |s to search for a text and only that text. At first glance I thought that it would not work on a number such as 2323 but it does.

Mock up example was as follows:

A1:A5000 was filled with LandgeBruik,

B1:B5000 was filled with 40's

C1:5000 was filled with 1's.


The results:

=SUMPRODUCT((A1:A5000="LandgeBruik")*(B1:B5000={20,21,22,23,40})*C1:C5000)

19.186031 seconds elapsed | 59,818,073 ticks

{=SUM(IF(A1:A5000="Landgebruik",1,0)*IF(B1:B5000={20,21,22,23,40},1,0)*C1:C5000)}

26.124411 seconds elapsed | 81,450,506 ticks

{=SUM((A1:A5000=""Landgebruik"")*(B1:B5000={20,21,22,23,40})*C1:C5000)}

21.111835 seconds elapsed | 65,822,330 ticks

"=SUMIFS(C1:C5000,B1:B5000,"">=20"",B1:B5000,""<=23"",A1:A5000,""=Landgebruik"")+SUMIFS(C1:C5000,B1:B5000,""=40"",A1:A5000,""=Landgebruik"")"

6.732804 seconds elapsed | 20,991,490 ticks

"=SUM(SUMIFS(C1:C5000,A1:A5000,"Landgebruik",B1:B5000,{21,22,23,24,40}))"

16.954528 seconds elapsed | 52,860,709 ticks

"=SUMPRODUCT(--(A1:A5000=""Landgebruik""),--NOT(ISERROR(FIND(""|""&B1:B5000&""|"",""|20|21|22|23|40|""))),C1:C5000)"

11.822379 seconds elapsed | 36,859,729 ticks

learnAsWeGo
  • 2,252
  • 2
  • 13
  • 19
  • 2
    By the way, you could also go with `{=SUM((B4:B9="Landgebruik")*(C4:C9={20,21,22,23,40})*D4:D9)}` since equality returns a boolean. – Jerry Feb 21 '19 at 08:07
  • 2
    @Jerry even better I am going to "steal" this comment and include – learnAsWeGo Feb 21 '19 at 14:46
  • 2
    Sure, it pretty much becomes the same as my proposed solution, with an "array-sum" instead of a "sum-array" xD – Jerry Feb 21 '19 at 17:03
6

You can use text search for this:

--NOT(ISERROR(FIND('Raw data'!O:O,"2021222340")))

But you have to be careful that a shorter ID is not found incorrectly in a longer ID, e.g. if you want to search among the IDs { 123, 456, 789 } then 12 is not considered to be among the IDs. So a simple text search like the above would not work. You need a delimiter character to break up the string of IDs. Usually I use the pipe character for this purpose, since I cannot remember any case when it occurred in the original text of an Excel file, and because it makes the formula human-readable:

--NOT(ISERROR(FIND("|"&'Raw data'!O:O&"|","|20|21|22|23|40|")))

Examples:

'Raw data'!O:O is 20 => |21| is found in |20|21|22|23|40|

'Raw data'!O:O is 2 => |2| is not found in |20|21|22|23|40|

(If your IDs may include the pipe character, then you can use CHR(1), a long forgotten ASCII code for SOH meaning start of header; of course, it's less readable.)

The whole formula:

=SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2),--NOT(ISERROR(FIND("|"&'Raw data'!O:O&"|","|20|21|22|23|40|"))),'Raw data'!S:S)

(Sorry, my Excel uses , instead of ;)

z32a7ul
  • 3,695
  • 3
  • 21
  • 45
  • 1
    I believe you can use `ISNUMBER()` instead of `NOT(ISERROR())`, though personally it feels a bit too much of roundabout way about this xD – Jerry Feb 21 '19 at 17:01
  • @Jerry Yes, you can, or you can use 1-ISERROR(...). I thought, however, that --NOT(ISERROR(...)) is the most readable. – z32a7ul Feb 21 '19 at 18:36
5

You could split it into two SUMIFS as mentioned in the comment. If all values are integers, then comparing 'Raw data'!O:O to 20,21,22 and 23 is the same as testing it for >=20 and <=23. The value 40 has to be done separately.

=SUMIFS('Raw Data'!S:S,'Raw Data'!C:C,Landgebruik!A2,'Raw Data'!O:O,">="&20,'Raw Data'!O:O,"<="&23)
+SUMIFS('Raw Data'!S:S,'Raw Data'!C:C,Landgebruik!A2,'Raw Data'!O:O,40)

in my locale

or

=SUMIFS('Raw Data'!S:S;'Raw Data'!C:C;Landgebruik!A2;'Raw Data'!O:O;">="&20;'Raw Data'!O:O;"<="&23)
+SUMIFS('Raw Data'!S:S;'Raw Data'!C:C;Landgebruik!A2;'Raw Data'!O:O;40)

in your locale.

This only works when several of the criteria are consecutive integers.

Speed considerations

SUMIFS is thought to be about five times faster than sumproduct so may be the preferred option for large datasets as demonstrated here

You could argue that the more general suggestion of (effectively) five SUMIFS within a SUM from @ BrakNicku should be about as fast as one SUMPRODUCT, but the SUM(SUMIFS) would probably still win because formulas like SUMIFS handle full-column references more efficiently than array formulas.

Tom Sharpe
  • 30,727
  • 5
  • 24
  • 37
  • 1
    I did not suggest using five `SUMIFS`, but one with `{20,21,22,23,40}` condition for `O` column, Calculation time in both cases would be the same, just easier for future maintenance. – BrakNicku Feb 16 '19 at 14:22
  • 1
    @BrakNicku, I'd love to see that written out. – Luuklag Feb 16 '19 at 15:36
  • 1
    @Luuklag "=SUM(SUMIFS(C1:C5000,A1:A5000,"Landgebruik",B1:B5000,{21,22,23,24,40}))" but it is appreciably slower than Toms forumla – learnAsWeGo Feb 22 '19 at 15:01
5

You could make a small change to your current formula; change the ; to * (-- are also unneeded in that particular case):

=SUMPRODUCT(('Raw data'!C:C=Landgebruik!A2)*('Raw data'!O:O={20;21;22;23;40})*'Raw data'!S:S)

And that should work.


When you feed separate parameters to SUMPRODUCT, each parameter has to be of the same size. But when you multiply them like this, it forces evaluation and the arrays expand.

For example, if you take two arrays, 5x1 and 1x5, you get a 5x5 resulting array:

enter image description here

Jerry
  • 70,495
  • 13
  • 100
  • 144
4

I want to give a shot to this question, after asking OP for some clarifications, because English is not my main language and I think I have misunderstood something.

So, what I did to simulate situation, made a new workbook with 2 sheets.

One sheet is named Landgebruik and got a value in A2 and I did this:

enter image description here

The second sheet is named Raw data. I hide some columns to use only columns C, O and S. In column S I input just values equal to 1. In column O I did randomly values equal to {20,21,22,23,40} and in Column C I did randomly values which were A or B. And it looks like this (please, note I hide some columns):

enter image description here

And question would like to sum values in column S but only If column O is equal to 20 or 21 or 22 or 23 o 40 and column C is equal to Landgebruik!A2 (in my test, value in there is letter A)

We can use an array formula to filter the data in column S and then, once filtered, sum values that meet requirements. In my test, the correct result would be 8, beause only 8 values in column S meet requirements of column C and O. In the image, the right rows are highlighted in yellow.

OP already did this, but wants to know if there is a shorter/elegant formula.

Shortest formula I found is like this:

=SUM(IF($O$2:$O$28={20;21;22;23;40};IF($C$2:$C$28=Landgebruik!$A$2;$S$2:$S$28)))

This is an array formula, so it must be inserted pressing CTRL+SHIFT+ENTER or it won't work!

HOW IT WORKS:

First IF takes all values in column S and ignores all where equivalent in Column O are not 20 or 21 or 22 or 23 or 40. Second IF takes that new array, and ignores all values where equivalent in column C are not equal to Landgebruik!$A$2. Final array is sumed up by the function SUM

I've tried to explain the best I can. I hope you can adapt this to your needs.

Luuklag
  • 3,897
  • 11
  • 38
  • 57
4

If you are interested in performance (calculation speed) and are not afraid of matrix calculation, you can use MMULT:

=SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2),MMULT(--('Raw data'!O:O={20,21,22,23,24}),TRANSPOSE({1,1,1,1,1})),'Raw data'!S:S)

Explanation:

First, you create a 1048576×5 matrix, where the value in the i-th row and j-th column is 1 if the ID in 'Raw data'!O:O's i-th line is the same as the j-th value in the enumeration {20,21,22,23,24}, 0 otherwise.

Second, you multiply this by a vector of 1s (5 1s because {20,21,22,23,24} contains five elements), which means that you accept all the five values.

Third, from the above you get a vector where the i-th element is 1 if the ID is among the accepted values, 0 otherwise, and you put this vector next to the others in your SUMPRODUCT.

(Sorry, my Excel uses ',' instead of ';'. If you want to shorten the formula, you may write {1;1;1;1;1} instead of TRANSPOSE({1,1,1,1,1}). But you have to find out what your Excel uses instead of ';' to separate rows, most probably '.'.)

Note: It may improve the speed of the calculation if you refer to the range which actualy contain values, not the whole column, e.g. 'Raw data'!C1:C123 instead of 'Raw data'!C:C.

If you insert new rows with Shift+Space Ctrl++ above the last row already included, then the references in your formulas will be updated automatically. Alternatively, you may use Names with special formulas that grow the Range referred to by determining the last non-empty cell.

Update

I made some measurements to compare the efficiency of these approaches. I used random data of 10000 rows and I recalculated each formula 1000 times. You can see the elapsed time in the second column.

Measurements

I commented out the other formulas while I ran this VBA code to measure the time:

Public Sub MeasureCalculationTime()
    Dim datStart As Date: datStart = Now

    Dim i As Long: For i = 1 To 1000
        Application.Calculate
    Next i

    Dim datFinish As Date: datFinish = Now
    Dim dblSeconds As Double: dblSeconds = (datFinish - datStart) * 24 * 60 * 60
    Debug.Print "Calculation finished at " & datFinish; " took " & dblSeconds & " seconds"
End Sub

In this scenario, MMULT was not the fastest.

However, I would like to point out that it is the most flexible because

  1. You may use it with switches: You refer to a cell range instead of the {1,1,1,1,1}, and you will be able to include / exclude IDs in the selection very quickly. Like you put into A1:A5 {20,21,22,23,24} and next to it, into B1:B5 {1,1,1,1,1}. If you want to exclude 21, then you rewrite B2 to 0, if you want to include it, you write it back to 1.

  2. You may use more complicated criteria, where you have to compare multiple levels. Like:

    =SUMPRODUCT(MMULT(--(CarId=CarOwner),--(CarOwner=ListOfJobs),--(ListOfJobs=JobsByDepartment),--(DepartmentIncludedInSelection=1)),FuelConsumption)

Note: The above line is just pseudocode, MMULT has only two parameters.

z32a7ul
  • 3,695
  • 3
  • 21
  • 45
  • 2
    Can you explain why this solution would have higher performance than the other solution you posted, maybe any of the other answers posted to this question? – Jerry Feb 21 '19 at 16:54
  • 1
    I would not say it's always the best but sometimes I measured their performance and MMULT was the fastest. Though those were really complicated scenarios, where I had e.g. 5 columns to evaluate as criteria and the comparison was not immediate but like CarId => EmployeeAssigned => JobTitle => DepartmentName. The best is always to measure. – z32a7ul Feb 21 '19 at 19:51
-1

This could work:

={SUMPRODUCT(--('Raw data'!C:C=Landgebruik!A2);--IFERROR(MATCH('Raw data'!O:O;{20;21;22;23;40};0)>0;0);'Raw data'!S:S)}

This needs to be entered as an array formula.

Luuklag
  • 3,897
  • 11
  • 38
  • 57
MGP
  • 2,480
  • 1
  • 18
  • 31
  • 1
    Also for future reference it is advised to explain on how and why your solution works. – Luuklag Feb 15 '19 at 13:55
  • 6
    I know how it works. But as we strive to build a repository of Q's and A's that are helpful to others besides OP I think it improves the quality of your answer when you add some explanation. Only if you feel like it of course. – Luuklag Feb 15 '19 at 14:02