1

I have a table like this:

+--------+--------+--------+--------+
| Value1 | Value2 | Value3 | Value4 |
+--------+--------+--------+--------+
|      1 |      2 |      3 |      4 |
|      1 |      2 |      3 |      4 |
|      1 |      2 |      5 |      7 |
|      2 |      7 |      5 |      6 |
|      1 |      2 |      3 |      4 |
|      8 |      5 |      1 |      3 |
|      2 |      7 |      5 |      6 |
+--------+--------+--------+--------+

I want to efficiently do this:

sum Value4 across all rows with the same value of Value1, Value2 and Value2 and create a new row (in another data table) with corresponding Value1, Value2, Value3 and sum of Value4's. So the output would be something like this:

+--------+--------+--------+--------+
| Value1 | Value2 | Value3 | Value4 |
+--------+--------+--------+--------+
|      1 |      2 |      3 |     12 |
|      1 |      2 |      5 |      7 |
|      2 |      7 |      5 |     12 |
|      8 |      5 |      1 |      3 |
+--------+--------+--------+--------+

I am actually able to do this in R, but not efficiently. Since my dataset has 24 000 000 rows with 13 variables, my code would take like 2 weeks to terminate.

edit: note that in my actual dataset, variables are not necessarily integers, but also characters.

  • 1
    Thank you! That's what I was looking for. Also, how do people find so quickly it is a duplicate? I was searching for 30 mins for a similar question. – Miroslav Výbošťok Aug 30 '18 at 17:58

0 Answers0