I have a large dataset and need to reconstruct some information - specifically, for each agent (row), how many of its variable number of neighbours match it on a specific dimension.
Relevant data look like this:
> head(sim[[1]] %>% select(who, identity, neighbourhood_string, time_point))
# A tibble: 6 x 4
who identity neighbourhood_string time_point
<dbl> <chr> <chr> <dbl>
1 10000 Conservative 1015, 46, 699 0
2 10001 Conservative 255, 1926, 1966, 473, 1864, 1199, 1544 0
3 10002 Liberal 716, 1807, 1366, 632, 1385 0
4 10003 Liberal 1151, 1035, 1048, 649, 217, 1889 0
5 10004 Liberal 944, 1497, 1834, 1309, 876 0
6 10005 Liberal 963, 1784, 1905, 432, 1452, 479 0
> tail(sim[[1]] %>% select(who, identity, neighbourhood_string, time_point))
# A tibble: 6 x 4
who identity neighbourhood_string time_point
<dbl> <chr> <chr> <dbl>
1 11964 Conservative 1259, 1682, 1858, 1741, 381, 1956, 1905, 361 100
2 11965 Conservative 1266, 927, 263, 936, 1709 100
3 11966 Liberal 1012, 1677, 102, 917, 1915, 616, 912, 1792 100
4 11967 Conservative 335, 155, 1369, 492, 1833 100
5 11968 Conservative 1295, 1795, 1238, 114, 1464 100
6 11969 Liberal 269, 994, 377, 1151, 746, 1959, 1348, 25 100
'who' is the agent (row) ID, with 10000 added to demarcate simulations apart (hence in the displayed data above, there are 1970 agents * 100 time points). Each agent's neighbours are uniquely identified by 'neighbourhood_string' (which added to 10000, matches their respective 'who' entries), and 'identity' is the feature to match on. Time_point indicates timestep in the simulation.
I am struggling to implement a rowwise operation that sums the number of matches between each agent's identity and those of it's variable number of neighbours. I can do this in an ugly, and more importantly, very slow nested for-loop (top layer is looping over simulations, interior is looping over rows within a given sim):
for (i in 1:20){
for (j in 1:length(sim[[i]]$who)){
# get neighbour ids
ids = strtoi(unlist(strsplit(sim[[i]]$neighbourhood_string[j], '[,]')))
# sum number of neighbours with same identity
sim[[i]]$real_neighbourhood[j] =
sum(sim[[i]]$identity[j] == sim[[i]]$identity[(sim[[i]]$who %in%
(ids + (i*10000))) &
sim[[i]]$time_point ==
sim[[i]]$time_point[j]])
}
}
However, it just isn't practical given the size of the data I'm working with (around 20GB). I'd dearly appreciate any advice on how to speed this up via, e.g., dplyr functions or some other very clever vectorized approach.
I have tried this:
for (i in 1:20){
test[[i]] = df1 %>% select(who, identity, neighbourhood_string, time_point) %>%
filter(who >= i*10000 & who < (i+1)*10000) %>%
group_by(time_point) %>%
rowwise() %>%
mutate(real_neighbourhood =
sum(identity == identity[who %in%
(strtoi(unlist(strsplit(neighbourhood_string, '[,]'))) + i*10000)]))
}
But it fails, I think, because I'm failing to appropriately tell it how to look 'down' the column of 'who' within a given time_point. I'm just not familiar enough with dplyr functionality to know how to do this, or if it will actually be faster (though I hardly think it would be slower!).
EDIT: Here is an example of the sort of output I'm trying to achieve, in dput format (first 200 rows):
dput(sim[[1]][1:200,]%>%select(who,identity,neighbourhood_string,time_point,real_neighbourhood))
structure(list(who = c(10000, 10001, 10002, 10003, 10004, 10005,
10006, 10007, 10008, 10009, 10010, 10011, 10012, 10013, 10014,
10015, 10016, 10017, 10018, 10019, 10020, 10021, 10022, 10023,
10024, 10025, 10026, 10027, 10028, 10029, 10030, 10031, 10032,
10033, 10034, 10035, 10036, 10037, 10038, 10039, 10040, 10041,
10042, 10043, 10044, 10045, 10046, 10047, 10048, 10049, 10050,
10051, 10052, 10053, 10054, 10055, 10056, 10057, 10058, 10059,
10060, 10061, 10062, 10063, 10064, 10065, 10066, 10067, 10068,
10069, 10070, 10071, 10072, 10073, 10074, 10075, 10076, 10077,
10078, 10079, 10080, 10081, 10082, 10083, 10084, 10085, 10086,
10087, 10088, 10089, 10090, 10091, 10092, 10093, 10094, 10095,
10096, 10097, 10098, 10099, 10100, 10101, 10102, 10103, 10104,
10105, 10106, 10107, 10108, 10109, 10110, 10111, 10112, 10113,
10114, 10115, 10116, 10117, 10118, 10119, 10120, 10121, 10122,
10123, 10124, 10125, 10126, 10127, 10128, 10129, 10130, 10131,
10132, 10133, 10134, 10135, 10136, 10137, 10138, 10139, 10140,
10141, 10142, 10143, 10144, 10145, 10146, 10147, 10148, 10149,
10150, 10151, 10152, 10153, 10154, 10155, 10156, 10157, 10158,
10159, 10160, 10161, 10162, 10163, 10164, 10165, 10166, 10167,
10168, 10169, 10170, 10171, 10172, 10173, 10174, 10175, 10176,
10177, 10178, 10179, 10180, 10181, 10182, 10183, 10184, 10185,
10186, 10187, 10188, 10189, 10190, 10191, 10192, 10193, 10194,
10195, 10196, 10197, 10198, 10199), identity = c("Conservative",
"Conservative", "Liberal", "Liberal", "Liberal", "Liberal", "Liberal",
"Liberal", "Liberal", "Conservative", "Liberal", "Liberal", "Liberal",
"Conservative", "Liberal", "Liberal", "Liberal", "Liberal", "Conservative",
"Conservative", "Liberal", "Liberal", "Conservative", "Liberal",
"Liberal", "Conservative", "Liberal", "Conservative", "Conservative",
"Conservative", "Conservative", "Liberal", "Conservative", "Conservative",
"Liberal", "Liberal", "Liberal", "Liberal", "Liberal", "Conservative",
"Conservative", "Conservative", "Conservative", "Liberal", "Liberal",
"Conservative", "Conservative", "Conservative", "Liberal", "Conservative",
"Liberal", "Liberal", "Conservative", "Conservative", "Liberal",
"Liberal", "Liberal", "Liberal", "Conservative", "Liberal", "Liberal",
"Conservative", "Liberal", "Conservative", "Liberal", "Conservative",
"Liberal", "Liberal", "Conservative", "Conservative", "Conservative",
"Liberal", "Conservative", "Liberal", "Conservative", "Liberal",
"Conservative", "Conservative", "Conservative", "Conservative",
"Liberal", "Conservative", "Conservative", "Conservative", "Conservative",
"Liberal", "Liberal", "Liberal", "Liberal", "Conservative", "Conservative",
"Conservative", "Liberal", "Liberal", "Liberal", "Conservative",
"Liberal", "Liberal", "Conservative", "Liberal", "Conservative",
"Liberal", "Liberal", "Conservative", "Liberal", "Liberal", "Conservative",
"Conservative", "Liberal", "Conservative", "Conservative", "Conservative",
"Liberal", "Conservative", "Conservative", "Liberal", "Conservative",
"Conservative", "Conservative", "Conservative", "Conservative",
"Conservative", "Conservative", "Conservative", "Liberal", "Liberal",
"Liberal", "Conservative", "Liberal", "Liberal", "Liberal", "Conservative",
"Conservative", "Conservative", "Liberal", "Liberal", "Liberal",
"Conservative", "Liberal", "Liberal", "Conservative", "Liberal",
"Liberal", "Liberal", "Liberal", "Conservative", "Conservative",
"Liberal", "Liberal", "Conservative", "Conservative", "Liberal",
"Liberal", "Conservative", "Liberal", "Liberal", "Liberal", "Liberal",
"Conservative", "Conservative", "Conservative", "Liberal", "Conservative",
"Conservative", "Liberal", "Conservative", "Liberal", "Conservative",
"Liberal", "Conservative", "Liberal", "Liberal", "Liberal", "Liberal",
"Conservative", "Conservative", "Liberal", "Liberal", "Liberal",
"Conservative", "Conservative", "Conservative", "Conservative",
"Liberal", "Conservative", "Conservative", "Liberal", "Conservative",
"Liberal", "Conservative", "Conservative", "Liberal", "Liberal",
"Liberal", "Liberal", "Liberal", "Liberal", "Conservative", "Liberal",
"Conservative"), neighbourhood_string = c("1015, 46, 699", "255, 1926, 1966, 473, 1864, 1199, 1544",
"716, 1807, 1366, 632, 1385", "1151, 1035, 1048, 649, 217, 1889",
"944, 1497, 1834, 1309, 876", "963, 1784, 1905, 432, 1452, 479",
"1406, 809, 693, 325, 945, 966", "301, 517, 1544, 1786, 1441, 1248",
"1122, 1724, 272, 249", "1851, 1304, 212, 528, 1375", "1774, 596, 749, 1740, 330, 800, 1923, 1805",
"1185, 298, 223, 574, 1853, 218, 1119, 1030", "1706, 734, 488, 1296, 1276",
"766, 1880, 359, 866, 1144, 519, 130", "1187, 679, 1008, 1467, 262, 865",
"1368, 776, 1126, 1672, 1573", "1300, 1841, 1566, 219, 1688",
"1418, 235, 305, 118, 86, 1670", "1016, 1595, 1574, 236, 516, 1913",
"1858, 1698, 928, 254, 1120", "175, 442, 395, 169, 240", "1929, 875, 1589, 318, 1937, 1793",
"452, 1941, 779, 1316, 1859", "1153, 1498, 1582, 666, 1315, 1606",
"1373, 729, 1026, 1765, 1052, 658", "640, 64, 475, 712, 687, 697",
"525, 911, 1388, 360, 389", "496, 1866, 1463, 1426, 1537, 1659, 36, 1222",
"1140, 396, 371, 140, 1703, 733", "946, 343, 336, 557, 544, 879",
"484, 1781, 1047, 706, 1526, 1116, 1214", "1203, 1305, 125, 1447",
"627, 1710, 678, 342, 1161, 700", "1067, 1282, 1192, 293, 1363",
"436, 1419, 77", "1204, 537, 44, 487, 1576, 1483, 299", "477, 1426, 27, 1659, 1463, 668, 1537",
"1942, 1654, 102, 929, 1405, 370", "1022, 1355, 902, 818, 573, 1564, 1667",
"283, 1568, 1938, 502, 1811, 1608", "1323, 1009, 110, 1552, 1207",
"869, 1802, 1630, 438, 356, 220", "522, 986, 1428, 1840, 1265, 1591, 1308, 314",
"88, 725, 225, 994, 1085, 1910, 1829", "299, 334, 35, 76, 1576",
"1340, 1908, 156, 1145, 1662", "1570, 0, 699, 1769", "1391, 938, 1929, 448, 1744, 1871, 1464",
"558, 1673, 1003, 1236, 1727", "914, 1211, 416, 1749, 1363, 73, 273",
"1607, 1131, 1221, 1178, 973, 454, 1668", "313, 1742, 1861, 1925",
"594, 864, 1527, 477, 1606, 515", "873, 965, 1387, 390, 618, 1182, 995",
"1761, 326, 1219, 67, 1234, 1238", "1108, 1275, 1261, 87, 811, 1155",
"646, 111, 1661, 906, 1736, 285", "186, 1350, 530, 394, 874",
"1238, 1547, 1502, 466, 1844, 1234, 1749", "1240, 1948, 1967, 456",
"1158, 1528, 1354, 638, 531, 90, 975", "722, 230, 560, 1322, 1296, 1239",
"506, 1521, 132, 1520, 709, 1848, 565", "1685, 1875, 1603, 1903, 814, 1256",
"712, 126, 983, 1054, 25", "68, 1645, 1086, 1025, 434", "656, 306, 853, 1219, 366, 326, 518",
"1021, 1006, 1679, 1271, 54, 1761", "895, 469, 616, 1025, 1086, 163, 144, 65",
"194, 584, 870, 127, 1107, 1735, 1831", "1642, 467, 1839, 1201, 1081, 124",
"1070, 580, 120, 1562, 892", "1076, 1761, 366, 1658, 326", "1749, 1502, 49, 1304, 1211, 212, 1363, 914",
"1371, 659, 1881, 426, 505, 992, 698, 1244", "1227, 1821, 1712, 311, 1325",
"962, 334, 1874, 44, 115, 299", "34, 1095, 436, 1699", "846, 1239, 1419, 1083, 560",
"1562, 924, 1253, 661, 1806, 1647", "1259, 98, 339, 1267, 1550, 463",
"367, 155, 980, 1616, 446, 189, 1693", "1731, 1969, 153, 1203, 1307, 1408",
"405, 820, 551, 239, 1849", "1366, 651, 347, 1438, 178", "344, 1274, 562, 1193, 783, 270, 648",
"118, 17, 1670", "1108, 55, 398, 1261, 811", "1624, 43, 1396, 994, 1912, 708, 1202, 1910",
"1864, 1889, 1212, 1709, 1966", "975, 1158, 1226, 147, 1528, 60",
"331, 105, 1567, 821, 1179", "450, 499, 934, 1563, 379, 322, 384",
"1117, 949, 620, 1585, 1530, 1618, 810", "731, 255, 1544, 1294",
"364, 527, 1235, 1068, 435, 1747, 481", "441, 431, 219, 1548, 267",
"549, 564, 534, 1459, 931", "1267, 339, 1259, 360, 389, 80",
"1174, 573, 161, 1362, 1564, 1667, 1029, 1657", "1165, 1435, 615, 1345, 1104, 677, 950, 1621",
"163, 434, 1255, 1086, 1050, 1059, 1225", "37, 370, 1942, 1218, 390",
"371, 187, 767, 1477, 1140, 1674, 1196", "1436, 137, 1417, 981, 687",
"91, 1215, 1800, 922, 587", "298, 218, 605", "1106, 481, 486, 1755, 1150, 1878, 1301",
"839, 926, 1043, 1098, 1796, 1812, 1639, 192", "1480, 1139, 258, 1229, 645",
"40, 1552, 145, 1009", "906, 1661, 901, 185, 285, 56", "1622, 837, 375, 1587, 453, 1611",
"900, 1233, 569, 1027, 714, 1460, 476", "355, 308, 773, 945, 325",
"962, 1665, 76, 912, 1874", "1383, 642, 1156, 731, 571, 1581",
"791, 1752, 1018, 1360, 971, 714", "235, 86, 1670, 17, 622, 1418",
"1725, 411, 1511, 1846, 1087, 1867, 128", "348, 71, 1070, 415, 580, 1243, 892",
"1501, 1401, 1351, 923", "464, 1443, 1536, 1687, 909, 998", "688, 607, 1775, 1670",
"1201, 1642, 467, 806, 1510, 1726, 70", "624, 1447, 31, 1203, 1307",
"877, 736, 208, 1953, 64", "1735, 194, 656, 69, 584, 870, 191",
"119, 1511, 595, 1867, 411, 1087, 346", "1085, 725, 307, 361, 211, 553, 1829",
"359, 519, 1410, 644, 1144, 13, 866", "1461, 615, 1946, 1186, 1150, 677, 1791",
"565, 1470, 62, 1533, 1520, 1813, 506", "416, 939, 273, 742, 869, 1896",
"196, 1329, 1245, 567, 327, 650, 1717, 1162", "498, 494, 759, 614",
"1946, 1303, 141, 445, 435, 629, 527", "687, 640, 104", "164, 1036, 961, 541, 1402, 175",
"964, 1580, 471, 429, 207", "767, 609, 28, 1140, 297, 1466",
"136, 1003, 365, 1236, 445", "1919, 760, 364, 964, 762, 1068",
"165, 465, 1386, 635, 841, 749", "793, 889, 68, 1025, 1155, 616",
"555, 110, 1397, 1011, 967", "1035, 1118, 962, 206, 912, 1135, 649, 1760",
"936, 1226, 90", "825, 1055, 723, 756, 279, 214, 1297", "564, 576, 1789, 1514, 1368, 776",
"1688, 942, 1841, 244, 152, 1365", "411, 932, 1867, 984", "942, 150, 691, 1078, 1355, 803",
"82, 1969, 925, 1307, 1408", "638, 464, 1687, 1327, 1949, 531, 1816, 1536",
"81, 229, 1475, 1813, 367", "253, 1340, 1145, 45, 302, 1908",
"209, 1495, 1767, 1597, 1756", "1428, 1591, 1445, 753, 1367, 314, 801",
"1822, 1421, 238, 1723, 717, 1092", "870, 1831, 194, 690, 653, 1283, 1810",
"1174, 317, 1644, 1657, 898, 1029, 1362, 99", "1743, 1446, 201, 1312, 1359, 589, 590",
"1059, 1900, 434, 1255, 68, 101, 895, 1086", "1040, 138, 442, 1402, 1017, 961, 175",
"143, 635, 1386, 1189, 984", "843, 1437, 1017, 1436, 1569, 1349, 1417",
"859, 1745, 474, 260, 626, 316", "338, 593, 1473, 1457, 1100, 1637",
"175, 20, 541, 1876, 854", "455, 1200", "851, 579, 1208, 1074, 266, 1565, 1888",
"1873, 511, 1258, 1830, 1369, 1333, 742", "312, 288, 825, 1750, 279",
"427, 764, 1347, 636, 847, 417, 433, 1285", "541, 138, 164, 854, 169, 20, 1036",
"1450, 494, 274", "303, 1707, 1093, 268", "1438, 457, 347, 1060, 651, 84, 225",
"457, 230, 734, 1438, 651", "1517, 1482, 1693, 189, 1071", "323, 1450, 1952, 1037, 1260, 887",
"1509, 267, 1422, 1034", "828, 1412, 1886, 296, 1165, 1427",
"730, 410, 1847, 493, 570, 1696, 654", "111, 1312, 901, 201, 1604, 1446",
"57, 394, 377, 530, 198", "609, 309, 1751, 953, 767, 1674, 103, 1330",
"1334, 877, 1953, 701, 599, 871", "180, 1693, 1079, 980, 446, 367, 81, 1482",
"369, 622, 1091, 617", "1105, 690, 127, 604, 870", "1812, 1012, 1796, 108, 1098, 1639, 710, 878",
"1178, 333, 1668, 623, 1173", "1107, 69, 870, 127, 160, 1831, 690, 1283",
"437, 496, 1724, 1222, 603, 1558", "759, 134, 1037, 567, 887, 1245, 741, 614",
"1113, 432, 1704, 1009, 479", "530, 1365, 186, 244, 1835", "544, 343, 642, 1581"
), time_point = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), real_neighbourhood = c(3,
0, 2, 5, 1, 2, 4, 5, 1, 2, 6, 6, 1, 1, 5, 5, 1, 3, 2, 2, 1, 2,
1, 3, 3, 2, 2, 4, 1, 4, 4, 2, 4, 4, 2, 3, 1, 2, 1, 2, 4, 5, 6,
2, 3, 3, 4, 3, 2, 5, 4, 0, 3, 3, 5, 4, 3, 5, 4, 2, 4, 2, 3, 4,
1, 5, 4, 3, 5, 4, 3, 2, 2, 2, 4, 2, 2, 1, 3, 2, 3, 3, 3, 2, 3,
2, 1, 4, 4, 2, 3, 2, 1, 1, 3, 5, 1, 2, 2, 4, 6, 1, 2, 4, 4, 3,
1, 3, 2, 1, 3, 2, 4, 3, 3, 1, 3, 4, 2, 3, 3, 3, 4, 3, 5, 4, 3,
3, 4, 4, 5, 2, 4, 6, 4, 2, 4, 0, 4, 3, 3, 1, 2, 3, 3, 3, 3, 1,
3, 1, 3, 2, 3, 2, 4, 2, 1, 3, 5, 1, 3, 4, 5, 5, 5, 2, 5, 2, 3,
4, 0, 3, 3, 2, 5, 3, 1, 3, 3, 3, 3, 2, 2, 1, 4, 5, 4, 6, 3, 5,
1, 3, 4, 3, 2, 5, 6, 4, 3, 3)), row.names = c(NA, -200L), class = c("tbl_df",
"tbl", "data.frame"))
It isn't complete, because the complexity of the data make it somewhat unweildy, but the real_neighbourhood variable encodes the sum of the identity matches for each row, depending on the neighbours defined in neighbourhood_string.