0

I have a large dataset and need to reconstruct some information - specifically, for each agent (row), how many of its variable number of neighbours match it on a specific dimension.

Relevant data look like this:

> head(sim[[1]] %>% select(who, identity, neighbourhood_string, time_point))
# A tibble: 6 x 4
    who identity     neighbourhood_string                   time_point
  <dbl> <chr>        <chr>                                       <dbl>
1 10000 Conservative 1015, 46, 699                                   0
2 10001 Conservative 255, 1926, 1966, 473, 1864, 1199, 1544          0
3 10002 Liberal      716, 1807, 1366, 632, 1385                      0
4 10003 Liberal      1151, 1035, 1048, 649, 217, 1889                0
5 10004 Liberal      944, 1497, 1834, 1309, 876                      0
6 10005 Liberal      963, 1784, 1905, 432, 1452, 479                 0

> tail(sim[[1]] %>% select(who, identity, neighbourhood_string, time_point))
# A tibble: 6 x 4
    who identity     neighbourhood_string                         time_point
  <dbl> <chr>        <chr>                                             <dbl>
1 11964 Conservative 1259, 1682, 1858, 1741, 381, 1956, 1905, 361        100
2 11965 Conservative 1266, 927, 263, 936, 1709                           100
3 11966 Liberal      1012, 1677, 102, 917, 1915, 616, 912, 1792          100
4 11967 Conservative 335, 155, 1369, 492, 1833                           100
5 11968 Conservative 1295, 1795, 1238, 114, 1464                         100
6 11969 Liberal      269, 994, 377, 1151, 746, 1959, 1348, 25            100

'who' is the agent (row) ID, with 10000 added to demarcate simulations apart (hence in the displayed data above, there are 1970 agents * 100 time points). Each agent's neighbours are uniquely identified by 'neighbourhood_string' (which added to 10000, matches their respective 'who' entries), and 'identity' is the feature to match on. Time_point indicates timestep in the simulation.

I am struggling to implement a rowwise operation that sums the number of matches between each agent's identity and those of it's variable number of neighbours. I can do this in an ugly, and more importantly, very slow nested for-loop (top layer is looping over simulations, interior is looping over rows within a given sim):

for (i in 1:20){
  for (j in 1:length(sim[[i]]$who)){
    
    # get neighbour ids
    ids = strtoi(unlist(strsplit(sim[[i]]$neighbourhood_string[j], '[,]')))
    
    # sum number of neighbours with same identity
    sim[[i]]$real_neighbourhood[j] = 
      sum(sim[[i]]$identity[j] == sim[[i]]$identity[(sim[[i]]$who %in% 
                                                       (ids + (i*10000))) & 
                                                      sim[[i]]$time_point == 
                                                      sim[[i]]$time_point[j]])
  }
}   

However, it just isn't practical given the size of the data I'm working with (around 20GB). I'd dearly appreciate any advice on how to speed this up via, e.g., dplyr functions or some other very clever vectorized approach.

I have tried this:

for (i in 1:20){
  test[[i]] = df1 %>% select(who, identity, neighbourhood_string, time_point) %>% 
    filter(who >= i*10000 & who < (i+1)*10000) %>% 
    group_by(time_point) %>% 
    rowwise() %>%
    mutate(real_neighbourhood = 
             sum(identity == identity[who %in% 
                                        (strtoi(unlist(strsplit(neighbourhood_string, '[,]'))) + i*10000)]))
}

But it fails, I think, because I'm failing to appropriately tell it how to look 'down' the column of 'who' within a given time_point. I'm just not familiar enough with dplyr functionality to know how to do this, or if it will actually be faster (though I hardly think it would be slower!).

EDIT: Here is an example of the sort of output I'm trying to achieve, in dput format (first 200 rows):

dput(sim[[1]][1:200,]%>%select(who,identity,neighbourhood_string,time_point,real_neighbourhood))
structure(list(who = c(10000, 10001, 10002, 10003, 10004, 10005, 
10006, 10007, 10008, 10009, 10010, 10011, 10012, 10013, 10014, 
10015, 10016, 10017, 10018, 10019, 10020, 10021, 10022, 10023, 
10024, 10025, 10026, 10027, 10028, 10029, 10030, 10031, 10032, 
10033, 10034, 10035, 10036, 10037, 10038, 10039, 10040, 10041, 
10042, 10043, 10044, 10045, 10046, 10047, 10048, 10049, 10050, 
10051, 10052, 10053, 10054, 10055, 10056, 10057, 10058, 10059, 
10060, 10061, 10062, 10063, 10064, 10065, 10066, 10067, 10068, 
10069, 10070, 10071, 10072, 10073, 10074, 10075, 10076, 10077, 
10078, 10079, 10080, 10081, 10082, 10083, 10084, 10085, 10086, 
10087, 10088, 10089, 10090, 10091, 10092, 10093, 10094, 10095, 
10096, 10097, 10098, 10099, 10100, 10101, 10102, 10103, 10104, 
10105, 10106, 10107, 10108, 10109, 10110, 10111, 10112, 10113, 
10114, 10115, 10116, 10117, 10118, 10119, 10120, 10121, 10122, 
10123, 10124, 10125, 10126, 10127, 10128, 10129, 10130, 10131, 
10132, 10133, 10134, 10135, 10136, 10137, 10138, 10139, 10140, 
10141, 10142, 10143, 10144, 10145, 10146, 10147, 10148, 10149, 
10150, 10151, 10152, 10153, 10154, 10155, 10156, 10157, 10158, 
10159, 10160, 10161, 10162, 10163, 10164, 10165, 10166, 10167, 
10168, 10169, 10170, 10171, 10172, 10173, 10174, 10175, 10176, 
10177, 10178, 10179, 10180, 10181, 10182, 10183, 10184, 10185, 
10186, 10187, 10188, 10189, 10190, 10191, 10192, 10193, 10194, 
10195, 10196, 10197, 10198, 10199), identity = c("Conservative", 
"Conservative", "Liberal", "Liberal", "Liberal", "Liberal", "Liberal", 
"Liberal", "Liberal", "Conservative", "Liberal", "Liberal", "Liberal", 
"Conservative", "Liberal", "Liberal", "Liberal", "Liberal", "Conservative", 
"Conservative", "Liberal", "Liberal", "Conservative", "Liberal", 
"Liberal", "Conservative", "Liberal", "Conservative", "Conservative", 
"Conservative", "Conservative", "Liberal", "Conservative", "Conservative", 
"Liberal", "Liberal", "Liberal", "Liberal", "Liberal", "Conservative", 
"Conservative", "Conservative", "Conservative", "Liberal", "Liberal", 
"Conservative", "Conservative", "Conservative", "Liberal", "Conservative", 
"Liberal", "Liberal", "Conservative", "Conservative", "Liberal", 
"Liberal", "Liberal", "Liberal", "Conservative", "Liberal", "Liberal", 
"Conservative", "Liberal", "Conservative", "Liberal", "Conservative", 
"Liberal", "Liberal", "Conservative", "Conservative", "Conservative", 
"Liberal", "Conservative", "Liberal", "Conservative", "Liberal", 
"Conservative", "Conservative", "Conservative", "Conservative", 
"Liberal", "Conservative", "Conservative", "Conservative", "Conservative", 
"Liberal", "Liberal", "Liberal", "Liberal", "Conservative", "Conservative", 
"Conservative", "Liberal", "Liberal", "Liberal", "Conservative", 
"Liberal", "Liberal", "Conservative", "Liberal", "Conservative", 
"Liberal", "Liberal", "Conservative", "Liberal", "Liberal", "Conservative", 
"Conservative", "Liberal", "Conservative", "Conservative", "Conservative", 
"Liberal", "Conservative", "Conservative", "Liberal", "Conservative", 
"Conservative", "Conservative", "Conservative", "Conservative", 
"Conservative", "Conservative", "Conservative", "Liberal", "Liberal", 
"Liberal", "Conservative", "Liberal", "Liberal", "Liberal", "Conservative", 
"Conservative", "Conservative", "Liberal", "Liberal", "Liberal", 
"Conservative", "Liberal", "Liberal", "Conservative", "Liberal", 
"Liberal", "Liberal", "Liberal", "Conservative", "Conservative", 
"Liberal", "Liberal", "Conservative", "Conservative", "Liberal", 
"Liberal", "Conservative", "Liberal", "Liberal", "Liberal", "Liberal", 
"Conservative", "Conservative", "Conservative", "Liberal", "Conservative", 
"Conservative", "Liberal", "Conservative", "Liberal", "Conservative", 
"Liberal", "Conservative", "Liberal", "Liberal", "Liberal", "Liberal", 
"Conservative", "Conservative", "Liberal", "Liberal", "Liberal", 
"Conservative", "Conservative", "Conservative", "Conservative", 
"Liberal", "Conservative", "Conservative", "Liberal", "Conservative", 
"Liberal", "Conservative", "Conservative", "Liberal", "Liberal", 
"Liberal", "Liberal", "Liberal", "Liberal", "Conservative", "Liberal", 
"Conservative"), neighbourhood_string = c("1015, 46, 699", "255, 1926, 1966, 473, 1864, 1199, 1544", 
"716, 1807, 1366, 632, 1385", "1151, 1035, 1048, 649, 217, 1889", 
"944, 1497, 1834, 1309, 876", "963, 1784, 1905, 432, 1452, 479", 
"1406, 809, 693, 325, 945, 966", "301, 517, 1544, 1786, 1441, 1248", 
"1122, 1724, 272, 249", "1851, 1304, 212, 528, 1375", "1774, 596, 749, 1740, 330, 800, 1923, 1805", 
"1185, 298, 223, 574, 1853, 218, 1119, 1030", "1706, 734, 488, 1296, 1276", 
"766, 1880, 359, 866, 1144, 519, 130", "1187, 679, 1008, 1467, 262, 865", 
"1368, 776, 1126, 1672, 1573", "1300, 1841, 1566, 219, 1688", 
"1418, 235, 305, 118, 86, 1670", "1016, 1595, 1574, 236, 516, 1913", 
"1858, 1698, 928, 254, 1120", "175, 442, 395, 169, 240", "1929, 875, 1589, 318, 1937, 1793", 
"452, 1941, 779, 1316, 1859", "1153, 1498, 1582, 666, 1315, 1606", 
"1373, 729, 1026, 1765, 1052, 658", "640, 64, 475, 712, 687, 697", 
"525, 911, 1388, 360, 389", "496, 1866, 1463, 1426, 1537, 1659, 36, 1222", 
"1140, 396, 371, 140, 1703, 733", "946, 343, 336, 557, 544, 879", 
"484, 1781, 1047, 706, 1526, 1116, 1214", "1203, 1305, 125, 1447", 
"627, 1710, 678, 342, 1161, 700", "1067, 1282, 1192, 293, 1363", 
"436, 1419, 77", "1204, 537, 44, 487, 1576, 1483, 299", "477, 1426, 27, 1659, 1463, 668, 1537", 
"1942, 1654, 102, 929, 1405, 370", "1022, 1355, 902, 818, 573, 1564, 1667", 
"283, 1568, 1938, 502, 1811, 1608", "1323, 1009, 110, 1552, 1207", 
"869, 1802, 1630, 438, 356, 220", "522, 986, 1428, 1840, 1265, 1591, 1308, 314", 
"88, 725, 225, 994, 1085, 1910, 1829", "299, 334, 35, 76, 1576", 
"1340, 1908, 156, 1145, 1662", "1570, 0, 699, 1769", "1391, 938, 1929, 448, 1744, 1871, 1464", 
"558, 1673, 1003, 1236, 1727", "914, 1211, 416, 1749, 1363, 73, 273", 
"1607, 1131, 1221, 1178, 973, 454, 1668", "313, 1742, 1861, 1925", 
"594, 864, 1527, 477, 1606, 515", "873, 965, 1387, 390, 618, 1182, 995", 
"1761, 326, 1219, 67, 1234, 1238", "1108, 1275, 1261, 87, 811, 1155", 
"646, 111, 1661, 906, 1736, 285", "186, 1350, 530, 394, 874", 
"1238, 1547, 1502, 466, 1844, 1234, 1749", "1240, 1948, 1967, 456", 
"1158, 1528, 1354, 638, 531, 90, 975", "722, 230, 560, 1322, 1296, 1239", 
"506, 1521, 132, 1520, 709, 1848, 565", "1685, 1875, 1603, 1903, 814, 1256", 
"712, 126, 983, 1054, 25", "68, 1645, 1086, 1025, 434", "656, 306, 853, 1219, 366, 326, 518", 
"1021, 1006, 1679, 1271, 54, 1761", "895, 469, 616, 1025, 1086, 163, 144, 65", 
"194, 584, 870, 127, 1107, 1735, 1831", "1642, 467, 1839, 1201, 1081, 124", 
"1070, 580, 120, 1562, 892", "1076, 1761, 366, 1658, 326", "1749, 1502, 49, 1304, 1211, 212, 1363, 914", 
"1371, 659, 1881, 426, 505, 992, 698, 1244", "1227, 1821, 1712, 311, 1325", 
"962, 334, 1874, 44, 115, 299", "34, 1095, 436, 1699", "846, 1239, 1419, 1083, 560", 
"1562, 924, 1253, 661, 1806, 1647", "1259, 98, 339, 1267, 1550, 463", 
"367, 155, 980, 1616, 446, 189, 1693", "1731, 1969, 153, 1203, 1307, 1408", 
"405, 820, 551, 239, 1849", "1366, 651, 347, 1438, 178", "344, 1274, 562, 1193, 783, 270, 648", 
"118, 17, 1670", "1108, 55, 398, 1261, 811", "1624, 43, 1396, 994, 1912, 708, 1202, 1910", 
"1864, 1889, 1212, 1709, 1966", "975, 1158, 1226, 147, 1528, 60", 
"331, 105, 1567, 821, 1179", "450, 499, 934, 1563, 379, 322, 384", 
"1117, 949, 620, 1585, 1530, 1618, 810", "731, 255, 1544, 1294", 
"364, 527, 1235, 1068, 435, 1747, 481", "441, 431, 219, 1548, 267", 
"549, 564, 534, 1459, 931", "1267, 339, 1259, 360, 389, 80", 
"1174, 573, 161, 1362, 1564, 1667, 1029, 1657", "1165, 1435, 615, 1345, 1104, 677, 950, 1621", 
"163, 434, 1255, 1086, 1050, 1059, 1225", "37, 370, 1942, 1218, 390", 
"371, 187, 767, 1477, 1140, 1674, 1196", "1436, 137, 1417, 981, 687", 
"91, 1215, 1800, 922, 587", "298, 218, 605", "1106, 481, 486, 1755, 1150, 1878, 1301", 
"839, 926, 1043, 1098, 1796, 1812, 1639, 192", "1480, 1139, 258, 1229, 645", 
"40, 1552, 145, 1009", "906, 1661, 901, 185, 285, 56", "1622, 837, 375, 1587, 453, 1611", 
"900, 1233, 569, 1027, 714, 1460, 476", "355, 308, 773, 945, 325", 
"962, 1665, 76, 912, 1874", "1383, 642, 1156, 731, 571, 1581", 
"791, 1752, 1018, 1360, 971, 714", "235, 86, 1670, 17, 622, 1418", 
"1725, 411, 1511, 1846, 1087, 1867, 128", "348, 71, 1070, 415, 580, 1243, 892", 
"1501, 1401, 1351, 923", "464, 1443, 1536, 1687, 909, 998", "688, 607, 1775, 1670", 
"1201, 1642, 467, 806, 1510, 1726, 70", "624, 1447, 31, 1203, 1307", 
"877, 736, 208, 1953, 64", "1735, 194, 656, 69, 584, 870, 191", 
"119, 1511, 595, 1867, 411, 1087, 346", "1085, 725, 307, 361, 211, 553, 1829", 
"359, 519, 1410, 644, 1144, 13, 866", "1461, 615, 1946, 1186, 1150, 677, 1791", 
"565, 1470, 62, 1533, 1520, 1813, 506", "416, 939, 273, 742, 869, 1896", 
"196, 1329, 1245, 567, 327, 650, 1717, 1162", "498, 494, 759, 614", 
"1946, 1303, 141, 445, 435, 629, 527", "687, 640, 104", "164, 1036, 961, 541, 1402, 175", 
"964, 1580, 471, 429, 207", "767, 609, 28, 1140, 297, 1466", 
"136, 1003, 365, 1236, 445", "1919, 760, 364, 964, 762, 1068", 
"165, 465, 1386, 635, 841, 749", "793, 889, 68, 1025, 1155, 616", 
"555, 110, 1397, 1011, 967", "1035, 1118, 962, 206, 912, 1135, 649, 1760", 
"936, 1226, 90", "825, 1055, 723, 756, 279, 214, 1297", "564, 576, 1789, 1514, 1368, 776", 
"1688, 942, 1841, 244, 152, 1365", "411, 932, 1867, 984", "942, 150, 691, 1078, 1355, 803", 
"82, 1969, 925, 1307, 1408", "638, 464, 1687, 1327, 1949, 531, 1816, 1536", 
"81, 229, 1475, 1813, 367", "253, 1340, 1145, 45, 302, 1908", 
"209, 1495, 1767, 1597, 1756", "1428, 1591, 1445, 753, 1367, 314, 801", 
"1822, 1421, 238, 1723, 717, 1092", "870, 1831, 194, 690, 653, 1283, 1810", 
"1174, 317, 1644, 1657, 898, 1029, 1362, 99", "1743, 1446, 201, 1312, 1359, 589, 590", 
"1059, 1900, 434, 1255, 68, 101, 895, 1086", "1040, 138, 442, 1402, 1017, 961, 175", 
"143, 635, 1386, 1189, 984", "843, 1437, 1017, 1436, 1569, 1349, 1417", 
"859, 1745, 474, 260, 626, 316", "338, 593, 1473, 1457, 1100, 1637", 
"175, 20, 541, 1876, 854", "455, 1200", "851, 579, 1208, 1074, 266, 1565, 1888", 
"1873, 511, 1258, 1830, 1369, 1333, 742", "312, 288, 825, 1750, 279", 
"427, 764, 1347, 636, 847, 417, 433, 1285", "541, 138, 164, 854, 169, 20, 1036", 
"1450, 494, 274", "303, 1707, 1093, 268", "1438, 457, 347, 1060, 651, 84, 225", 
"457, 230, 734, 1438, 651", "1517, 1482, 1693, 189, 1071", "323, 1450, 1952, 1037, 1260, 887", 
"1509, 267, 1422, 1034", "828, 1412, 1886, 296, 1165, 1427", 
"730, 410, 1847, 493, 570, 1696, 654", "111, 1312, 901, 201, 1604, 1446", 
"57, 394, 377, 530, 198", "609, 309, 1751, 953, 767, 1674, 103, 1330", 
"1334, 877, 1953, 701, 599, 871", "180, 1693, 1079, 980, 446, 367, 81, 1482", 
"369, 622, 1091, 617", "1105, 690, 127, 604, 870", "1812, 1012, 1796, 108, 1098, 1639, 710, 878", 
"1178, 333, 1668, 623, 1173", "1107, 69, 870, 127, 160, 1831, 690, 1283", 
"437, 496, 1724, 1222, 603, 1558", "759, 134, 1037, 567, 887, 1245, 741, 614", 
"1113, 432, 1704, 1009, 479", "530, 1365, 186, 244, 1835", "544, 343, 642, 1581"
), time_point = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), real_neighbourhood = c(3, 
0, 2, 5, 1, 2, 4, 5, 1, 2, 6, 6, 1, 1, 5, 5, 1, 3, 2, 2, 1, 2, 
1, 3, 3, 2, 2, 4, 1, 4, 4, 2, 4, 4, 2, 3, 1, 2, 1, 2, 4, 5, 6, 
2, 3, 3, 4, 3, 2, 5, 4, 0, 3, 3, 5, 4, 3, 5, 4, 2, 4, 2, 3, 4, 
1, 5, 4, 3, 5, 4, 3, 2, 2, 2, 4, 2, 2, 1, 3, 2, 3, 3, 3, 2, 3, 
2, 1, 4, 4, 2, 3, 2, 1, 1, 3, 5, 1, 2, 2, 4, 6, 1, 2, 4, 4, 3, 
1, 3, 2, 1, 3, 2, 4, 3, 3, 1, 3, 4, 2, 3, 3, 3, 4, 3, 5, 4, 3, 
3, 4, 4, 5, 2, 4, 6, 4, 2, 4, 0, 4, 3, 3, 1, 2, 3, 3, 3, 3, 1, 
3, 1, 3, 2, 3, 2, 4, 2, 1, 3, 5, 1, 3, 4, 5, 5, 5, 2, 5, 2, 3, 
4, 0, 3, 3, 2, 5, 3, 1, 3, 3, 3, 3, 2, 2, 1, 4, 5, 4, 6, 3, 5, 
1, 3, 4, 3, 2, 5, 6, 4, 3, 3)), row.names = c(NA, -200L), class = c("tbl_df", 
"tbl", "data.frame"))

It isn't complete, because the complexity of the data make it somewhat unweildy, but the real_neighbourhood variable encodes the sum of the identity matches for each row, depending on the neighbours defined in neighbourhood_string.

Adam
  • 1
  • 1
  • Would be good if you provided a small dataset to play with. – Dasr Aug 05 '22 at 07:33
  • I do not understand your question/desired output. Can you create a small toy-dataset (which represents the problem) to work with, and your desired output of this toy-dataset? – Wimpel Aug 05 '22 at 08:52
  • I agree with @wimpel and Dasr. It's impossible to help you without a clear statement of your desired output. However, I suspect your fundamental problem is not the algorithm, but your data format. I suspect that if you transform the data from one-row-per agent to one-row-per-agent-neighbour-combination, your life will become much simpler. – Limey Aug 05 '22 at 08:55
  • Thanks all - I've edited an update with an example tibble. Hope that suffices for clarity, but happy to add more if more is needed. – Adam Aug 05 '22 at 10:29
  • Thank you for providing sample data. In the future, especially when there are embedded spaces in your data (also relevant with `factor`, `Date`, and `POSIXct`-class columns), please provide it in a more consumable format such as `dput` or `read.table`, see https://stackoverflow.com/q/5963269 for good discussions/demonstrations of that. – r2evans Aug 05 '22 at 11:32
  • Sample data updated into dput format! Thanks for teaching me about that too, along the way. – Adam Aug 05 '22 at 14:54

1 Answers1

0

A data.table solution appears to be pretty performant.

Since we can't reproduce the expected results with a truncated table, I generated a dummy table, so pardon that individuals can be their own neighbour and that neighbour assignments are directional.

We can using straight indexing instead of joins since the values of who are sequential within time_point.

library(data.table)
set.seed(94)

sim <- data.table(
  who = rep(1e4:11969, 101),
  identity = c("Conservative", "Liberal")[rbinom(1970*101, 1, rep(rbeta(1970, 0.01, 0.01), 101)) + 1L],
  neighbourhood_string = replicate(101, paste(sample(0:1969, sample(10, 1)), collapse = ", ")),
  time_point = rep(0:100, each = 1970)
)

system.time({
  sim[
    , neighbours := lapply(strsplit(neighbourhood_string, ", "), as.integer)
  ][
    , neighbour_count := sim[
      , .(x = diff(
        c(
          0L,
          cumsum(
            rep.int(identity, lengths(neighbours)) == identity[unlist(neighbours) + 1L]
          )[
            cumsum(lengths(neighbours))
          ]
        )
      )), time_point
    ][[2]]
  ][
    , neighbours := NULL
  ]
})
#>    user  system elapsed 
#>    1.11    0.04    1.16

head(sim)
#>      who     identity                       neighbourhood_string time_point neighbor_count
#> 1: 10000 Conservative    670, 420, 522, 1461, 332, 502, 922, 283          0              5
#> 2: 10001      Liberal      1038, 202, 1628, 1625, 603, 1825, 834          0              6
#> 3: 10002      Liberal 1529, 1612, 1656, 560, 1402, 501, 209, 706          0              3
#> 4: 10003 Conservative                                  1776, 791          0              1
#> 5: 10004      Liberal 497, 822, 211, 1233, 1254, 1126, 1269, 523          0              5
#> 6: 10005      Liberal                                 1125, 1204          0              2

Note that most of the computational time is spent converting the neighbourhood_string to a list of vectors. If the data were to be generated with the neighbour IDs as integer vectors, the above approach could be much faster:

sim <- data.table(
  who = rep(1e4:11969, 101),
  identity = c("Conservative", "Liberal")[rbinom(1970*101, 1, rep(rbeta(1970, 0.01, 0.01), 101)) + 1L],
  neighbours = replicate(101, sample(0:1969, sample(10, 1))),
  time_point = rep(0:100, each = 1970)
)

system.time({
  sim[
    , neighbour_count := sim[
      , .(x = diff(
        c(
          0L,
          cumsum(
            rep.int(identity, lengths(neighbours)) == identity[unlist(neighbours) + 1L]
          )[
            cumsum(lengths(neighbours))
          ]
        )
      )), time_point
    ][[2]]
  ]
})
#>    user  system elapsed 
#>    0.02    0.03    0.04

head(sim)
#>      who     identity                                 neighbours time_point neighbor_count
#> 1: 10000 Conservative    670, 420, 522, 1461, 332, 502, 922, 283          0              5
#> 2: 10001      Liberal      1038, 202, 1628, 1625, 603, 1825, 834          0              6
#> 3: 10002      Liberal 1529, 1612, 1656, 560, 1402, 501, 209, 706          0              3
#> 4: 10003 Conservative                                  1776, 791          0              1
#> 5: 10004      Liberal 497, 822, 211, 1233, 1254, 1126, 1269, 523          0              5
#> 6: 10005      Liberal                                 1125, 1204          0              2
jblood94
  • 10,340
  • 1
  • 10
  • 15