Group individuals based on direct and indirect relationships

Question

I want to group individuals into different households on the basis of their address and their ownership of parcels. People belong to the same household if they live at the same address and if they are linked by ownership, directly or indirectly, of at least one parcel.

The links between individuals can be direct, i.e. two people have a parcel in common. But the link can also be indirect, through cross-linked forming chains - two people have a parcel in common and one of them has a parcel in common with someone else, and all live at the same address.

Here are some examples:

if one person (9) lives alone at his address (C), he will have his household alone even if another person (6) also owns his or her parcel (s).
If two people (12 and 13) live at the same address (F) and own the same parcel (w) then they are part of the same household. But if three people live at the same address (B) but only two people (7 and 8) own the same parcel (r) and a third person (6) who lives at this address (B) but owns another parcel (m) only the two who own the same parcel are from the same household.
If at the same address (A), 4 persons live (1, 2, 3 and 4), if persons (1, 2 and 3) are linked by the possession of several parcels (m, n and o) then they are part of the same household whereas the person (4) who also lives at this address but who does not possess any of these 3 parcels but another one (p) is not part of the same household.

I have three variables: address id, owner id and parcel id. I would like to get a household number. Here is an example table:

 id_address id_owner id_parcel id_household
          A        1         m            1
          A        1         n            1
          A        2         n            1
          A        2         o            1
          A        3         o            1
          A        4         p            2
          A        5         q            3
          B        6         s            4
          B        7         r            5
          B        8         r            5
          C        9         s            6
          D       10         t            7
          E       11         u            8
          E       11         v            8
          F       12         w            9
          F       13         w            9

My first instinct was to loop, but I have 800,000 rows and it might take too long.

Sample data where 'id_household' is the variable I want to create:

structure(list(id_address = c("A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "C", "D", "E", "E", "F", "F"), id_owner = c(1L, 
1L, 2L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 11L, 12L, 13L
), id_parcel = c("m", "n", "n", "o", "o", "p", "q", "s", "r", 
"r", "s", "t", "u", "v", "w", "w"), id_household = c(1L, 1L, 
1L, 1L, 1L, 2L, 3L, 4L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L)), class = "data.frame", row.names = c(NA, 
-16L))

I think `igraph::components` can be used here. See e.g. [identify groups of linked episodes which chain together](https://stackoverflow.com/questions/12135971/identify-groups-of-linked-episodes-which-chain-together/12170710#12170710) and Linked therein. You need to do it by group though, i.e. by "id_address", and join the memberships back to the original data. — Henrik, Jul 29 '21 at 19:29
Another similar: [function to track the changes in a field](https://stackoverflow.com/questions/67895328/function-to-track-the-changes-in-a-field/67911222#67911222). I bet someone can dig up a proper canonical answer ;) — Henrik, Jul 29 '21 at 19:54

Henrik · Answer 1 · 2021-07-30T18:39:36.553

Your question may be approached as a graph problem. The igraph package provides excellent tools. The columns 'id_owner' and 'id_parcel' can be treated as an edge list. The components function provides "the cluster id to which each vertex belongs". I use data.table for general data handling.

library(data.table)
library(igraph)

setDT(d)
d2 = d[ , {
  # create graph. columns id_owner and id_parcel are treated as an edge list.
  g = graph_from_data_frame(.SD)

  # get components of the graph that are directly or indirectly connected
  mem = components(g)$membership

  # grab the memberships and their names (i.e. the vertices) 
  .(id_parcel = names(mem), mem = mem)
 
  # do the above for each id_address
}, by = id_address]

# join the memberships to the original data
# paste with id_address for uniqueness
d[d2, on = .(id_address, id_parcel), id := paste0(id_address, mem)]

# if you want a consecutive integer as 'id', to make it agree with 'id_household'
d[ , id2 := as.integer(as.factor(id))]

Output:

d
#     id_address id_owner id_parcel id_household id id2
#  1:          A        1         m            1 A1   1
#  2:          A        1         n            1 A1   1
#  3:          A        2         n            1 A1   1
#  4:          A        2         o            1 A1   1
#  5:          A        3         o            1 A1   1
#  6:          A        4         p            2 A2   2
#  7:          A        5         q            3 A3   3
#  8:          B        6         s            4 B1   4
#  9:          B        7         r            5 B2   5
# 10:          B        8         r            5 B2   5
# 11:          C        9         s            6 C1   6
# 12:          D       10         t            7 D1   7
# 13:          E       11         u            8 E1   8
# 14:          E       11         v            8 E1   8
# 15:          F       12         w            9 F1   9
# 16:          F       13         w            9 F1   9

An alternative where the by operation is avoided. On the other hand some other steps are added, so whether it is more efficient depends on the structure of the data.

First, create 'composite variables' where address is concatenated with parcel and owner respectively. Create membership. Retrieve the original columns by splitting the names (tstrsplit(names(mem), "_", fixed = TRUE)). Join memberships to the original data

d[ , `:=`(
  address_parcel = paste(id_address, id_parcel, sep = "_"),
  address_owner = paste(id_address, id_owner, sep = "_"))]

d2 = d[ , {
  g = graph_from_data_frame(.SD[ , .(address_owner, address_parcel)])
  mem = components(g)$membership
  c(tstrsplit(names(mem), "_", fixed = TRUE), .(mem = mem))
}]

d[d2, on = c(id_address = "V1", id_parcel = "V2"), id_hh := mem]

Output:

d
#     id_address id_owner id_parcel id_household id_hh
#  1:          A        1         m            1     1
#  2:          A        1         n            1     1
#  3:          A        2         n            1     1
#  4:          A        2         o            1     1
#  5:          A        3         o            1     1
#  6:          A        4         p            2     2
#  7:          A        5         q            3     3
#  8:          B        6         s            4     4
#  9:          B        7         r            5     5
# 10:          B        8         r            5     5
# 11:          C        9         s            6     6
# 12:          D       10         t            7     7
# 13:          E       11         u            8     8
# 14:          E       11         v            8     8
# 15:          F       12         w            9     9
# 16:          F       13         w            9     9

Timing the two alternatives on larger data (original data repeated 1e4 times and new ids created within each chunk). For this particular data, the second alternative, which avoids by, is about 100 times faster.

# prepare toy data
d1 = as.data.table(d)
n = 1e4
dL = d1[rep(1:.N, n)]

# make unique id within the repeated data frames
dL[ , `:=`(
  id_address = paste(rep(1:n, each = nrow(d1)), sep = ".", id_address),
  id_owner = paste(rep(1:n, each = nrow(d1)), sep = ".", id_owner),
  id_parcel = paste(rep(1:n, each = nrow(d1)), sep = ".", id_parcel)
)]

Alternative 1: by address

dL1 = copy(dL) 

system.time({
d2 = dL1[ , {
  g = graph_from_data_frame(.SD)
  mem = components(g)$membership
  .(id_parcel = names(mem), mem = mem)
}, by = id_address]

dL1[d2, on = .(id_address, id_parcel), id_hh := paste(id_address, mem, sep = "_")]
})

#  user  system elapsed 
# 59.46    7.68   67.11

Alternative 2. Composite variables & splitting:

dL2 = copy(dL)

system.time({
dL2[ , `:=`(
  address_parcel = paste(id_address, id_parcel, sep = "_"),
  address_owner = paste(id_address, id_owner, sep = "_"))]

d3 = dL2[ , {
  g = graph_from_data_frame(.SD[ , .(address_owner, address_parcel)])
  mem = components(g)$membership
  c(tstrsplit(names(mem), "_", fixed = TRUE), .(mem = mem))
}]

dL2[d3, on = c(id_address = "V1", id_parcel = "V2"), id_hh := mem]
})

# user  system elapsed 
# 0.47    0.24    0.57

Test for equality:

all.equal(as.integer(as.factor(stringi::stri_pad_left(dL1$id_hh, 9, "0"))),
          dL2$id_hh) 
# TRUE

your code worked on my entire dataset of about 800 000 lines in about 7 minutes — Hugo Périlleux Sanchez, Jul 30 '21 at 00:54
@HugoPérilleuxSanchez Please note that I updated with a non-`by` alternative which is faster, at least on toy data composed by repeating your original data. — Henrik, Jul 30 '21 at 11:45
Would be interesting if you could test the new alternative on your real data. Does it work? Do you find a similar gain in speed? Cheers — Henrik, Jul 30 '21 at 11:54
On my set of data, finished in around 5 seconds. Thanks ! I will now try to understand your code! — Hugo Périlleux Sanchez, Jul 30 '21 at 13:47
Thanks for your feedback! Please don't hesitate to ask if you need clarification, or if you spot any flaws. Cheers. — Henrik, Jul 30 '21 at 14:51

score 0 · Answer 2 · answered Jul 30 '21 at 08:07

A friend-colleague solved the problem in Fortran wich is very fast (less than a second after sorting the file and compiling the program):

      program Nohousehold_
      character*1 Passer
      character*13 id_address,id_address1,id_address0,id_addressBL
      character*15 id_owner,id_owner1
      character*23 id_parcel
      character*15 L_owner(1000)
      character*28 L_owner_parcel(1000,1000)
      Integer No_owner, Nbparcel(1000)
      Integer Nohouseholdowner(1000)
      Integer NohouseholdTT
      id_address0='0 0          '
      id_addressBL='             '

      NohouseholdTT=0

      oownern(1,file='data_a.txt')
      oownern(2,file='RESULT3.TXT')
      Read(1,19)Passer             
 19   format(a1)

 81   read(1,11)id_address,id_owner,id_parcel
      if(id_address.eq.id_addressBL)go to 81
      id_owner1='ZZZZZZZZZZZZZZZ'
 82   continue
      if(id_address.eq.id_address0)then
        if(id_owner.ne.id_owner1)then
          NohouseholdTT=NohouseholdTT+1
          write(2,22)id_owner,NohouseholdTT
          id_owner1=id_owner
        endif
        read(1,11)id_address,id_owner,id_parcel
        go to 82
       else
        go to 83
      endif
     
 83   continue

      n=1
 11   format(a13,1x,a15,1x,a28)
 22   format(a15,1x,i10,1x,a28)

      id_address1=id_address
      id_owner1=id_owner
      No_owner=1
      L_owner(1)=id_owner
      Nbparcel(1)=1
      L_owner_parcel(1,1)=id_parcel
      Nohouseholdowner(1)=0
      n=1
      nFini=0
 14   continue
      n=n+1
      read(1,11,end=90)id_address,id_owner,id_parcel
      if(id_address.eq.id_address1)then
         if(id_owner.eq.id_owner1)then
            Nbparcel(No_owner)=Nbparcel(No_owner)+1
            L_owner_parcel(No_owner,Nbparcel(No_owner))=id_parcel
          else
            No_owner=No_owner+1
            L_owner(No_owner)=id_owner
            Nbparcel(No_owner)=1
            L_owner_parcel(No_owner,1)=id_parcel
            Nohouseholdowner(No_owner)=0
            id_owner1=id_owner
         endif
       ELSE
 556     continue
         do 1 Noowner1=1,No_owner 
          if(Nohouseholdowner(Noowner1).eq.0)then
            NohouseholdTT=NohouseholdTT+1
            Nohouseholdowner(Noowner1)=NohouseholdTT
            write(2,22)L_owner(Noowner1),NohouseholdTT
          endif
          do 2 Noowner2=(Noowner1+1),No_owner   
            do 201 Nbparcel1=1,Nbparcel(Noowner1)
              do 202 Nbparcel2=1,Nbparcel(Noowner2)
               if(L_owner_parcel(Noowner1,Nbparcel1).eq.L_owner_parcel(Noowner2,Nbparcel2))then
               if(Nohouseholdowner(Noowner2).eq.0)then
                Nohouseholdowner(Noowner2)=Nohouseholdowner(Noowner1)
                write(2,22)L_owner(Noowner2),NohouseholdTT
                go to 203
               endif
               endif
 202          continue
 201        continue 
 203        continue
 2        continue
 1       continue

         if(NFini.eq.1)go to 91

         id_address1=id_address
         id_owner1=id_owner
         No_owner=1
         L_owner(1)=id_owner
         Nbparcel(1)=1
         L_owner_parcel(1,1)=id_parcel
         Nohouseholdowner(1)=0

      endif
      go to 14
 90   continue
      NFini=1
      go to 556
 91   continue

      close(1)
      close(2)

      end

Group individuals based on direct and indirect relationships

2 Answers2