2

Update

Thanks to @jazzurro for his anwer. It made me realize that the duplicates may just complicate things. I hope by keeping only unique values/row simplifies the task.*

df <- data.frame(ID = c(1,2,3,4,5), 
                  CTR1 = c("England", "England", "England", "China", "Sweden"),
                  CTR2 = c("England", "China", "China", "England", NA),
                  CTR3 = c("USA", "USA", "USA", "USA", NA),
                  CTR4 = c(NA, NA, NA, NA, NA),
                  CTR5 = c(NA, NA, NA, NA, NA),
                  CTR6 = c(NA, NA, NA, NA, NA))


ID CTR1    CTR2    CTR3 CTR4 CTR5 CTR6
1  England China   USA
2  England China   USA
3  England China   USA
4  China   England USA
5  Sweden

It is still the goal to create a co-occurrence matrix (now) based on the following four conditions:

  1. Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is counted as 0.

  2. A combination/co-occurrence should be counted as 1.

  3. Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.

  4. There is no value over 1 assigned to a combination by row/ID.

Aspired Result

         China   England   USA   Sweden

China    4        4         4      0

England  4        4         4      0        

USA      4        4         4      0

Sweden   0        0         0      0

*I've used the code from here to remove all non-unique observations.


Original Post

Assume I have a data set with a low two digit number of columns (some NA/empty) and more than 100.000 rows, represented by the following example dataframe

df <- data.frame(ID = c(1,2,3,4,5), 
                  CTR1 = c("England", "England", "England", "China", "England"),
                  CTR2 = c("England", "China", "China", "England", NA),
                  CTR3 = c("England", "China", "China", "England", NA),
                  CTR4 = c("China", "USA", "USA", "China", NA),
                  CTR5 = c("USA", "England", "USA", "USA", NA),
                  CTR6 = c("England", "China", "USA", "England", NA))


df

ID   CTR1    CTR2    CTR3    CTR4   CTR5    CTR6         
1    England England England China  USA     England 
2    England China   China   USA    England China
3    England China   China   USA    USA     USA  
4    China   England England China  USA     England
5    England 

and I want to count the co-occurrences by ID/row to get a co-occurrence matrix that sums up the co-occurence by ID/row only once, meaning that no value over 1 will be allocated to a combination (i.e. assign a value of 1 for the existence of a co-occurrence independent of in-row frequencies and order, assign a value of 0 for no co-occurrence/combination by ID/row),

1 England-England-England => 1
2 England-England => 1
3 England-China => 1
4 England- => 0

Another important aspects regards the counting of observations that appear once in a row but in combination with others, e.g. USA in row 1. They should get a value of 1 for their own co-occurrence (as they are in a combination even though not with themselves) so that the combination USA-USA also gets a value of 1 assigned.

1    England England England China  USA  England 
USA-USA => 1
China-China => 1
USA-China => 1
England-England => 1
England-USA => 1
England-China => 1

Due to the fact that row count should not >1 for a combination by row/ID, this results to:

        China   England   USA 

China    1        1         1        

England  1        1         1        

USA      1        1         1

This should lead to the following result based on the example dataframe, where a value of 4 is assigned to each combination based on the fact that each combination has occured at least in four rows and each string is part of a combination of the original dataframe:

         China   England   USA 

China    4        4         4        

England  4        4         4        

USA      4        4         4

So there are five conditions for counting:

  1. Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is not counted.
  2. A combination should be counted as 1.
  3. Observations occuring more than once do not contribute to a higher value for the interaction, i.e. several occurrences of the same country do not matter.
  4. Being in a combination (even in the case the same country does not appear twice in a row) results in counting as a self-combination, i.e. a value of 1 is assigned.
  5. There is no value over 1 assigned to a combination by row/ID.

I've tried to implement this by using dplyr, data.table, base aggregate or plyr adjusting code from [1], [2], [3], [4], [5] and [6] but as I don't care about order within a row but I also don't want to sum up all combinations within a row, I haven't got the aspired result so far.

I'm a novice in R. Any help is very much appreciated.

Seb
  • 179
  • 1
  • 8
  • 1
    Can you explain this part in a clear way: `So single observations without additional observations by ID/row are not considered`? Do you mean that any row which has only one observation is not considered (e.g., CTR1 has a country, but all others are NA)? If so, there is no such case in the sample data. Is that right? – jazzurro Jan 09 '20 at 12:11
  • Yes, exactly. There's no case in the sample data here. Does it help if I amend the example accordingly? – Seb Jan 09 '20 at 12:17
  • Amended the reproducible example. – Seb Jan 09 '20 at 12:32
  • 1
    Thanks. Can I ask how you get 4 for USA-USA combination from your data? I am very confused. – jazzurro Jan 09 '20 at 13:08
  • Yes, of course. 1. USA is within four rows. 2. It is not the only value in one row (so it won't be counted as 0). 3. As part of any combination the value of 1 should be repeated for the string itself. So, in row 1 because England-USA and China-USA get a value of 1, USA-USA should get a value of 1 - as it is part of the other combinations. – Seb Jan 09 '20 at 13:21
  • 1
    Thanks again. Let me ask you more. Take row 3 as an example. England appears once. China appears twice. USA appears three times. Given your description, I think you want to consider each country appearing once. That is, you want to remove duplication. Then, you want to have all possible combination of the three countries for row 3. Is this what you are thinking? – jazzurro Jan 09 '20 at 14:06
  • Yes, I'm afraid I haven't thought about that in this way but you are right. If I see it that way, it should be a lot easier. First remove duplicates, then count as combination, right? – Seb Jan 09 '20 at 14:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/205679/discussion-between-jazzurro-and-seb). – jazzurro Jan 09 '20 at 14:17

2 Answers2

2

DATA

I modified your data so that data can represent your actual situation.

#   ID    CTR1    CTR2    CTR3  CTR4    CTR5    CTR6
#1:  1 England England England China     USA England
#2:  2 England   China   China   USA England   China
#3:  3 England   China   China   USA     USA     USA
#4:  4   China England England China     USA England
#5:  5  Sweden    <NA>    <NA>  <NA>            <NA>


df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England", 
"England", "China", "Sweden"), CTR2 = c("England", "China", "China", 
"England", NA), CTR3 = c("England", "China", "China", "England", 
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA", 
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA", 
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA, 
-5L))

UPDATE

After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.

# Transform the data to long-format data. Remove rows that have zero character (i.e, "") or NA. 

melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo

# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2

# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().

crossprod(table(foo2[, c(1,3)])) -> mymat

# Finally, you need to change diagonal values. If a value is equal to one,
# change it to zero. Otherwise, keep the original value.

diag(mymat) <- ifelse(diag(mymat) <= 1, 0, mymat)

#value
#value     China England Sweden USA
#China       4       4      0   4
#England     4       4      0   4
#Sweden      0       0      0   0
#USA         4       4      0   4
jazzurro
  • 23,179
  • 35
  • 66
  • 76
  • Thank you! I still want to count USA-USA, China-China and England-England based on single frequency in case of not being the only string a row. I'm not sure if it helps but I've asked a similar question regarding co-occurrence calculation (though in a different way of course) the other day: https://stackoverflow.com/questions/59632794/how-to-calculate-a-co-occurrence-matrix-from-a-data-frame-with-several-columns. – Seb Jan 09 '20 at 18:25
  • 1
    @Seb I am in my mini break and my mind got a clear picture of what you were saying after seeing your previous post. I updated my suggestion. – jazzurro Jan 10 '20 at 03:08
1

Here is an option using base::table:

#get paired combi and remove those from same country
pairsDF <- as.data.frame(do.call(rbind, 
    by(df, df$ID, function(x) t(combn(unlist(x[-1L]), 2L)))))

#tabulate pairs
duppairs <- rbind(pairsDF, data.frame(V1=pairsDF$V2, V2=pairsDF$V1))
tab <- table(duppairs, useNA="no")

#set diagonals to be the count of countries if count is at least 2
cnt <- c(table(unlist(df[-1L])))
cnt[cnt==1L] <- 0L
diag(tab) <- cnt[names(diag(tab))]

output:

         V2
V1        China England Sweden USA
  China       4       4      0   4
  England     4       4      0   4
  Sweden      0       0      0   0
  USA         4       4      0   4

data:

df <- data.frame(ID = c(1,2,3,4,5), 
    CTR1 = c("England", "England", "England", "China", "Sweden"),
    CTR2 = c("China", "China", "China", "England", NA),
    CTR3 = c("USA", "USA", "USA", "USA", NA),
    CTR4 = c(NA, NA, NA, NA, NA),
    CTR5 = c(NA, NA, NA, NA, NA),
    CTR6 = c(NA, NA, NA, NA, NA))
chinsoon12
  • 25,005
  • 4
  • 25
  • 35