I have done my research and googling but have yet to find a solution to the following problem. I have often found solutions to R-related issues on this forum, so I thought I'd give it a try and hope that somebody can suggest something. I need it for my PhD thesis; anybody whose code or suggestions I use will naturally be acknowledged and credited.

So: I need to draw lines/segments to connect points in a plot (of multidimensional scaling, specifically) in R (SPSS-based solutions are welcome as well). The lines should not connect all points, only those that represent properties/variables shared by at least one data item, and their placement should be derived from the same data that the plot itself is based on. Let me exemplify; below are some fictional data with dummy variables, where '1' means that the item has the property:

       "properties"
        a   b   c
"items" ---------
tree  | 1   1   0
house | 0   1   1
hut   | 0   1   1
book  | 1   0   0

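[Figure: multidimensional scaling plot of the property points A, B and C, with manually drawn grey lines connecting A to B and B to C]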

The plot is a multidimensional scaling plot (distances are to be interpreted as dissimilarities). This is the logic:

  • there is a line between A and B, because there is at least one item ("tree") in the data that has both properties;
  • there is a line between B and C, because there are items in the data ("house" and "hut") that have both properties;
  • there is an item ("book") that has only one property (A), so it does not affect the placement of the lines;
  • importantly, there is no line between A and C, because there are no items in the data that have both properties.
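
The linking rule in the list above can be sketched in base R. This is only an illustration under assumptions: the matrix `m` re-enters the example table by hand, and the names `m`, `co` and `linked` are made up for this sketch.

```r
## Example data from the question: rows are items, columns are properties
## (the object names m, co and linked are hypothetical)
m <- matrix(c(1, 1, 0,
              0, 1, 1,
              0, 1, 1,
              1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("tree", "house", "hut", "book"),
                            c("a", "b", "c")))

## Co-occurrence counts: entry [i, j] is the number of items
## that have both property i and property j
co <- crossprod(m)

## Property pairs shared by at least one item; the upper triangle
## is used so that each pair is listed only once
linked <- which(co > 0 & upper.tri(co), arr.ind = TRUE)
linked
```

`linked` has one row per line to draw. For these data it lists the pairs a-b and b-c; a-c is absent because `co["a", "c"]` is 0.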

What I am looking for is a way to add the grey lines, which I have for now drawn manually on the plot above, automatically/computationally. The automatic drawing should be based on the data as described above. With a small data set, drawing the lines manually is no problem, but it becomes a problem when there are tens of such "properties" and hundreds of items/rows of data. Any ideas? Some R code (commented if possible) would be most welcome!

EDIT: It seems I forgot something very important. First thing, the solution proposed by @GaborCsardi below works perfectly with the example data, thanks for that! But I forgot to include that the linking of the points should also be "conservative", with as few connecting lines as possible. For example, if there is an item that has all the "properties", then it should not create lines between every single property point in the plot just because of that, if the points are connected by other items already, even if indirectly. So a plot based on the following data should not be a full triangle, even though item1 has all three properties:

      A B C
item1 1 1 1
item2 1 1 0
item3 0 1 1

Instead, A and B, and B and C, should each be connected by a line, but a line between A and C would be excessive, as they are already indirectly connected (through B). Could this be done with incidence graphs?
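
One way to read this requirement is as a spanning tree that prefers the most strongly shared pairs. The sketch below is an illustration under assumptions, not a definitive method: it uses function names from igraph 1.0 or later (`graph_from_adjacency_matrix()`, `mst()`), the object names are invented here, and weighting each edge by the inverse of its co-occurrence count is a choice made for this sketch so that the minimum spanning tree keeps the pairs shared by more items.

```r
library(igraph)

## Data from the edit above (object names m2, co2, g, tree are hypothetical)
m2 <- matrix(c(1, 1, 1,
               1, 1, 0,
               0, 1, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("item1", "item2", "item3"),
                             c("A", "B", "C")))

## Co-occurrence counts between properties: A-B = 2, B-C = 2, A-C = 1
co2 <- crossprod(m2)

## Weighted graph on the properties; diag = FALSE drops the self-counts
g <- graph_from_adjacency_matrix(co2, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

## Assumption of this sketch: use 1 / count as the edge "distance",
## so pairs shared by many items are cheap and are kept first
tree <- mst(g, weights = 1 / E(g)$weight)

as_edgelist(tree)
```

With these data the tree keeps A-B and B-C and drops A-C. As far as I understand igraph's documented defaults, calling `mst()` without `weights` on a graph that carries a `weight` edge attribute treats that attribute as a distance, so raw co-occurrence counts would favour the weakest link (A-C here) instead.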

  • About the edit: so you want a minimum spanning tree of the graph? (Please see Wikipedia for minimum spanning tree.) A particular spanning tree? Or any spanning tree? – Gabor Csardi Feb 03 '13 at 19:26
  • From what I've read now (thanks for the reference), it seems I would need a minimal spanning tree, to keep the total length/weight of the lines minimal (to exclude the excess lines). There was something about a Euclidean MST where the edge weights would be equal to the Euclidean distances, but that doesn't seem to be it, as the MDS plot is already a display of Euclidean distances that convey the dissimilarities between all points, not just the connected ones. (Coming from the humanities, this is quite new to me; I've used e.g. cluster analysis, but I never knew that it was based on MST) – user2037150 Feb 03 '13 at 23:33
  • So if you need an MST, then just call `minimum.spanning.tree()` on the projection. I have updated my answer. – Gabor Csardi Feb 06 '13 at 02:41
  • @GaborCsardi: Thanks a lot! I didn't realize it's as easy as that; I'm going to try to implement this as soon as I get to R. – user2037150 Feb 07 '13 at 10:25

1 Answer

This is very easy if you use graphs, and create the projection of the bipartite graph that you have in your table. E.g.

library(igraph)

## Some example data
mat <- "       properties
        items  a   b   c
        tree   1   1   0
        house  0   1   1
        hut    0   1   1
        book   1   0   0
       "
tab <- read.table(textConnection(mat), skip=1,
                  header=TRUE, row.names=1)

## Create a bipartite graph
graph <- graph.incidence(as.matrix(tab))

## Project the bipartite graph
proj <- bipartite.projection(graph)

## Plot one of the projections, the one you need 
## happens to be the second one
plot(proj$proj2)

## Minimum spanning tree of the projection
plot(minimum.spanning.tree(proj$proj2))

For more information see the manual pages, i.e. ?"igraph-package", ?graph.incidence, ?bipartite.projection and ?plot.igraph.
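
To put the links on an MDS plot rather than on igraph's own layout, one option is to take the edge list of the projection and draw it with `segments()`. This is a minimal sketch under assumptions: the MDS coordinates come from `cmdscale()` on plain Euclidean distances between the property columns, which may differ from the MDS used in the question; the igraph calls use the 1.0+ function names; and all object names are made up for the example.

```r
library(igraph)

## Example data: rows are items, columns are properties
m <- matrix(c(1, 1, 0,
              0, 1, 1,
              0, 1, 1,
              1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("tree", "house", "hut", "book"),
                            c("a", "b", "c")))

## Property-by-property graph built from the co-occurrence counts
g <- graph_from_adjacency_matrix(crossprod(m), mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

## Assumed MDS: classical scaling of distances between property columns
xy <- cmdscale(dist(t(m)), k = 2)

## Edge list as property names, used to look up the MDS coordinates
el <- as_edgelist(g)

plot(xy, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
segments(xy[el[, 1], 1], xy[el[, 1], 2],
         xy[el[, 2], 1], xy[el[, 2], 2], col = "grey")
text(xy, labels = rownames(xy))
```

Because the lines are drawn from the edge list, passing a pruned graph (for example a spanning tree of `g`) to `as_edgelist()` draws only the retained links while leaving the MDS coordinates untouched.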

Gabor Csardi
  • Maybe you need to coerce `tab` to a matrix: `graph <- graph.incidence(as.matrix(tab))` – agstudy Feb 03 '13 at 15:51
  • @agstudy: indeed, thanks, I was using the development version of igraph, and that allows data frames as well. Fixed it. – Gabor Csardi Feb 03 '13 at 16:06
  • You are welcome. Great work on the `igraph` package. I am a great fan, even though I haven't had the occasion to use it in a professional project. Do you know any pro applications using this package? – agstudy Feb 03 '13 at 16:11
  • I am not sure what you mean by pro applications; the GPL might be limiting for commercial software, I guess. But there are a bunch of research papers using it, and that was exactly the goal of creating it. – Gabor Csardi Feb 03 '13 at 16:45
  • Thanks for the helpful suggestions! (I didn't realize it's possible to edit the question after asking it; so I deleted the comment, and added the need for conservativeness to the question itself) – user2037150 Feb 03 '13 at 17:56