I have a large data frame (ncol = 220). I want to compare its columns to see whether any of them are identical, and produce a matrix that makes the matches easy to spot.

So what I have is

      x    y   z
1   dog   dog   cat    
2   dog   dog   dog
3   cat   cat   cat

What I want

    x      y      z
x   -      True   False
y   True   -      False
z   False  False  -

Is there a way to do this using identical() in R?

AudileF

2 Answers

To complement @Cath's comment about stringdist, it is as easy as:

library(stringdist)

stringdistmatrix(df, df) == 0

#      [,1]  [,2]  [,3]
#[1,]  TRUE  TRUE FALSE
#[2,]  TRUE  TRUE FALSE
#[3,] FALSE FALSE  TRUE
Sotos

This is probably not very efficient, but you can try:

seq_col <- seq_len(ncol(df))
sapply(seq_col, function(i) sapply(seq_col, function(j) identical(df[, i], df[, j])))
#       [,1]  [,2]  [,3]
# [1,]  TRUE  TRUE FALSE
# [2,]  TRUE  TRUE FALSE
# [3,] FALSE FALSE  TRUE

It gives you what you want (except for the diagonal, which is all TRUE here), but there must be a package with a function that builds a distance matrix from character vectors. Maybe something in stringdist?
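If you want the row and column labels and a blanked-out diagonal as in the question, here is a minimal sketch along the same lines. It assumes the toy data from the question rebuilt as `df`, swaps `sapply` for `vapply` (so the result type is checked), and uses `NA` on the diagonal instead of "-" so that the matrix stays logical. The names `cols` and `same` are just illustrative.

```r
# Toy data from the question (assumed to be character columns x, y, z)
df <- data.frame(
  x = c("dog", "dog", "cat"),
  y = c("dog", "dog", "cat"),
  z = c("cat", "dog", "cat"),
  stringsAsFactors = FALSE
)

cols <- names(df)

# Compare every pair of columns; each inner call returns one column of the result
same <- vapply(
  cols,
  function(i) vapply(cols, function(j) identical(df[[i]], df[[j]]), logical(1)),
  logical(length(cols))
)

# Label rows and columns explicitly with the column names
dimnames(same) <- list(cols, cols)

# A column is trivially identical to itself, so mark the diagonal as missing
diag(same) <- NA

same
```

With 220 columns this still makes 220 x 220 calls to `identical()`, so it is no faster than the version above, only easier to read off.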

Cath
  • Thanks Cath. My original data is binary, so this is great. I'll give it a go. – AudileF Jul 07 '17 at 12:18
  • Hi @Cath, I get the following error for the sapply command: `Error: unexpected 'function' in "sapply(seq_col function"` – AudileF Jul 07 '17 at 12:28
  • Thanks Cath, works a treat. Just wondering if you could point me in a direction to learn more about `function()` and `sapply()`. I've seen them pop up a few times in useful commands. – AudileF Jul 07 '17 at 12:36
  • @AudileF y/w :-) `sapply` lets you work on each element of an object (list or vector) and apply "some function" to those elements. You can either define the function you need or use a predefined one, and that function is then applied to each element. See also https://stackoverflow.com/q/3505701/4137985 – Cath Jul 07 '17 at 12:38
  • @AudileF For a nice overview of the `*apply`-family and related functions, see: [*R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate*](https://stackoverflow.com/q/3505701/2204410) – Jaap Jul 07 '17 at 12:40
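
For anyone else wondering about the `sapply()` and `function()` pattern discussed in these comments, here is a small sketch in base R. The variable names `squares` and `products` are made up for illustration.

```r
# sapply() calls the anonymous function once per element of 1:3
# and simplifies the three results into a vector
squares <- sapply(1:3, function(i) i^2)

# Nesting two sapply() calls gives a matrix: the inner call builds one
# column for each value of i, and the outer call binds those columns together
products <- sapply(1:3, function(i) sapply(1:3, function(j) i * j))

squares
products
```

The nested form has the same shape as the column comparison in the answer, with `identical(df[, i], df[, j])` in place of `i * j`.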