2

I have a big dataset df (354903 rows) with two columns named df$CompleteName and df$CompleteName.1

head(df)
       CompleteName       CompleteName.1
1   Lefebvre Arnaud Lefebvre Schuhl Anne
1.1 Lefebvre Arnaud              Abe Lyu
1.2 Lefebvre Arnaud              Abe Lyu
1.3 Lefebvre Arnaud       Louvet Nicolas
1.4 Lefebvre Arnaud   Muller Jean Michel
1.5 Lefebvre Arnaud  De Dinechin Florent

I am trying to create labels to see whether the name is the same or not. When I try a small subset it works [1 if they are the same, 0 if not]:

> match(df$CompleteName[1], df$CompleteName.1[1], nomatch = 0)
[1] 0
> match(df$CompleteName[1:10], df$CompleteName.1[1:10], nomatch = 0)
[1] 0 0 0 0 0 0 0 0 0 0

But as soon as I pass the complete columns, it gives me completely different values, which seem like nonsense to me:

> match(df$CompleteName, df$CompleteName.1, nomatch = 0)
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[23] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[45] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101

Should I use sapply? I could not figure it out; I tried this and got an error:

 sapply(df, function(x) match(x$CompleteName, x$CompleteName.1, nomatch = 0))

Please help!!!

Saul Garcia
  • 1
    you probably don't want match - it gives the positions of matching values in the second column, not whether they are equal. If you have strings, you could probably just use `as.numeric(df$CompleteName == df$CompleteName.1)` – jeremycg Apr 01 '16 at 00:14
  • also, use `stringsAsFactors = FALSE` in constructing your data.frame – jaimedash Apr 01 '16 at 00:17
  • @thelatemail as others have pointed out, `match` won't work here. my comment was meant to add to @jeremycg 's – jaimedash Apr 01 '16 at 00:25
  • 1
    There is also no evidence of these being factor columns anyway, is there? – Rich Scriven Apr 01 '16 at 00:27
  • @thelatemail: No, it doesn't work if there are other levels. See my answer and try to run it without stringsAsFactors = FALSE – andrechalom Apr 01 '16 at 00:27
  • @HaddE.Nuff `> default.stringsAsFactors() [1] TRUE` – jaimedash Apr 01 '16 at 00:30
  • 2
    @jaimedash - That's not what I meant. I mean that we have no idea whether this OP has factor columns or not. There is no evidence in the question that tells us whether they are factor or character. It's not a big deal though. This is one reason why `dput()` is preferred when posting data in a question. – Rich Scriven Apr 01 '16 at 00:33
  • Yes! Saul for future questions, try to follow these guidelines, which make answering easier http://stackoverflow.com/a/5963610/4598520 You could also edit your question to be reproducible – jaimedash Apr 01 '16 at 01:00

3 Answers

8

From the man page of match,

‘match’ returns a vector of the positions of (first) matches of its first argument in its second.

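So your output indicates that the first occurrence of "Lefebvre Arnaud" (the value looked up from the first argument) in the second column is at row 101. With only the first 10 rows, that name does not occur in the second argument at all, hence the zeros. I believe what you intended is a simple row-by-row comparison, which is just the equality operator ==.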
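To make the difference concrete, here is a minimal sketch with made-up vectors (`first` and `second` are hypothetical names, not the OP's data). `match` looks up each element of its first argument anywhere in the second and reports the position of the first hit, whereas `==` compares element i with element i:

```r
# Hypothetical toy vectors standing in for the two columns
first  <- c("A", "A", "B")
second <- c("C", "B", "A")

# match(): position in `second` of the first occurrence of each element of `first`
pos <- match(first, second, nomatch = 0)
pos
# [1] 3 3 2

# ==: element-wise comparison (row i against row i)
same <- first == second
same
# [1] FALSE FALSE FALSE
```

No row holds the same value in both vectors, yet `match` still returns non-zero positions. That is the same effect as the repeated 101 in the question.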

Some sample data:

> a <- rep ("Lefebvre Arnaud", 6)
> b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
> x <- data.frame(a, b, stringsAsFactors = FALSE)
> x
                a                   b
1 Lefebvre Arnaud             Abe Lyu
2 Lefebvre Arnaud             Abe Lyu
3 Lefebvre Arnaud     Lefebvre Arnaud
4 Lefebvre Arnaud De Dinechin Florent
5 Lefebvre Arnaud De Dinechin Florent
6 Lefebvre Arnaud De Dinechin Florent
> x$a == x$b
[1] FALSE FALSE  TRUE FALSE FALSE FALSE

EDIT: Also, you need to make sure that you are comparing apples to apples, so double-check the data type of your columns. Use str(df) to see whether the columns are strings or factors. You can either construct the data frame with stringsAsFactors = FALSE, or convert from factor to character. There are several ways to do that; see: Convert data.frame columns from factors to characters
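One possible way to do the conversion, as a rough sketch: the data frame `x` below is a hypothetical stand-in, and `stringsAsFactors = TRUE` is passed explicitly only so that the example produces factor columns on any R version.

```r
# Hypothetical data frame with factor columns
x <- data.frame(a = c("Lefebvre Arnaud", "Abe Lyu"),
                b = c("Abe Lyu", "Abe Lyu"),
                stringsAsFactors = TRUE)

# Inspect the column types
str(x)

# Convert every factor column to character, leaving other columns untouched
is_fac <- vapply(x, is.factor, logical(1))
x[is_fac] <- lapply(x[is_fac], as.character)

# The comparison is now between plain strings
x$a == x$b
# [1] FALSE  TRUE
```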

andrechalom
4

As others have pointed out, match isn't right here. What you want is equality, which you can test with ==; this gives you TRUE/FALSE. Wrapping the result in as.numeric then gives you the desired 1/0, and which gives you the indices of the matching rows.
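A small sketch of those three steps on made-up character vectors (`a` and `b` here are hypothetical, not the OP's columns):

```r
# Hypothetical character vectors
a <- c("x", "y", "z")
b <- c("x", "q", "z")

eq <- a == b              # element-wise test: TRUE FALSE TRUE
labels <- as.numeric(eq)  # 1/0 labels: 1 0 1
idx <- which(eq)          # indices of equal rows: 1 3
```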

But you may still have an issue with factors!

 # making up some similar data( adapted from earlier answer)
 a <- rep ("Lefebvre Arnaud", 6)
 b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
 df <- data.frame(CompleteName = a, CompleteName.1 = b)  # strings became factors by default before R 4.0.0
 which(df$CompleteName == df$CompleteName.1)
 # Error in Ops.factor(df$CompleteName, df$CompleteName.1) :
 #   level sets of factors are different

 str(df)
 # 'data.frame':    6 obs. of  2 variables:
 # $ CompleteName  : Factor w/ 1 level "Lefebvre Arnaud": 1 1 1 1 1 1
 # $ CompleteName.1: Factor w/ 3 levels "Abe Lyu","De Dinechin Florent",..: 1 1 3 2 2 2

stringsAsFactors

Above, the data.frame wasn't constructed with stringsAsFactors = FALSE, which caused an error. Unfortunately, out of the box, R (before version 4.0.0) coerces strings to factors when loading a CSV or creating a data.frame. This can be fixed when creating the data.frame by explicitly specifying stringsAsFactors = FALSE:

df <- data.frame(CompleteName = a, CompleteName.1 = b, stringsAsFactors = FALSE)
df[which(df$CompleteName == df$CompleteName.1), ]
##     CompleteName CompleteName.1
## 3 Lefebvre Arnaud Lefebvre Arnaud

To avoid the issue in the future, run options(stringsAsFactors = FALSE) at the beginning of your R session (or put it at the top of your .R script). More discussion here:

jaimedash
  • 2
    BEWARE that using non-standard "options" in your code may make it harder for you to write collaborative code! Whenever you send your script to other people, the different options may cause the script to break. – andrechalom Apr 01 '16 at 00:37
  • +1 when collaborating on code it's better to put it at the beginning of a script file than in the .Rprofile for this reason – jaimedash Apr 01 '16 at 00:39
  • @HaddE.Nuff good point. I just swapped the df to the one the other answer – jaimedash Apr 01 '16 at 00:57
3

Here's a solution using a data.table with performance comparison to the data.frame solution based on an identical number of records as in your case.

col1 = sample(x = letters, size = 354903, replace = TRUE)
col2 = sample(x = letters, size = 354903, replace = TRUE)

library(data.table)
dt = data.table(col1 = col1, col2 = col2)
df = data.frame(col1 = col1, col2 = col2)

# comparing the 2 columns
system.time(dt$col1==dt$col2)
system.time(df$col1==df$col2)

# storing the comparison in the table/frame itself
system.time(dt[, col3:= (col1==col2)])
system.time({df$col3 = (df$col1 == df$col2)})

The data.table approach offers a significant speedup on my machine: from 0.020s to 0.008s.

Try it for yourself and see. I know this is not really significant with such a small number of rows, but multiply that by 1000 and you'll see a major difference!

Matt Weller
  • 3
    That's an interesting take on the problem, but it uses far more complex code than necessary. This is not a question about performance, and premature optimization is the root of all evil. – andrechalom Apr 01 '16 at 00:33
  • 2
    I don't believe it's far more complex than necessary; typing `data.table` instead of `data.frame` is actually less typing and takes advantage of a considerably more efficient storage mechanism. I'm glad I got into the practice of using this package at an early juncture. It's saved me a considerable amount of time, and I'd encourage new R users to do likewise so that when they really need it they have the tools in place. – Matt Weller Apr 01 '16 at 00:41
  • 1
    The problem I see in your answer is having to learn the `data.table` syntax. `col3:= ` is absolutely meaningless for someone who doesn't understand `data.table` syntax. It's a very elegant line, and far more efficient than some "pure R" code, but using it or not is a choice nonetheless. – andrechalom Apr 01 '16 at 00:48
  • 1
    I appreciate this answer, and I will look deeper into this `data.table` method. It will definitely help me if I can reduce the operation time; normally I was dealing with a 2-million-row dataset, and this was a combination of a subset of 800 records. Thank you for the insight!!! – Saul Garcia Apr 01 '16 at 08:51