Remove duplicates in R without converting to numeric

Question

I have 2 variables in a data frame with 300 observations.

$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..

I then tried to remove the duplicates (for example, "- " appears twice):

testclean <- data1[!duplicated(data1), ]

This gives me the warning message:

In Ops.factor(left) : "-" not meaningful for factors

I then converted it to a matrix:

data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]

This does the trick; however, it converts the user names to numeric values.

I am new, but I have tried looking at previous posts on this topic (including the one below), and they did not work out:

Convert data.frame columns from factors to characters

This looks like a problem that may be better solved when reading in the data. Are you able to some of the raw data? — user20650, Sep 30 '16 at 15:50
sorry, i missed a word in my comment above ;). Should of read *Are you able to **share** some of the raw data?* (say the first ten rows / five columns ). Also, can you show how you read in the data. cheers — user20650, Sep 30 '16 at 16:01
I suggest you improve your question by reading about [how to ask questions](http://stackoverflow.com/help/mcve) and about [reproducible questions](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You will get considerably more views (and possibly answers) if your question is structured in a way that facilitates us helping you. — r2evans, Sep 30 '16 at 16:06
*"I don't know the usual way of sharing data here"* ... **read the links**, they are provided for good reason. Please, it really helps. Providing links to data in the question should be avoided; providing links to data in comments is worse (easy to miss). Just give us a small and representative dataset within the question. (Sometimes it takes some effort to produce a small amount of data that triggers all of your problems; often in the course of doing this, you'll find something yourself.) — r2evans, Sep 30 '16 at 17:59
@r2evans thank you, I am learning now how to do this. I still need to learn lots of things, both with R and this forum. I will work on that :) — Henk101, Sep 30 '16 at 18:05

score 1 · Answer 1 · answered Sep 30 '16 at 17:24

Some sample data, from your image (please don't post images of data!):

data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
                    userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
                    stringsAsFactors = TRUE)  # explicit, since the default is FALSE from R 4.0.0 onward
str(data1)
# 'data.frame': 6 obs. of  2 variables:
#  $ imageLikeCount: num  3 27 4 4 16 103
#  $ userName      : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1

To fix the problem with factors as well as the embedded quotes:

data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of  2 variables:
#  $ imageLikeCount: num  3 27 4 4 16 103
#  $ userName      : chr  "testblabla" "test_00" "frenchfries" "frenchfries" ...

As @DanielWinkler suggested, if you can change how the data is read in or defined, you can pass stringsAsFactors = FALSE (this argument is accepted by many functions, including read.csv, read.table, and most data.frame functions such as as.data.frame and rbind):

data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
                    userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
                    stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of  2 variables:
#  $ imageLikeCount: num  3 27 4 4 16 103
#  $ userName      : chr  "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...

(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
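If the data comes from a file, the same idea can be applied at read time. Here is a minimal sketch, assuming a hypothetical CSV whose contents are inlined through the text argument (the real file and its layout are not shown in the question). quote = "" is used only so that the literal double quotes survive parsing and mimic the embedded quotes above; the names raw_csv and from_file are made up for illustration.

```r
# Hypothetical raw CSV text standing in for the real file (an assumption)
raw_csv <- 'imageLikeCount,userName
3,"testblabla"
27,test_00
4,frenchfries
4,frenchfries'

# stringsAsFactors = FALSE keeps userName as character;
# quote = "" disables quote handling so the literal quotes are kept as data
from_file <- read.csv(text = raw_csv, quote = "", stringsAsFactors = FALSE)

# Strip the embedded double quotes; fixed = TRUE treats '"' as a plain string
from_file$userName <- gsub('"', '', from_file$userName, fixed = TRUE)

# Drop fully duplicated rows
from_file_clean <- from_file[!duplicated(from_file), ]
str(from_file_clean)
```

With a real file you would pass its path as the first argument instead of text, and you would probably not need quote = "" unless the quotes really are part of the values.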

Now, we have data that looks like this:

data1
#   imageLikeCount       userName
# 1              3     testblabla
# 2             27        test_00
# 3              4    frenchfries
# 4              4    frenchfries
# 5             16       test.inc
# 6            103 parmezan_pizza

and removing the duplicates now works:

data1[! duplicated(data1), ]
#   imageLikeCount       userName
# 1              3     testblabla
# 2             27        test_00
# 3              4    frenchfries
# 5             16       test.inc
# 6            103 parmezan_pizza
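As for why the data.matrix() route in the question turned the names into numbers: data.matrix() replaces a factor column with its internal integer codes, so the labels are lost. A small sketch with a made-up three-row data frame (df, likes and user are illustrative names, not the OP's data):

```r
# Made-up data with an explicit factor column
df <- data.frame(likes = c(4, 4, 16),
                 user  = factor(c("frenchfries", "frenchfries", "test.inc")))

# data.matrix() stores the factor's integer codes, not its labels
m <- data.matrix(df)
m[, "user"]

# The same codes, obtained directly from the factor
as.integer(df$user)

# Deduplicating the data frame itself keeps the labels intact
kept <- df[!duplicated(df), ]
as.character(kept$user)
```

So deduplicating the data frame directly, rather than a matrix built from it, is what preserves the user names.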

If this satisfies your question, could you accept it (check mark to the left of the answer) and consider an up-vote? Stack Exchange etiquette directs closing a question with the best answer (can be changed in the future if needed), and if you find one or more answers particularly good, you can "up-vote" them. Both actions give gratitude, kudos, and measurable reputation points to the posters. — r2evans, Sep 30 '16 at 18:28

Daniel Winkler · Answer 2 · 2016-09-30T18:13:48.193

-1

Try

data$userName <- as.character(data$userName)

And then data <- unique(data)

You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.
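Put together, the steps might look like the sketch below. It assumes a small made-up data frame in place of the OP's data, and it adds a gsub() step to strip embedded double quotes, which the screenshot suggests are present.

```r
# Made-up stand-in for the OP's data, with userName stored as a factor
data <- data.frame(imageLikeCount = c(3, 4, 4),
                   userName = factor(c("\"testblabla\"", "frenchfries", "frenchfries")))

# Convert the factor to character and drop the literal double quotes
data$userName <- gsub('"', '', as.character(data$userName), fixed = TRUE)

# unique() on a data frame returns the data frame without duplicated rows
data <- unique(data)
str(data)
```

Note that unique(data) is used directly and not as a row index: it returns a data frame, so it cannot be used inside data[ , ].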

edited Sep 30 '16 at 18:13

answered Sep 30 '16 at 16:46


From a glance at their screenshot and `str`, it's unlikely to be this simple. For example, `dat <- data.frame(x=1:3, y=c("word", "\"word\"", "and another")); as.character(dat$y); length(unique(as.character(dat$y)))` – user20650 Sep 30 '16 at 16:51
This could probably be addressed with gsub, replacing '"' with an empty string. Something like `gsub('"', '', data$userName)` – Daniel Winkler Sep 30 '16 at 16:54
And that should probably have been your answer rather than what you posted. – IRTFM Sep 30 '16 at 17:12
I don't see how `data[unique(data),]` would work: `unique` will return a `data.frame`, not a vector of `integer` or `logical` needed for row-indexing. Similar to the suggestion to the OP, can you please **provide sample data** to support your code? – r2evans Sep 30 '16 at 17:17
