So I have a dataframe that I imported from a CSV. I already subsetted it by one of the columns. I found some duplicate entries in the table and tried to eliminate them by dropping those rows and reassigning the result to the dataframe using df[-c(##,##),] (much like the answer to this question). Here's what my code looks like.

Where I import the file and subset it:

df1 <- read.csv("file.csv",
                header = TRUE,
                sep = ",",
                stringsAsFactors = FALSE,
                fileEncoding = "UTF-8-BOM")

> df1
     semester course.name lect.num professor.gender
1 2015 Spring   Bio1A        101                 M
2 2014 Spring   Bio1B        103                 M
3 2015 Spring   Bio1A        102                 F
4 2015 Spring   Bio1A        102                 M
5 2014 Spring   Bio1B        101                 M
6 2014 Spring   Bio1B        102                 F 
7 2014 Spring   Bio1A        101                 F

df2 <- df1[ df1$course.name == "Bio1A", ]

> df2[duplicated(df2[, c(1,3)], incomparables = FALSE) |
    duplicated(df2[, c(1,3)], incomparables = FALSE, fromLast = TRUE), c(1:4)]
     semester course.name lect.num professor.gender
3 2015 Spring   Bio1A        102                 F
4 2015 Spring   Bio1A        102                 M

> dim(df2)
[1] 3 4

So now I want to remove the row with the index of 3, so I do the following:

df2 <- df2[ -c(3), ]

In theory this should work, but when I check for duplicates again the row is still there, even though the dimensions have changed.

> df2[duplicated(df2[, c(1,3)], incomparables = FALSE) |
    duplicated(df2[, c(1,3)], incomparables = FALSE, fromLast = TRUE), c(1:4)]
     semester course.name lect.num professor.gender
3 2015 Spring   Bio1A        102                 F
4 2015 Spring   Bio1A        102                 M

> dim(df2)
[1] 2 4

I can't inspect the dataframe by printing it because it actually has over 500 rows, but when I open it with View(df2), I can still see the rows I thought I had eliminated. Does anyone have an explanation for what could be happening? Could there be a bug in RStudio? Am I doing something wrong? Any advice would be appreciated!

It's also worth mentioning that

df2 <- df2[ -c(3), , drop = FALSE]

doesn't affect the outcomes above.


EDIT: I didn't realize that R keeps the original row names when it subsets instead of renumbering the rows. The row labelled 3 in df1 sits at position 2 in df2, so df2[ -c(3), ] removed a different row from the one I intended. Thanks for the help!
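
For anyone who runs into the same thing, here is a minimal sketch of the row-name versus position mismatch. It assumes the seven-row table printed above stands in for the real data, so it is rebuilt inline rather than read from file.csv; the names `by_position` and `by_name` are only illustrative and are not part of my original code.

```r
# Rebuild the example table from the question (assumption: the seven rows
# printed above are representative of the real CSV).
df1 <- data.frame(
  semester         = c("2015 Spring", "2014 Spring", "2015 Spring", "2015 Spring",
                       "2014 Spring", "2014 Spring", "2014 Spring"),
  course.name      = c("Bio1A", "Bio1B", "Bio1A", "Bio1A", "Bio1B", "Bio1B", "Bio1A"),
  lect.num         = c(101, 103, 102, 102, 101, 102, 101),
  professor.gender = c("M", "M", "F", "M", "M", "F", "F"),
  stringsAsFactors = FALSE
)

df2 <- df1[df1$course.name == "Bio1A", ]

# Row names are carried over from df1; positions restart at 1.
rownames(df2)            # "1" "3" "4" "7"

# A negative numeric index drops by POSITION: this removes the third row,
# which is the one labelled "4", not the one labelled "3".
by_position <- df2[-3, ]
rownames(by_position)    # "1" "3" "7"

# To drop the row labelled "3", match on the row name instead.
by_name <- df2[rownames(df2) != "3", ]
rownames(by_name)        # "1" "4" "7"

# Alternatively, reset the row names so labels and positions agree again.
rownames(df2) <- NULL
rownames(df2)            # "1" "2" "3" "4"
```

Resetting the row names right after subsetting is probably the simplest option when the labels from the original dataframe are not needed afterwards; matching on `rownames()` is the safer choice when they are.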

  • I can't reproduce your problem, as I get `df2[duplicated(df2[, c(1,3)], incomparables = FALSE) | duplicated(df2[, c(1,3)], incomparables = FALSE, fromLast = T) ,c(1:4)]# [1] semester course.name lect.num professor.gender <0 rows> (or 0-length row.names)` in the last run – akrun Mar 04 '18 at 06:51
  • Note that `=` isn't `==` so `df2 <- df1[ course.name = "Bio1A", ]` doesn't work as you intend. In fact, it shouldn't work at all, which makes me suspect that the code that you show isn't your actual code. Please use your actual code, and make it a [mcve]. – John Coleman Mar 04 '18 at 12:11
  • I didn't actually copy and paste the code; I was rewriting it (in an effort to see whether I would catch the mistake along the way). I missed what I actually did when subsetting df1; I actually did `df2 <- df1[ df1$course.name == "Bio1A", ]`. But I managed to figure out the issue: I forgot that when you subset, the row names are the original indices from the original dataframe. My earlier example wouldn't have been much help, so I updated it to create an example that reflects the problem and solution –  Mar 04 '18 at 17:43
