1

I have looked at other questions that seem to be having similar problems with random rows being replaced by all NA values, but have not found a solution because the other users had NAs already existing in their data frames (like Subsetting R data frame results in mysterious NA rows).

I used na.omit to remove any possible NAs first, but at the last step NA rows are still being produced.

I am subsetting data using three columns - an ID column which is a mix of numbers and letters (11xx1234), a binary categorical column (0 or 1), and a value column that is distance in meters. Any ID that appears more than once is assigned a 1 in the binary category column. I am trying to pull out data that is a 1 in the binary category column, but don't want the ID represented more than once so that the distance associated with each ID is not counted more than once and skews any statistical test.

So something like:

x<-data.frame(ObjectID = c("11AD1234", "11AD1234", "11AB123", "11BA34", "11DA354", "11DA354"), 
component = c(1,1,0,0,1,1), 
distance = c(2,2,5,8,4,4))

Which gives:

  ObjectID component distance
1 11AD1234         1        2
2 11AD1234         1        2
3  11AB123         0        5
4   11BA34         0        8
5  11DA354         1        4
6  11DA354         1        4

Here is the code I am trying to use which works great until the distance column is added.

x[unique(x[x$component==1,]$ObjectID),]$distance
[1] 2 8

The correct answer should be 2 and 4, so what am I doing that is messing this up? And also somehow replacing a bunch of rows with NAs, (though it isn't represented in this example)? The real data is confidential and cannot be shared, sorry!

Anne-Marie
  • 13
  • 2
  • x[unique(x[x$component==1,]$ObjectID),]$distance This code gives me NAs... not what you're getting. – Pss Dec 13 '20 at 21:53

2 Answers2

3

We can specify the row with the logical expression, subset the columns with the column names as strings, get the unique and extract the distance

unique(x[x$component ==1, c("ObjectID", "distance")])$distance
#[1] 2 4

If the intention is only to get the 'distance' based on the 'unique' values of 'ObjectID', we can use duplicated

with(subset(x, component == 1, select = c(ObjectID, distance)), 
        distance[!duplicated(ObjectID)])
#[1] 2 4

Or more compactly, join two conditions with &

subset(x, !duplicated(ObjectID) & component == 1)$distance
#[1] 2 4

The issue in OP's code is using the unique value of 'ObjectID' as row index, which fails as the index can be either logical or numeric index

unique(x[x$component==1,]$ObjectID)
#[1] "11AD1234" "11DA354" 

If we have to convert this to logical, we can use %in%

akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you for the quick response! Both code options seem to work. I am still a bit confused as to why 'ObjectID' is not able to be used as the index? – Anne-Marie Dec 13 '20 at 22:11
  • @Anne-Marie If the data.frame have row.names as the ObjectID, it can be used. But, here, it is not the case and data.frame cannot have duplicate row.names as well. The row indexing is either numeric or logical – akrun Dec 13 '20 at 22:12
2

Its hard to comment on how you can go about removing the NAs from your data. A couple of choices that you can avail.

From your data, consider this

x<-data.frame(ObjectID = c("11AD1234", "11AD1234", "11AB123", "11BA34", "11DA354", "11DA354","11DA354", "11DA354"), 
              component = c(1,1,0,0,1,1, NA, NA), 
              distance = c(2,2,5,8,NA,NA,4,4))

x <- x[!is.na(x)]

Another option is to use libraries to do that for you.

library(tidyverse)
x %>% na.omit() 

Pss
  • 553
  • 4
  • 12