Removing duplicates only for certain rows

Question

I have a dataframe that looks like this:

*VarName1* - *VarValue1*
*VarName2* - *VarValue2*
*Etc.*

In practice it looks something like this:

nmlVar     - noFloat

Date-Batch - 2011020147
Weight     - 10
Length     - 5 
Height     - 8
Date-Batch - 2011020148
Weight     - 10.3
Length     - 6 
Height     - 8
Date-Batch - 2011020147
Weight     - 10
Length     - 5 
Height     - 8

I am organising the data so that I can use it for analysis. I already found out how to transpose the rows into columns in this post: Transposing rows into columns, then split them

I used this code to transpose:

library(dplyr)
library(tidyr)
DFP %>% 
  mutate(sample = cumsum(nmlVar == 'Batch')) %>% 
  spread(nmlVar, noFloat)

I want to do the same, but use the "Date-Batch" variable as the key variable in the code above. I need this because it is the key used in another dataframe, and I want to merge the two.

The problem is that this Date-Batch variable does not always have unique values (compare the first and third occurrences). I am trying to find a function that deletes every second occurrence of the same Date-Batch value.

I tried to describe it in 'programming words':

FOR Date-Batch IN nmlVar IF duplicate DELETE second occurrence

I don't know if this is the best way to do this, or perhaps you can suggest another approach.
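
The pseudocode above could be sketched in R roughly as follows, removing the repeated record while the data is still in long format. This is only a sketch under assumptions that are not confirmed by the question: every record starts with a `Date-Batch` row, and the helper columns `sample` and `batch` as well as the result name `deduped` are made up for illustration.

```r
library(dplyr)

# Toy data in the same long format as the question
DFP <- data.frame(
  nmlVar  = rep(c("Date-Batch", "Weight", "Length", "Height"), times = 3),
  noFloat = c(2011020147, 10, 5, 8,
              2011020148, 10.3, 6, 8,
              2011020147, 10, 5, 8),
  stringsAsFactors = FALSE
)

deduped <- DFP %>%
  # Assumption: each "Date-Batch" row starts a new record; number the records
  mutate(sample = cumsum(nmlVar == "Date-Batch")) %>%
  # Attach the record's Date-Batch value to every row of that record
  group_by(sample) %>%
  mutate(batch = noFloat[nmlVar == "Date-Batch"][1]) %>%
  ungroup() %>%
  # For each batch value, keep only the rows of the first record seen
  group_by(batch) %>%
  filter(sample == min(sample)) %>%
  ungroup() %>%
  select(nmlVar, noFloat)

nrow(deduped)  # 8 rows: two records of four rows each
```

The result is still in long format, so it can be transposed afterwards in the same way as before.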

Should the second batch with a duplicate date be deleted whatever its content? – moodymudskipper Oct 23 '17 at 07:18
@Moody_Mudskipper Yes, I have enough rows of data to ignore a few duplicates for the analysis. – Kevin Oct 23 '17 at 07:20

score 0 · Answer 1 · answered Oct 23 '17 at 07:22

Depending on what you call a duplicate here:

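library(dplyr)
library(tidyr)

# Variant 1: drop rows that are identical in every column
DFP %>% 
  # number the records: each 'Date-Batch' row starts a new one
  mutate(sample = cumsum(nmlVar == 'Date-Batch')) %>% 
  spread(nmlVar, noFloat) %>%
  select(-sample) %>%
  filter(!duplicated(.))

# Variant 2: drop rows whose `Date-Batch` value has already appeared
DFP %>% 
  mutate(sample = cumsum(nmlVar == 'Date-Batch')) %>% 
  spread(nmlVar, noFloat) %>%
  select(-sample) %>%
  filter(!duplicated(`Date-Batch`))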

Output for both in this case:

#   Date-Batch Height Length Weight
# 1 2011020147      8      5   10.0
# 2 2011020148      8      6   10.3
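
The two filters only give different results when a `Date-Batch` value repeats with different measurements. A small sketch of that case, using base R indexing instead of `filter()` and a made-up third row (the weight of 11 is invented for illustration and is not from the question's data):

```r
# Wide data where the third row repeats a batch but has a different weight
wide <- data.frame(
  `Date-Batch` = c(2011020147, 2011020148, 2011020147),
  Height       = c(8, 8, 8),
  Length       = c(5, 6, 5),
  Weight       = c(10, 10.3, 11),  # hypothetical value for the repeated batch
  check.names  = FALSE             # keep the non-syntactic name `Date-Batch`
)

# Whole-row comparison: the third row is not an exact copy, so it stays
full_row <- wide[!duplicated(wide), ]

# Key-only comparison: the third row repeats a Date-Batch value, so it goes
by_key <- wide[!duplicated(wide$`Date-Batch`), ]

nrow(full_row)  # 3
nrow(by_key)    # 2
```

If `Date-Batch` has to be unique for the merge, the key-only comparison is probably the one to use.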

data

DFP <- read.table(text="nmlVar      noFloat
Date-Batch  2011020147
Weight      10
Length      5 
Height      8
Date-Batch  2011020148
Weight      10.3
Length      6 
Height      8
Date-Batch  2011020147
Weight      10
Length      5 
Height      8", header = TRUE)
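
In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`. A rough equivalent of the key-based variant might look like the sketch below; it rebuilds the sample data with `data.frame()` so it runs on its own, and the result name `wide` is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Same sample data as above, built directly as a data frame
DFP <- data.frame(
  nmlVar  = rep(c("Date-Batch", "Weight", "Length", "Height"), times = 3),
  noFloat = c(2011020147, 10, 5, 8,
              2011020148, 10.3, 6, 8,
              2011020147, 10, 5, 8),
  stringsAsFactors = FALSE
)

wide <- DFP %>%
  # number the records so every output row has a unique identifier
  mutate(sample = cumsum(nmlVar == "Date-Batch")) %>%
  pivot_wider(names_from = nmlVar, values_from = noFloat) %>%
  select(-sample) %>%
  # keep the first row for each Date-Batch value, with all its columns
  distinct(`Date-Batch`, .keep_all = TRUE)

nrow(wide)  # 2
```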

moodymudskipper


Thanks for your effort. It still produces the same error as before: "Duplicate identifier for rows x,y,z". Could it be that it first tries to transpose the rows and then looks for the duplicate rows, instead of vice versa? – Kevin Oct 23 '17 at 07:34
I didn't get an error, but you had a typo in your question, with `Date-Batch` spelt as `Batch` in your code. Could that be the issue? – moodymudskipper Oct 23 '17 at 07:53
If not, please give us real reproducible data using `dput` so we're sure we're working on the same thing. – moodymudskipper Oct 23 '17 at 07:54
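
For what it's worth, the typo mentioned in the comments could plausibly explain a duplicate-identifier error, although that is not confirmed in the thread. Comparing against `'Batch'` never matches, so the running count stays at zero and every record gets the same identifier. A minimal base R sketch (the names `wrong_ids` and `right_ids` are made up for illustration):

```r
# The nmlVar column from the sample data: three records of four rows each
nmlVar <- rep(c("Date-Batch", "Weight", "Length", "Height"), times = 3)

# With the typo: no row equals "Batch", so all records share identifier 0
wrong_ids <- cumsum(nmlVar == "Batch")
wrong_ids  # 0 0 0 0 0 0 0 0 0 0 0 0

# With the full name: the counter increases at each record start
right_ids <- cumsum(nmlVar == "Date-Batch")
right_ids  # 1 1 1 1 2 2 2 2 3 3 3 3
```

With identical identifiers, the transpose step cannot tell the three `Weight` rows apart, which would be consistent with the reported error.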
