I have a large data set from which I'm trying to remove duplicate rows based on several columns. The column headings and a sample entry are:
count, freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart
5036 0.0599 TGCAGTGCTAGAG CSARDPDR TRBV20-1 TRBD1 TRBJ1-5 15 17 43 21
There are several thousand rows, and for two rows to match, all the values except "count" and "freq" must be the same. I want to remove the repeated entries, but first I need to replace the "count" of the row that is kept with the sum of the "count" values of all its duplicates, so that it reflects the true abundance. Then I need to recalculate "freq" for each new "count" as a fraction of the sum of all the counts in the entire table.
For some reason, the script is not changing anything, and I know for a fact that the table has repeated entries.
Here's my script.
library(dplyr)
# Input sample replicate table.
dta <- read.table("/data/Sample/ci1371.txt", header=TRUE, sep="\t")
# Combine rows with identical data and recalculate the frequency values.
dta %>% mutate(total = sum(count)) %>%
  group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
  summarize(count_new = sum(count), freq = count_new/mean(total))
dta_clean <- dta
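For reference, here is a minimal sketch of the result being aimed for, run on made-up toy data rather than the real file. One likely reason nothing changes is that the result of the pipeline is never assigned: dplyr verbs return a new table and leave dta untouched, so dta_clean ends up as a plain copy of the original. The toy values below are invented for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data (invented for illustration): rows 1 and 2 are identical
# in every column except count and freq.
dta <- data.frame(
  count  = c(5, 3, 2),
  freq   = c(0.5, 0.3, 0.2),
  cdr3nt = c("TGCAGT", "TGCAGT", "TGCGCC"),
  cdr3aa = c("CS", "CS", "CA"),
  v      = c("TRBV20-1", "TRBV20-1", "TRBV5-1"),
  d      = c("TRBD1", "TRBD1", "TRBD2"),
  j      = c("TRBJ1-5", "TRBJ1-5", "TRBJ2-1"),
  VEnd   = c(15, 15, 12),
  DStart = c(17, 17, 14),
  DEnd   = c(43, 43, 30),
  JStart = c(21, 21, 19),
  stringsAsFactors = FALSE
)

# Assign the pipeline result instead of letting it print and vanish.
dta_clean <- dta %>%
  # One group per distinct combination of the identifying columns.
  group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
  # Collapse each group to a single row holding the summed count.
  summarize(count = sum(count), .groups = "drop") %>%
  # Recompute frequency against the grand total of all counts.
  mutate(freq = count / sum(count)) %>%
  # Restore the original column order (count and freq first).
  select(count, freq, everything())

print(dta_clean)
```

Because the grand total is unchanged by merging duplicates, computing freq after the summarize step gives the same values as dividing by the pre-computed total, and the recalculated frequencies sum to 1.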
Any help is greatly appreciated. Here's a screenshot of what the data table looks like.