
I have a large dataset from which I'm attempting to remove repeated rows based on several columns. The column headings and a sample entry are:

 count  freq    cdr3nt         cdr3aa    v         d      j        VEnd  DStart  DEnd  JStart
 5036   0.0599  TGCAGTGCTAGAG  CSARDPDR  TRBV20-1  TRBD1  TRBJ1-5  15    17      43    21

There are several thousand rows, and for two rows to count as a match, all values except "count" and "freq" must be the same. I want to remove the repeated entries, but first I need to replace the "count" value of the row that remains with the sum of the individual repeated rows' counts, to reflect the true abundance. Then I need to recalculate "freq" as the new "count" divided by the sum of all counts in the entire table.

For some reason, the script is not changing anything, and I know for a fact that the table has repeated entries.

Here's my script.

library(dplyr)

# Input sample replicate table.
dta <- read.table("/data/Sample/ci1371.txt", header=TRUE, sep="\t")

# Combine rows with identical data and recalculate frequency values.
dta %>% mutate(total = sum(count)) %>%
  group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
  summarize(count_new = sum(count), freq = count_new/mean(total))

dta_clean <- dta

Any help is greatly appreciated. Here's a screenshot of what the data table looks like: screenshot

Lou_A
    Sample data would be useful, or would you prefer we generate something completely random (and perhaps not representative)? – r2evans Apr 28 '20 at 15:52
  • It is not helpful to include a read.table() of data we cannot access. Rather, provide some sample data, e.g. with dput(), that contains some duplicates so we can reproduce the problem – mnist Apr 28 '20 at 15:52
  • It's probably easier for me to attach a reduced copy of the dataset. Is there a way to do that on this board? – Lou_A Apr 28 '20 at 15:55
    As @mnist suggested, the preferred method is an unambiguous format that we can use easily, which is `dput`. Along with that, though, is that it is best to provide a *representative sample* of the data so that we don't need to scroll through pages of data, and you get a method to work on your real (larger) data. (Or you can adapt my play data below to fit your dataset.) – r2evans Apr 28 '20 at 15:57
  • Um, **no**. Even if I could know for certain how you're reading that spreadsheet of data into R, *I am not going to transcribe from an image* (https://xkcd.com/2116/). See these links for other suggestions for providing representative sample data: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Apr 28 '20 at 17:07

2 Answers


Preliminary step: convert to a data.table and store the names of the columns other than count and freq:

library(data.table)
setDT(df)
cols <- colnames(df)[3:ncol(df)]

(in your example, count and freq are in the first two positions)

To recompute count and freq:

df_agg <- df[, .(count = sum(count)), by = cols]
df_agg[, 'freq' := count/sum(count)]

If you want to keep unique rows by all columns except count and freq:

df_unique <- unique(df, by = cols)
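
To see the aggregation end to end, here is a minimal sketch on made-up data using just three of the grouping columns (the values are assumptions for illustration, not the asker's data); the first two rows differ only in count and freq, so they collapse into one:

```r
library(data.table)

# Toy table in the shape of the question's data (made-up values).
df <- data.table(
  count  = c(5036, 12, 7),
  freq   = c(0.60, 0.10, 0.30),
  cdr3nt = c("TGCAGT", "TGCAGT", "AAGGCT"),
  cdr3aa = c("CSAR",   "CSAR",   "CKGL"),
  v      = c("TRBV20-1", "TRBV20-1", "TRBV5-1")
)

# All columns except count and freq (positions 1 and 2) identify a group.
cols <- colnames(df)[3:ncol(df)]

# Sum counts within each group, then recompute freq over the new totals.
df_agg <- df[, .(count = sum(count)), by = cols]
df_agg[, freq := count / sum(count)]
```

Here df_agg has two rows, with counts 5048 and 7, and the recomputed freq column sums to 1.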
linog
  • Linog, df_agg seems to do the trick. However, the new count and freq columns are at the end of the data table; what's the best way to move them to the beginning of the table (left side) instead? – Lou_A Apr 28 '20 at 17:33
  • You can reorder columns with `setcolorder`. In your case, `setcolorder(df_agg, c('freq','count'))` might do the trick. Other columns are placed after those mentioned – linog Apr 28 '20 at 17:51
  • I'll give it a shot – Lou_A Apr 28 '20 at 17:52

Sample data, where grp1 and grp2 are intended to be all of your grouping variables.

set.seed(42)
dat <- data.frame(
  grp1 = sample(1:2, size=20, replace=TRUE),
  grp2 = sample(3:4, size=20, replace=TRUE),
  count = sample(100, size=20, replace=TRUE),
  freq = runif(20)
)
head(dat)
#   grp1 grp2 count      freq
# 1    2    4    38 0.6756073
# 2    2    3    44 0.9828172
# 3    1    4     4 0.7595443
# 4    2    4    98 0.5664884
# 5    2    3    44 0.8496897
# 6    2    4    96 0.1894739

Code:

library(dplyr)
dat %>%
  group_by(grp1, grp2) %>%
  summarize(count = sum(count)) %>%
  ungroup() %>%
  mutate(freq = count / sum(count))
# # A tibble: 4 x 4
#    grp1  grp2 count   freq
#   <int> <int> <int>  <dbl>
# 1     1     3    22 0.0206
# 2     1     4   208 0.195 
# 3     2     3   383 0.358 
# 4     2     4   456 0.427 
r2evans
    Since I've been working with dplyr I'll try this code first – Lou_A Apr 28 '20 at 16:14
  • For some reason, the repeats are still showing up. The resulting table is identical to the initial table. It seems the code is not reading my table. – Lou_A Apr 28 '20 at 16:19
  • I find it very unlikely that using all 9 columns (of that data) is going to give you any multi-row groups. Only you can differentiate between what identifies a *group* and what identifies data within that group. For instance, I'd think that `cdr3nt` and `cdr3aa` are *not* group identifiers, and I'd guess the same for your last four columns. So perhaps just `group_by(v:j)`? – r2evans Apr 28 '20 at 17:10
  • I'll explore this. – Lou_A Apr 28 '20 at 17:34