I'm new here and I'm analyzing some data. While inspecting it, I found a problem in the strings of one column: as you can see, some strings contain a duplicated prefix (WT_d8_r2_ appears twice). I'd like to remove only the duplicated part. Could you suggest a way to do it? There are about 30,000 rows, and only the ones with WT_d8_r2 have this error. Thank you.

KO_d6_r1_AAACATGCACCTAATG-1               7
KO_d6_r1_AAACATGCAGGAATCG-1               8
KO_d6_r1_AAACATGCAGGATAAC-1              18
KO_d6_r1_AAACCAACAATATAGG-1              22
KO_d6_r1_AAACCGAAGCGAGTAA-1               8   
WT_d8_r2_WT_d8_r2_AGGCTAAAGTCAATCA-1     20
WT_d8_r2_WT_d8_r2_AGGGCTACAATGAATG-1      3
WT_d8_r2_WT_d8_r2_AGGGCTACACACTAAT-1      3
WT_d8_r2_WT_d8_r2_AGGGCTACAGCTTACA-1     18
WT_d8_r2_WT_d8_r2_AGGGCTACATAGCTGC-1      9
WT_d8_r2_WT_d8_r2_AGGGTTGCAAAGCTCC-1     19
WT_d8_r2_WT_d8_r2_AGGGTTGCAACCCTAA-1      4
WT_d8_r2_WT_d8_r2_AGGGTTGCAGCTCAAC-1      2

I'm expecting this:

KO_d6_r1_AAACATGCACCTAATG-1               7
KO_d6_r1_AAACATGCAGGAATCG-1               8
KO_d6_r1_AAACATGCAGGATAAC-1              18
KO_d6_r1_AAACCAACAATATAGG-1              22
KO_d6_r1_AAACCGAAGCGAGTAA-1               8   
WT_d8_r2_AGGCTAAAGTCAATCA-1              20
WT_d8_r2_AGGGCTACAATGAATG-1               3
WT_d8_r2_AGGGCTACACACTAAT-1               3
WT_d8_r2_AGGGCTACAGCTTACA-1              18
WT_d8_r2_AGGGCTACATAGCTGC-1               9
WT_d8_r2_AGGGTTGCAAAGCTCC-1              19
WT_d8_r2_AGGGTTGCAACCCTAA-1               4
WT_d8_r2_AGGGTTGCAGCTCAAC-1               2
yami
  • As far as I can see, all the strings here are unique; there are no duplicated strings. Also, it would be easier to help you if you provided a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Mohan Govindasamy Jul 14 '22 at 09:42

1 Answer

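With stringi::stri_split and duplicated: split each string on "_", keep only the first occurrence of each piece, and paste the pieces back together: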

data <- read.table(text='KO_d6_r1_AAACATGCACCTAATG-1               7
KO_d6_r1_AAACATGCAGGAATCG-1               8
KO_d6_r1_AAACATGCAGGATAAC-1              18
KO_d6_r1_AAACCAACAATATAGG-1              22
KO_d6_r1_AAACCGAAGCGAGTAA-1               8   
WT_d8_r2_WT_d8_r2_AGGCTAAAGTCAATCA-1     20
WT_d8_r2_WT_d8_r2_AGGGCTACAATGAATG-1      3
WT_d8_r2_WT_d8_r2_AGGGCTACACACTAAT-1      3
WT_d8_r2_WT_d8_r2_AGGGCTACAGCTTACA-1     18
WT_d8_r2_WT_d8_r2_AGGGCTACATAGCTGC-1      9
WT_d8_r2_WT_d8_r2_AGGGTTGCAAAGCTCC-1     19
WT_d8_r2_WT_d8_r2_AGGGTTGCAACCCTAA-1      4
WT_d8_r2_WT_d8_r2_AGGGTTGCAGCTCAAC-1      2')

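# vapply() returns a character vector; lapply() would turn V1 into a list column
data$V1 <- vapply(stringi::stri_split(str = data$V1, fixed = "_"),
                  function(x) paste0(x[!duplicated(x)], collapse = "_"),
                  character(1))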
data
#>                             V1 V2
#> 1  KO_d6_r1_AAACATGCACCTAATG-1  7
#> 2  KO_d6_r1_AAACATGCAGGAATCG-1  8
#> 3  KO_d6_r1_AAACATGCAGGATAAC-1 18
#> 4  KO_d6_r1_AAACCAACAATATAGG-1 22
#> 5  KO_d6_r1_AAACCGAAGCGAGTAA-1  8
#> 6  WT_d8_r2_AGGCTAAAGTCAATCA-1 20
#> 7  WT_d8_r2_AGGGCTACAATGAATG-1  3
#> 8  WT_d8_r2_AGGGCTACACACTAAT-1  3
#> 9  WT_d8_r2_AGGGCTACAGCTTACA-1 18
#> 10 WT_d8_r2_AGGGCTACATAGCTGC-1  9
#> 11 WT_d8_r2_AGGGTTGCAAAGCTCC-1 19
#> 12 WT_d8_r2_AGGGTTGCAACCCTAA-1  4
#> 13 WT_d8_r2_AGGGTTGCAGCTCAAC-1  2
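
A possible base R alternative, sketched here as an assumption rather than as part of the original answer: if the duplication is always an immediately repeated prefix (as with WT_d8_r2_WT_d8_r2_), a regular expression with a backreference can remove it. Unlike the duplicated() approach, this leaves alone any pieces that happen to repeat elsewhere in the string. The vector ids below is a hypothetical sample that mirrors the IDs in the question.

```r
# Hypothetical sample mirroring the IDs in the question
ids <- c("KO_d6_r1_AAACATGCACCTAATG-1",
         "WT_d8_r2_WT_d8_r2_AGGCTAAAGTCAATCA-1")

# "^(.+_)\\1" matches a leading chunk ending in "_" that is immediately
# followed by an identical copy of itself; the replacement keeps one copy.
cleaned <- sub("^(.+_)\\1", "\\1", ids, perl = TRUE)
cleaned
```

On a data frame like the one above, the same call could be applied as data$V1 <- sub("^(.+_)\\1", "\\1", data$V1, perl = TRUE).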
Waldi