0

I have a phylip formatted text file of 300+ aligned COI sequences. I am trying to condense sequences into haplotypes for analysis using an R script written by a friend. The part I am having trouble with is where the program compares each sequence to the following sequence and determines if they differ by more than just N characters. It will run through the first few sequences before throwing the following error:

`Error in if (dif.nuc1[p] == "N" | dif.nuc2[p] == "N") { : missing value where TRUE/FALSE needed`

There are no gaps in the alignment so there shouldn't be any need to manage N/A data.

Any idea what the issue is and/or how to fix it? Alternatively, any recommendations for programs to consolidate haplotypes would also be appreciated.

Thank you in advance.

Below is the console output:

>       if (sum(dif.nuc1=='N')==0 & sum(dif.nuc2=='N')==0){       
+       } else if (length(dif.nuc1)!=0){
+         counter<- 0
+         for (p in 1:length(dif.nuc1)){       
+           cat('p is', p, '\n')
+           if (dif.nuc1[p]== 'N'| dif.nuc2[p]== 'N'){
+             counter<- (counter + 1)
+           }
+         }  
+         if (counter == length(dif.nuc1)){        
+           hap.equiv<- c(hap.equiv, paste('Hap_', m, ' == Hap_', n, '  ', sep=''))
+         }
+       }
Error in if (dif.nuc1[p] == "N" | dif.nuc2[p] == "N") { : 
missing value where TRUE/FALSE needed

I have tried including the modifying the code in the following ways to manage N/A data, but did not solve the issue.

if (dif.nuc1[p] == 'N' | dif.nuc2[p] == 'N' | is.na(dif.nuc1[p]) | is.na(dif.nuc2[p])) {
if (sum(dif.nuc1 %in% c('N', 'NA')) == 0 & sum(dif.nuc2 %in% c('N', 'NA')) == 0) {
          } else if (length(dif.nuc1)!=0){
            counter<- 0
            for (p in 1:length(dif.nuc1)){
              cat('p is', p, '\n')
              if (dif.nuc1[p]== 'N'| dif.nuc2[p]== 'N'){
                counter<- (counter + 1)

I have also double checked my data to ensure no ambiguity codes and no gaps that woudld cause NA data

KennyG
  • 9
  • 2
  • I can try to help, however, I would need it to include the relevant code and data. See: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Mark Jun 24 '23 at 06:31
  • @KennyG First of all, **what kind of _object_ is `dif.nuc*`?** Is it a single string (ie. a `character` vector of length `1`), and you want to iterate over its _characters_? Or is it a "column" of _n_ strings (ie. a `character` vector of length `n`), and you want to iterate over those _strings_? Also, please **note the difference between `|` and `||`**. While the `|` operates on logical _vectors_ in parallel, the `||` operates on logical _scalars_ (`TRUE`, `FALSE`, `NA`) and "short-circuits" when it encounters `TRUE`. – Greg Jun 24 '23 at 16:52
  • @Greg, dif.nuc* is a character vector that stores the differences between two haplotypes and I believe it is the length of the number of differing nucleotides that are subsequently stored in a downstream file. Thank you for pointing out the differences between the two characters, I will modify my code accordingly. – KennyG Jun 24 '23 at 21:50
  • @KennyG When you replace `|` with `||`, be sure to move your `is.na()` tests to the beginning of the condition; that way, you actually take advantage of the short-circuiting. But probably the best approach (for scalars like `dif.nuc*[p]`) is to simply use the condition `isTRUE(dif.nuc1[p] == "N") || isTRUE(dif.nuc2[p] == "N")`, where `isTRUE()` just handles `NA`s automatically. – Greg Jun 25 '23 at 19:39
  • @KennyG I should add: there is probably a _far_ more efficient way to do this, using [vectorized operations](https://bookdown.org/rdpeng/rprogdatascience/vectorized-operations.html). **Would you kindly post a "reproducible example" ("reprex") of your data, by pasting the output from `dput(head(dif.nuc1, n = 10))` and `dput(head(dif.nuc2, n = 10))`?** – Greg Jun 25 '23 at 19:42

0 Answers0