I'm relatively new to C# and I'm trying to get my head around a problem that I believe should be pretty simple in concept, but I just can't get it.

The program is run from the command line with two arguments. For each sequence ID in a query text file, I want to display a message on the console if that ID does not exist in a second text file full of sequence IDs and DNA sequences. For example, args[0] is a text file that contains 41534 lines of sequences, which means I cannot load the entire file into memory:

NR_118889.1 Amycolatopsis azurea strain NRRL 11412 16S ribosomal RNA, partial sequence GGTCTNATACCGGATATAACAACTCATGGCATGGTTGGTAGTGGAAAGCTCCGGCGT

NR_118899.1 Actinomyces bovis strain DSM 43014 16S ribosomal RNA, partial sequence GGGTGAGTAACACGTGAGTAACCTGCCCCNNACTTCTGGATAACCGCTTGAAAGGGTNGCTAATACGGGATATTTTGGCCTGCT

NR_074334.1 Archaeoglobus fulgidus DSM 4304 16S ribosomal RNA, complete sequence >NR_118873.1 Archaeoglobus fulgidus DSM 4304 strain VC-16 16S ribosomal RNA, complete sequence >NR_119237.1 Archaeoglobus fulgidus DSM 4304 strain VC-16 16S ribosomal RNA, complete sequence
ATTCTGGTTGATCCTGCCAGAGGCCGCTGCTATCCGGCTGGGACTAAGCCATGCGAGTCAAGGGGCTT

args[1] is a query text file with some sequence IDs:

NR_118889.1

NR_999999.1

NR_118899.1

NR_888888.1

So when the program is run, all I want displayed are the sequence IDs from args[1] that were not found in args[0]:

NR_999999.1 could not be found

NR_888888.1 could not be found

I know this is probably super simple, but I have spent far too long trying to figure it out by myself, to the point where I want to ask for help.

Thank you in advance for any assistance.

DeanOC
Mazurian
  • The task is called a ['diff'](https://en.wikipedia.org/wiki/Diff). In its full glory this is [highly complex](https://stackoverflow.com/questions/24887238/how-to-compare-two-rich-text-box-contents-and-highlight-the-characters-that-are/24970638?r=SearchResults&s=1|27.2706#24970638). – TaW Sep 28 '19 at 09:04
  • The issue is usually how to re-sync the comparisons when some text is not different but missing in one set. The best course of action usually is to decide if you really need a full diff. In your case you may get away with creating two lists and comparing those, i.e. you would ignore the order of the names and treat each line as one entity. List has many useful functions to do that. [Example](https://stackoverflow.com/questions/12795882/quickest-way-to-compare-two-generic-lists-for-differences) – TaW Sep 28 '19 at 09:06
  • You want to read each file once and put the lines into an array (string[]). Right now, if the first file has 100 lines, you are opening and reading the 2nd file 100 times. The solution is actually more complicated than your current code. It requires you to compare each row in order and, when one file does not match the other, print the line, then continue printing non-matching lines until a match is found, which is complicated. The easiest method is to use a LINQ join. See https://code.msdn.microsoft.com/101-LINQ-Samples-3fb9811b – jdweng Sep 28 '19 at 09:11
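
The suggestions in the comments above (read each file once, ignore order, treat the IDs as a set) could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the names `MissingIdFinder` and `FindMissing` are made up for illustration, the query file is assumed to be small enough to hold in memory, and an ID counts as "found" if it appears anywhere in a line of the sequence file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the original post.
public static class MissingIdFinder
{
    // Returns the query IDs that never appear in any line of the sequence data,
    // in the order they were listed in the query.
    public static List<string> FindMissing(IEnumerable<string> sequenceLines,
                                           IEnumerable<string> queryIds)
    {
        // Normalise the query: trim, skip blank lines, drop duplicates.
        var ordered = queryIds.Select(id => id.Trim())
                              .Where(id => id.Length > 0)
                              .Distinct()
                              .ToList();
        var remaining = new HashSet<string>(ordered);

        // Single pass over the (possibly large) sequence data.
        foreach (string line in sequenceLines)
        {
            if (remaining.Count == 0) break; // every ID has already been seen
            // Drop every ID that occurs somewhere in this line (substring match).
            remaining.RemoveWhere(id => line.Contains(id));
        }

        return ordered.Where(id => remaining.Contains(id)).ToList();
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: <sequence file> <query file>");
            return;
        }

        // File.ReadLines streams the file lazily instead of loading it whole.
        foreach (string id in FindMissing(File.ReadLines(args[0]), File.ReadLines(args[1])))
            Console.WriteLine($"{id} could not be found");
    }
}
```

Because the match is a plain substring test, an ID such as NR_118889.1 would also be considered found in a line that contains NR_118889.10; if that matters, the IDs would need to be parsed out of each line and compared exactly.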

3 Answers

You can try this.

It loads the contents of both files and reports each line of the second file that is not found in the first.

static void Main(string[] args)
{
  if ( args.Length != 2 )
  {
    Console.WriteLine("Usage: {exename}.exe [filename 1] [filename 2]");
    Console.ReadKey();
    return;
  }

  string filename1 = args[0];
  string filename2 = args[1];

  bool checkFiles = true;

  if ( !File.Exists(filename1) )
  {
    Console.WriteLine($"{filename1} not found.");
    checkFiles = false;
  }

  if ( !File.Exists(filename2) )
  {
    Console.WriteLine($"{filename2} not found.");
    checkFiles = false;
  }
  if ( !checkFiles )
  {
    Console.ReadKey();
    return;
  }

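  // ReadAllLines loads both files fully into memory; blank lines are skipped.
  var lines1 = System.IO.File.ReadAllLines(filename1).Where(l => l != "").ToList();
  var lines2 = System.IO.File.ReadAllLines(filename2).Where(l => l != "").ToList();

  foreach ( var line in lines2 )
    // An ID counts as found if any line of the first file contains it (needs using System.Linq).
    if ( !lines1.Any(l => l.Contains(line)) )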
    {
      Console.WriteLine($"{line} could not be found");
      checkFiles = false;
    }

  if (checkFiles)
    Console.WriteLine("There is no difference.");

  Console.ReadKey();
}
  • Sorry, I updated it. I thought using the names would be similar to what I'm trying to achieve, but maybe not. – Mazurian Sep 28 '19 at 14:51
  • Answer updated: it only displays the ID's of the second file that are not contained in any line of the first file. Isn't that your question? It seems there is a problem with your sample. I don't understand the `>` and the line structure, which seems to vary. Could you add an empty line between each line, please? –  Sep 28 '19 at 15:15
  • @Olivier Rogier I want the program to search the entire file for the sequence ID's in the query text file, and return an error if a sequence ID could not be found. The > specifies a new sequence ID; however, a DNA sequence can have multiple ID's, which is why the third one has three. I split them up to make them easier to read, but in reality they are all bunched up. – Mazurian Sep 29 '19 at 00:50
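
Given the explanation in the last comment (a `>` starts a new sequence ID, and one header line can carry several of them), the IDs could be pulled out of a header line roughly like this. It is only a sketch: `HeaderIds` and `Extract` are hypothetical names, and it assumes that header lines start with `>` and that each ID is the first whitespace-delimited token after a `>`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original post.
public static class HeaderIds
{
    // Yields every ID on a header line such as
    // ">NR_074334.1 description >NR_118873.1 description".
    // Lines that do not start with '>' (e.g. raw DNA) yield nothing.
    public static IEnumerable<string> Extract(string line)
    {
        if (!line.StartsWith(">")) yield break;

        foreach (string part in line.Split('>'))
        {
            string trimmed = part.Trim();
            if (trimmed.Length == 0) continue; // text before the first '>'

            // The ID is everything up to the first space, or the whole part.
            int space = trimmed.IndexOf(' ');
            yield return space < 0 ? trimmed : trimmed.Substring(0, space);
        }
    }

    public static void Main()
    {
        string header = ">NR_074334.1 Archaeoglobus fulgidus >NR_118873.1 Archaeoglobus fulgidus";
        foreach (string id in Extract(header))
            Console.WriteLine(id);
    }
}
```

Collecting the extracted IDs into a `HashSet<string>` would allow exact lookups of the query IDs, rather than substring matches against whole lines.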
This works, but it only processes the first line of the files...

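using (System.IO.StreamReader sr1 = new System.IO.StreamReader(args[1]))
{
    using (System.IO.StreamReader sr2 = new System.IO.StreamReader(args[2]))
    {
        string line1, line2;
        bool found = false; // was not declared in the original snippet

        while ((line1 = sr1.ReadLine()) != null)
        {
            // sr2 reaches the end of its stream during the first pass and is never
            // rewound, so this inner loop only runs for the first line of sr1.
            while ((line2 = sr2.ReadLine()) != null)
            {
                if (line1.Contains(line2))
                {
                    found = true;
                    WriteLine("{0} exists!", line2);
                }

                if (found == false)
                {
                    WriteLine("{0} does not exist!", line2);
                }
            }
        }
    }
}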
Mazurian
var saved_ids = new List<String>();
foreach (String args1line in File.ReadLines(args[1]))
{
    foreach (String args2line in File.ReadLines(args[2]))
    {
        if (args1line.Contains(args2line))
        {
            saved_ids.Add(args2line);
        }
    }
}

using (System.IO.StreamReader sr1 = new System.IO.StreamReader(args[1]))
{
    using (System.IO.StreamReader sr2 = new System.IO.StreamReader(args[2]))
    {
        string line1, line2;

        while ((line1 = sr1.ReadLine()) != null)
        {
            while ((line2 = sr2.ReadLine()) != null)
            {
                if (line1.Contains(line2))
                {
                    saved_ids.Add(line2);
                    break;
                }

                if (!line1.StartsWith(">"))
                {
                    break;
                }

                if (saved_ids.Contains(line1))
                {
                    break;
                }

                if (saved_ids.Contains(line2))
                {
                    break;
                }

                if (!line1.Contains(line2))
                {
                    saved_ids.Add(line2);
                    WriteLine("The sequence ID {0} does not exist", line2);
                }
            }

            if (line2 == null)
            {
                sr2.DiscardBufferedData();
                sr2.BaseStream.Seek(0, System.IO.SeekOrigin.Begin);
                continue;
            }
        }
    }
}
Mazurian