
I want to search for a list of words in multiple text files and compute the number of files that contain each word.

My code takes a lot of time; it can run for hours.

    uniqword = File.ReadAllLines(@"H:\\backstage\my work\uniqword.txt").ToList();
    string[] allfile = Directory.GetFiles(@"H:\\backstage\my work\categories file text\categories", "*.txt");
    var no_doc_word = new Dictionary<string, int>();

    foreach (string ff1 in allfile) // read the files one at a time
    {
        List<string> allLinesText = File.ReadAllLines(ff1).ToList();

        foreach (string word in uniqword)
        {
            if (allLinesText.Contains(word))
                if (no_doc_word.ContainsKey(word))
                    no_doc_word[word]++;
                else
                    no_doc_word.Add(word, 1);
        }
    }
Programmer
  • There is no question here. Why not use ReadAllText and loop once as opposed to line by line? Currently you would not match differently cased words and dog would match doggedly. – Alex K. Jan 26 '17 at 16:19
  • How many files? How big are the files? – Matt Spinks Jan 26 '17 at 16:20
  • What do you want us to do with this? We're not just going to re-write your code for you. What's your question? – rory.ap Jan 26 '17 at 16:20
  • 54,000 text files... I split them into words – Programmer Jan 26 '17 at 16:21
  • You could rewrite this so you only read the text file until the point at which all words in uniqword were found - might reduce the workload a bit. Use a StreamReader rather than ReadAllLines (see the sketch after these comments) – Tim Rutter Jan 26 '17 at 16:23
  • my code is very slow... I want an idea or code that is optimal for the purpose – Programmer Jan 26 '17 at 16:23
  • How can anyone suggest something "optimal for the purpose" if you did not say what the purpose of the code is? Normally you'd just use an existing indexing engine and query it... e.g. if you are using Windows - http://stackoverflow.com/questions/34338465/how-to-use-windows-search-service-in-c-sharp – Alexei Levenkov Jan 26 '17 at 16:28
  • my task is to extract unique words from 54,000 text files and then compute the number of documents that contain each unique word – Programmer Jan 26 '17 at 16:31
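
Following up on the StreamReader suggestion in the comments, here is a minimal sketch of that early-exit idea. The method name CountDocumentFrequencies and its parameters are hypothetical, not part of the question's code; it counts each file at most once per word and stops reading a file as soon as every word from the list has been seen in it:

    using System.Collections.Generic;
    using System.IO;

    static Dictionary<string, int> CountDocumentFrequencies(IEnumerable<string> files, HashSet<string> uniqueWords)
    {
        var docFreq = new Dictionary<string, int>();
        foreach (string file in files)
        {
            // Words from the master list not yet seen in this particular file.
            var remaining = new HashSet<string>(uniqueWords);
            using (var reader = new StreamReader(file))
            {
                string line;
                // Stop reading as soon as every word has been found in this file.
                while (remaining.Count > 0 && (line = reader.ReadLine()) != null)
                {
                    foreach (string token in line.Split(' ', ',', '.'))
                    {
                        if (remaining.Remove(token))
                        {
                            // First occurrence of this word in this file: count the file once.
                            docFreq.TryGetValue(token, out int count);
                            docFreq[token] = count + 1;
                        }
                    }
                }
            }
        }
        return docFreq;
    }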

2 Answers


You can just check for the words and count them as you read the file:

    async Task Contains(string file)
    {
        using (StreamReader reader = new StreamReader(File.OpenRead(file)))
        {
            string line;
            while ((line = await reader.ReadLineAsync()) != null)
            {
                string[] words = line.Split(new char[] { ' ', ',', '.' });
                foreach (string word in uniqword)
                {
                    int howMany = words.Count(w => w.Equals(word));
                    if (no_doc_word.ContainsKey(word))
                        no_doc_word[word] += howMany;
                    else
                        no_doc_word.Add(word, howMany);
                }
            }
        }
    }

And since this is async, you can even call it as many times as you want:

    public void Check()
    {
        string[] files = new string[] { @"C:\file1.txt", @"C:\file2.txt" };
        List<Task> tasks = new List<Task>();
        foreach (string file in files)
            tasks.Add(Contains(file));

        Task.WaitAll(tasks.ToArray());
    }

EDIT:

The benefit of using this method is that all of the files (or almost all) are processed at the same time.
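
Note that if several of these tasks really do run at the same time, the shared no_doc_word dictionary must be made thread-safe; a plain Dictionary is not. A minimal sketch of one way to do that, assuming no_doc_word is switched to a ConcurrentDictionary (this is an addition, not part of the original answer):

    using System.Collections.Concurrent;

    // Thread-safe replacement for the plain Dictionary<string, int>.
    ConcurrentDictionary<string, int> no_doc_word = new ConcurrentDictionary<string, int>();

    // Inside the loop over uniqword, the ContainsKey/Add pair becomes a single atomic update:
    no_doc_word.AddOrUpdate(word, howMany, (key, current) => current + howMany);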

mrogal.ski

Depending on what the actual bottleneck is, something as simple as this can do it (explanations in comments):

    var words = new HashSet<string>(File.ReadAllLines(@"H:\\backstage\my work\uniqword.txt"));          // list of words
    var filewords = Directory.GetFiles(@"H:\\backstage\my work\categories file text\categories", "*.txt")    
        .Select(f => File.ReadAllText(f))                                                               // read all text in each file
        .SelectMany(s => words.Intersect(Regex.Split(s, @"\W|_")))                                      // intersect the set of words in each file with the master list of words, then flatten the list
        .GroupBy(s => s).ToDictionary(w => w.Key, w => w.Count());                                      // build a dictionary of each word and how many files it's used in
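
The snippet above assumes using System.Linq; and using System.Text.RegularExpressions; are in scope. A brief, hypothetical usage example that dumps the resulting document frequencies:

    // filewords maps each unique word to the number of files it appears in.
    foreach (var pair in filewords.OrderByDescending(p => p.Value))
        Console.WriteLine($"{pair.Key}: {pair.Value} files");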
Blindy
  • this code doesn't read the first line (it reads from رسل to the end of the file and skips catId=). My file is: _catId=1 _رسل _جامعة . . – Programmer Jan 26 '17 at 18:51
  • First line of what? – Blindy Jan 26 '17 at 18:55
  • my files are in Arabic, except the first line, which is the class number; your code does not read it, as explained in my previous comment. – Programmer Jan 26 '17 at 19:12
  • I want to compute tf-idf; when I apply idf, (catId=..) is not found in filewords (see the idf sketch after these comments) – Programmer Jan 26 '17 at 19:29
  • I'm sorry but you're not making much sense. If your initial question doesn't properly describe your issue, try making a new question with more information (and lots of examples, because again you're not making much sense). – Blindy Jan 27 '17 at 15:39
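
For the tf-idf goal mentioned in the comments above, a minimal sketch of how a document-frequency dictionary (such as no_doc_word or filewords from the answers) could feed an idf computation; the names Idf, docFreq and totalFiles are assumptions for illustration, not part of either answer:

    using System;
    using System.Collections.Generic;

    // idf(word) = log(N / df(word)), where N is the total number of files
    // and df(word) is the number of files containing the word.
    static double Idf(Dictionary<string, int> docFreq, int totalFiles, string word)
    {
        return docFreq.TryGetValue(word, out int df) && df > 0
            ? Math.Log((double)totalFiles / df)
            : 0.0; // word not found in any file: idf is simply 0 in this sketch
    }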