I have to implement a simple semantic analysis of a text (an 800 MB .txt file). For small files everything runs quickly. I read the file line by line, and that part works: reading alone takes 9 s. But once I start the analysis, adding words to the dictionary and storing their positions in the text, processing takes too long.

Could you suggest a better variation, or a better overall solution to this problem? I would appreciate any advice on semantic analysis of text and on this procedure. Thanks.

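    public List<string> SplitWords(string s)
    {
        s = s.ToLower();
        // Split on runs of non-word characters.
        arrayWords = Regex.Split(s, @"\W+");
        listWords = arrayWords.OfType<string>().ToList();

        for (int i = 0; i < listWords.Count; i++)
        {
            // Drop stop words and tokens shorter than 2 characters.
            // Array.BinarySearch requires stopwords to be sorted.
            if (Array.BinarySearch(stopwords, listWords[i]) >= 0 || listWords[i].Length < 2)
            {
                listWords.RemoveAt(i);
                i--; // stay on the same index after the removal
            }
        }
        return listWords;
    }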

Code for separating words
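
One possible variation of the function above, as a minimal sketch rather than a measured fix: keep the stop words in a `HashSet<string>` with an ordinal comparer, build the regex once, and add the kept words to a new list instead of removing items in place. The class name `Tokenizer`, the three sample stop words, and the switch from `ToLower` to `ToLowerInvariant` are assumptions for illustration; the real list of 321 stop words would be loaded instead. `Array.BinarySearch` on a `string[]` uses the default culture-sensitive comparison, which is likely slower than an ordinal hash lookup, but only a profiler can confirm that for this data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Hypothetical stop-word set; replace with the real list of stop words.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.Ordinal) { "the", "and", "of" };

    // Built once and reused for every line instead of calling Regex.Split(s, pattern) each time.
    private static readonly Regex NonWord = new Regex(@"\W+", RegexOptions.Compiled);

    public static List<string> SplitWords(string line)
    {
        string[] parts = NonWord.Split(line.ToLowerInvariant());
        var words = new List<string>(parts.Length);
        foreach (string part in parts)
        {
            // Same rule as the original: drop stop words and tokens shorter than 2 characters.
            if (part.Length >= 2 && !StopWords.Contains(part))
            {
                words.Add(part);
            }
        }
        return words;
    }
}
```

Calling `Tokenizer.SplitWords("The cat and the hat, of a dog!")` with the sample stop words keeps only `cat`, `hat` and `dog`.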

    public void AddToDictonary(List<string> arrayWords)
    {
        for (int i = 0; i < arrayWords.Count; i++)
        {
            // Note: i is the index within this list, not within the whole file.
            if (!dictonary.ContainsKey(arrayWords[i]))
            {
                dictonary.Add(arrayWords[i], new List<int>() { i });
            }
            else
            {
                dictonary[arrayWords[i]].Add(i);
            }
        }
    }

Code for adding to the dictionary.

  • Instead of asking if the dictionary contains the word, you should use the `TryGetValue` method. See: http://stackoverflow.com/questions/9382681/what-is-more-efficient-dictionary-trygetvalue-or-containskeyitem – Oscar Mederos Feb 14 '13 at 00:48
  • I also suggest using **dotTrace** or a similar tool. It will give you a performance report of your code, and you'll be able to see which part of your code is the slowest. – Oscar Mederos Feb 14 '13 at 00:50
  • I'll try TryGetValue. Thanks. The slowest code is the for loop (in the SplitWords function), where I compare every word from my text file with the 321 stop words in an array. I was thinking about using StringBuilder. What do you think? How can I speed up the comparisons? – user2039847 Feb 14 '13 at 12:52
  • How many words are you getting in `listWords` when you do `arrayWords.OfType().ToList();`? – Oscar Mederos Feb 14 '13 at 18:17
  • Few, about 10 on average; I read the text line by line. What is extremely slow is the large number of comparisons in the for loop in the SplitWords function. – user2039847 Feb 15 '13 at 01:57
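
The `TryGetValue` suggestion from the comments could look roughly like the sketch below. It does one lookup per word where `ContainsKey` followed by the indexer does two. The class name `PositionIndex`, the `PositionsOf` helper and the `offset` parameter are assumptions for illustration: `offset` stands for the number of words already indexed from earlier lines, in case the stored positions are meant to be global rather than per line.

```csharp
using System;
using System.Collections.Generic;

public class PositionIndex
{
    private readonly Dictionary<string, List<int>> positions =
        new Dictionary<string, List<int>>(StringComparer.Ordinal);

    // offset is a hypothetical running word count from earlier lines; pass 0 for per-line positions.
    public void Add(List<string> words, int offset)
    {
        for (int i = 0; i < words.Count; i++)
        {
            List<int> list;
            // Single lookup: fetch the existing list, or create one on first sight of the word.
            if (!positions.TryGetValue(words[i], out list))
            {
                list = new List<int>();
                positions.Add(words[i], list);
            }
            list.Add(offset + i);
        }
    }

    // Returns the recorded positions, or an empty list for an unknown word.
    public List<int> PositionsOf(string word)
    {
        List<int> list;
        return positions.TryGetValue(word, out list) ? list : new List<int>();
    }
}
```

A caller reading line by line would keep its own running total and pass it as `offset`, adding `words.Count` after each line.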

1 Answer

You can use the regular expression I posted here to tokenize your sentences.

Community
Fran Casadome