I am reading through documents and splitting them into words so that each word ends up in a dictionary. How can I exclude some words (like "the", "a", "an")?

This is my function:

private void Splitter(string[] file)
{
    try
    {
        tempDict = file
            .SelectMany(i => File.ReadAllLines(i)
            .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }
    catch (Exception ex)
    {
        Ex(ex);
    }
}

Also, in this scenario, where is the right place to add a .ToLower() call so that all the words from the files are in lowercase? I was thinking about something like this before the (temp = file..):

file.ToList().ConvertAll(d => d.ToLower());
Paul Roub
  • Please don't code exceptions like `catch (Exception ex)`. It's a bad practice. You should only catch truly exceptional exceptions - otherwise you should code in a way to prevent errors before they occur. See http://ericlippert.com/2008/09/10/vexing-exceptions/ – Enigmativity May 15 '15 at 07:25
  • Irrelevant to the question, but you can use `Regex.Split` with parameter `"\b"` to get the individual words. It is more reliable than writing all those characters by hand. – Yusuf Tarık Günaydın May 15 '15 at 07:27
  • @Enigmativity Without this catch I was getting an exception when I closed the OpenFileDialog without choosing a file to split. – Ken'ichi Matsuyama May 15 '15 at 07:28
  • @Ken'ichiMatsuyama Then your error is to call this function if a file was not selected. The correct way is to see if a file was selected, not to go forward and later catch exceptions – Sami Kuhmonen May 15 '15 at 07:35
  • @Enigmativity This is completely off topic, but a useful talk. I personally feel that you should capture all exceptions. It should go without saying that you should always at least do something with them, though. The way that I have always seen it is that you want your user to have a pleasant experience, and if for whatever reason there is an error that will cause a failure in the application, then you don't want the user to see something like nullObjectException at line 203... blah blah... They want to see an error which tells them how they can report the bug to you. – Keithin8a May 15 '15 at 07:43
  • One important thing to note with my comment is that you should catch all exceptions, but you should think about particular types of exceptions which you may want to catch and handle differently. – Keithin8a May 15 '15 at 07:44
  • @Keithin8a Exactly. There is a difference between catching exceptions (usually higher up) to have a trace and show the user a nice error message and laziness, like checking if the user actually selected a file or canceled the operation. We wouldn't just call methods on an object without checking if it's null and just catch an exception either. And yes, this is going very OT :/ – Sami Kuhmonen May 15 '15 at 07:46
  • @SamiKuhmonen Yeah I read your comment again and realised I replied to the wrong person. Off topic conversations are sometimes useful in comments. We are all here to learn, are we not? But yeah they shouldn't go too overboard. – Keithin8a May 15 '15 at 07:48
  • By your words I am now trying to find an error first, so +1 for everyone I guess? – Ken'ichi Matsuyama May 15 '15 at 07:55
  • @Keithin8a - You should limit catching exceptions throughout the inner workings of your code **unless you can handle them in a meaningful way to the user**. I agree that you want your user to have a pleasant experience, but I prefer to see an exception rather than have the code **swallow an exception and appear not to work**. I would much rather have a single high-level exception handler that says to the user that "something went wrong and would you like to email the error to the application support team" - then something can be done to fix it and the user would appreciate that. – Enigmativity May 15 '15 at 08:06

2 Answers

Do you want to filter out stop words?

 HashSet<String> StopWords = new HashSet<String> { 
   "a", "an", "the" 
 }; 

 ...

 tempDict = file
   .SelectMany(i => File.ReadAllLines(i)
   .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
   .AsParallel()
   .Select(word => word.ToLower()) // <- To Lower case 
   .Where(word => !StopWords.Contains(word)) // <- No stop words
   .Distinct()) // <- Once per file; this also closes the outer SelectMany
   .GroupBy(word => word)
   .ToDictionary(g => g.Key, g => g.Count());

However, this code is only a partial solution: proper names like Berlin will be converted to lower case (berlin), acronyms such as KISS (Keep It Simple, Stupid) will become just kiss, and some numbers will be incorrect.
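One possible way around the casing problem is to leave the words untouched and make the stop-word set itself case-insensitive. The following is only a sketch under stated assumptions: `StopWordDemo`, `CountWords`, and `Separators` are hypothetical names, the input is a set of in-memory lines rather than files, and it counts every occurrence instead of applying `Distinct()` per file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordDemo
{
    // Case-insensitive set: "The", "THE" and "the" are all treated as stop words.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "the" };

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Hypothetical helper: counts words in already-loaded lines,
    // keeping each word's original casing as the dictionary key.
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            .Where(word => !StopWords.Contains(word)) // case-insensitive lookup
            .GroupBy(word => word)                    // case-sensitive grouping
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var counts = CountWords(new[] { "The KISS principle.", "A kiss in Berlin!" });
        foreach (var pair in counts)
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```

With this input, KISS and kiss stay separate keys and Berlin keeps its capital letter. The trade-off is that an ordinary word at the start of a sentence (Dogs) is counted separately from its lower-case form (dogs).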

Dmitry Bychenko
  • Sorry, I am not a native speaker, just checked what stop words are, and yeah I would like to filter them. I like this idea, a HashSet should also be performance efficient. – Ken'ichi Matsuyama May 15 '15 at 07:32

I would do this:

var ignore = new [] { "the", "a", "an" };
tempDict = file
    .SelectMany(i =>
        File
            .ReadAllLines(i)
            .SelectMany(line =>
                line
                    .ToLowerInvariant()
                    .Split(
                        new[] { ' ', ',', '.', '?', '!', },
                        StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
    .Where(x => !ignore.Contains(x))
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());

You could change ignore to a HashSet<string> if performance becomes an issue, but that is unlikely to matter, since you are doing file IO.
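For illustration, the HashSet<string> variant might look roughly like the sketch below. The names `IgnoreSetDemo`, `CountDocuments`, and `Separators` are hypothetical, and each `string[]` stands in for the result of `File.ReadAllLines(path)` so the example runs without touching the disk; `AsParallel()` is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IgnoreSetDemo
{
    // HashSet lookups are O(1) on average, versus a linear scan of the array.
    private static readonly HashSet<string> Ignore =
        new HashSet<string> { "the", "a", "an" };

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Hypothetical helper: each string[] plays the role of one file's lines.
    public static Dictionary<string, int> CountDocuments(IEnumerable<string[]> documents)
    {
        return documents
            .SelectMany(lines => lines
                .SelectMany(line => line
                    .ToLowerInvariant()
                    .Split(Separators, StringSplitOptions.RemoveEmptyEntries))
                .Distinct())                       // each word once per document
            .Where(word => !Ignore.Contains(word)) // drop ignored words
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var documents = new[]
        {
            new[] { "The cat sat.", "The cat slept!" },
            new[] { "A cat, a dog?" }
        };
        foreach (var pair in CountDocuments(documents))
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```

Because `Distinct()` runs per document before the grouping, each value is the number of documents that contain the word, not the total number of occurrences: here cat maps to 2 even though it appears three times.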

Enigmativity