1

I'm trying to implement a sliding window algorithm for matching words in a text file. I come from a procedural background and my first attempt to do this in a functional language like Erlang seems to require time O(n^2) (or even more). How would one do this in a functional language?

-module(test).
-export([readText/1,patternCount/2,main/0]).

readText(FileName) ->
    {ok,File} = file:read_file(FileName),
    unicode:characters_to_list(File).

patternCount(Text,Pattern) ->
    patternCount_(Text,Pattern,string:len(Pattern),0).

patternCount_(Text,Pattern,PatternLength,Count) ->
    case string:len(Text) < PatternLength of 
        true -> Count;
        false ->
            case string:equal(string:substr(Text,1,PatternLength),Pattern) of
                true ->
                    patternCount_(string:substr(Text,2),Pattern,PatternLength,Count+1);
                false ->
                    patternCount_(string:substr(Text,2),Pattern,PatternLength,Count)
            end
    end.

main() ->
    test:patternCount(test:readText("file.txt"),"hello").
Steve Vinoski
  • 19,847
  • 3
  • 31
  • 46
kaiseroskilo
  • 1,719
  • 4
  • 21
  • 29

1 Answers1

6

Your question is a bit too broad, since it asks about implementing this algorithm in functional languages but how best to do that is language-dependent. My answer therefore focuses on Erlang, given your example code.

First, note that there's no need to have separate patternCount and patternCount_ functions. Instead, you can just have multiple patternCount functions with different arities as well as multiple clauses of the same arity. First, let's rewrite your functions to take that into account, and also replace calls to string:len/1 with the length/1 built-in function:

patternCount(Text,Pattern) ->
    patternCount(Text,Pattern,length(Pattern),0).

patternCount(Text,Pattern,PatternLength,Count) ->
    case length(Text) < PatternLength of
        true -> Count;
        false ->
            case string:equal(string:substr(Text,1,PatternLength),Pattern) of
                true ->
                    patternCount(string:substr(Text,2),Pattern,PatternLength,Count+1);
                false ->
                    patternCount(string:substr(Text,2),Pattern,PatternLength,Count)
            end
    end.

Next, the multi-level indentation in the patternCount/4 function is a "code smell" indicating it can be done better. Let's split that function into multiple clauses:

patternCount(Text,Pattern,PatternLength,Count) when length(Text) < PatternLength ->
    Count;
patternCount(Text,Pattern,PatternLength,Count) ->
    case string:equal(string:substr(Text,1,PatternLength),Pattern) of
        true ->
            patternCount(string:substr(Text,2),Pattern,PatternLength,Count+1);
        false ->
            patternCount(string:substr(Text,2),Pattern,PatternLength,Count)
    end.

The first clause uses a guard to detect that no more matches are possible, while the second clause looks for matches. Now let's refactor the second clause to use Erlang's built-in matching. We want to advance through the input text one element at a time, just as the original code does, but we also want to detect matches as we do so. Let's perform the matches in our function head, like this:

patternCount(_Text,[]) -> 0;
patternCount(Text,Pattern) ->
    patternCount(Text,Pattern,Pattern,length(Pattern),0).

patternCount(Text,_Pattern,_Pattern,PatternLength,Count) when length(Text) < PatternLength ->
    Count;
patternCount(Text,[],Pattern,PatternLength,Count) ->
    patternCount(Text,Pattern,Pattern,PatternLength,Count+1);
patternCount([C|TextTail],[C|PatternTail],Pattern,PatternLength,Count) ->
    patternCount(TextTail,PatternTail,Pattern,PatternLength,Count);
patternCount([_|TextTail],_,Pattern,PatternLength,Count) ->
    patternCount(TextTail,Pattern,Pattern,PatternLength,Count).

First, note that we added a new argument to the bottom four clauses: we now pass Pattern as both the second and third arguments to allow us to use one of them for matching and one of them to maintain the original pattern, as explained more fully below. Note also that we added a new clause at the very top to check for an empty Pattern and just return 0 in that case.

Let's focus only on the bottom three patternCount/5 clauses. These clauses are tried in order at runtime, but let's look at the second of these three clauses first, then the third clause, then the first of the three:

  1. In the second of these three clauses, we write the first and second arguments in [Head|Tail] list notation, which means Head is the first element of the list and Tail is the rest of the list. We use the same variable for the head of both lists, which means that if the first elements of both lists are equal, we have a potential match in progress, so we then recursively call patternCount/5 passing the tails of the lists as the first two arguments. Passing the tails allows us to advance through both the input text and the pattern an element at a time, checking for matching elements.

  2. In the last clause, the heads of the first two arguments do not match; if they did, the runtime would execute the second clause, not this one. This means that our pattern match has failed, and so we no longer care about the first element of the first argument nor about the second argument, and we have to advance through the input text to look for a new match. Note that we write both the head of the input text and the second argument as the _ "don't care" variable, as they are no longer important to us. We recursively call patternCount/5, passing the tail of the input text as the first argument and the full Pattern as the second argument, allowing us to start looking for a new match.

  3. In the first of these three clauses, the second argument is the empty list, which means we've gotten here by successfully matching the full Pattern, element by element. So we recursively call patternCount/5 passing the full Pattern as the second argument to start looking for a new match, and we also increment the match count.

Try it! Here's the full revised module:

-module(test).
-export([read_text/1,pattern_count/2,main/0]).

read_text(FileName) ->
    {ok,File} = file:read_file(FileName),
    unicode:characters_to_list(File).

pattern_count(_Text,[]) -> 0;
pattern_count(Text,Pattern) ->
    pattern_count(Text,Pattern,Pattern,length(Pattern),0).

pattern_count(Text,_Pattern,_Pattern,PatternLength,Count)
  when length(Text) < PatternLength ->
    Count;
pattern_count(Text,[],Pattern,PatternLength,Count) ->
    pattern_count(Text,Pattern,Pattern,PatternLength,Count+1);
pattern_count([C|TextTail],[C|PatternTail],Pattern,PatternLength,Count) ->
    pattern_count(TextTail,PatternTail,Pattern,PatternLength,Count);
pattern_count([_|TextTail],_,Pattern,PatternLength,Count) ->
    pattern_count(TextTail,Pattern,Pattern,PatternLength,Count).

main() ->
    pattern_count(read_text("file.txt"),"hello").

A few final recommendations:

  • Searching through text element by element is slower than necessary. You should have a look at the Boyer-Moore algorithm and other related algorithms to see ways of advancing through text in larger chunks. For example, Boyer-Moore attempts to match at the end of the pattern first, since if that's not a match, it can advance through the text by as much as the full length of the pattern.

  • You might want to also looking into using Erlang binaries rather than lists, as they are more compact memory-wise and they allow for matching more than just their first elements. For example, if Text is the input text as a binary and Pattern is the pattern as a binary, and assuming the size of Text is equal to or greater than the size of Pattern, this code attempts to match the whole pattern:

    case Text of
        <<Pattern:PatternLength/binary, TextTail/binary>> = Text ->
            patternCount(TextTail,Pattern,PatternLength,Count+1);
        <<_/binary,TextTail/binary>> ->
            patternCount(TextTail,Pattern,PatLen,Count)
    end.
    

    Note that this code snippet reverts to using patternCount/4 since we no longer need the extra Pattern argument to work through element by element.

  • As shown in the full revised module, when calling functions in the same module, you don't need the module prefix. See the simplified main/0 function.

  • As shown in the full revised module, conventional Erlang style does not use mixed case function names like patternCount. Most Erlang programmers would use pattern_count instead.

Community
  • 1
  • 1
Steve Vinoski
  • 19,847
  • 3
  • 31
  • 46
  • Thank you for the detailed answer! I wish I could give you more votes than one. I'll carefully read again and again what you've said. Thanks for your help again ! – kaiseroskilo Oct 17 '14 at 17:29