Your question is a bit too broad, since it asks about implementing this algorithm in functional languages but how best to do that is language-dependent. My answer therefore focuses on Erlang, given your example code.
First, note that there's no need to have separate patternCount
and patternCount_
functions. Instead, you can just have multiple patternCount
functions with different arities as well as multiple clauses of the same arity. First, let's rewrite your functions to take that into account, and also replace calls to string:len/1
with the length/1
built-in function:
patternCount(Text,Pattern) ->
patternCount(Text,Pattern,length(Pattern),0).
patternCount(Text,Pattern,PatternLength,Count) ->
case length(Text) < PatternLength of
true -> Count;
false ->
case string:equal(string:substr(Text,1,PatternLength),Pattern) of
true ->
patternCount(string:substr(Text,2),Pattern,PatternLength,Count+1);
false ->
patternCount(string:substr(Text,2),Pattern,PatternLength,Count)
end
end.
Next, the multi-level indentation in the patternCount/4
function is a "code smell" indicating it can be done better. Let's split that function into multiple clauses:
patternCount(Text,Pattern,PatternLength,Count) when length(Text) < PatternLength ->
Count;
patternCount(Text,Pattern,PatternLength,Count) ->
case string:equal(string:substr(Text,1,PatternLength),Pattern) of
true ->
patternCount(string:substr(Text,2),Pattern,PatternLength,Count+1);
false ->
patternCount(string:substr(Text,2),Pattern,PatternLength,Count)
end.
The first clause uses a guard to detect that no more matches are possible, while the second clause looks for matches. Now let's refactor the second clause to use Erlang's built-in matching. We want to advance through the input text one element at a time, just as the original code does, but we also want to detect matches as we do so. Let's perform the matches in our function head, like this:
patternCount(_Text,[]) -> 0;
patternCount(Text,Pattern) ->
patternCount(Text,Pattern,Pattern,length(Pattern),0).
patternCount(Text,_Pattern,_Pattern,PatternLength,Count) when length(Text) < PatternLength ->
Count;
patternCount(Text,[],Pattern,PatternLength,Count) ->
patternCount(Text,Pattern,Pattern,PatternLength,Count+1);
patternCount([C|TextTail],[C|PatternTail],Pattern,PatternLength,Count) ->
patternCount(TextTail,PatternTail,Pattern,PatternLength,Count);
patternCount([_|TextTail],_,Pattern,PatternLength,Count) ->
patternCount(TextTail,Pattern,Pattern,PatternLength,Count).
First, note that we added a new argument to the bottom four clauses: we now pass Pattern
as both the second and third arguments to allow us to use one of them for matching and one of them to maintain the original pattern, as explained more fully below. Note also that we added a new clause at the very top to check for an empty Pattern
and just return 0 in that case.
Let's focus only on the bottom three patternCount/5
clauses. These clauses are tried in order at runtime, but let's look at the second of these three clauses first, then the third clause, then the first of the three:
In the second of these three clauses, we write the first and second arguments in [Head|Tail
] list notation, which means Head
is the first element of the list and Tail
is the rest of the list. We use the same variable for the head of both lists, which means that if the first elements of both lists are equal, we have a potential match in progress, so we then recursively call patternCount/5
passing the tails of the lists as the first two arguments. Passing the tails allows us to advance through both the input text and the pattern an element at a time, checking for matching elements.
In the last clause, the heads of the first two arguments do not match; if they did, the runtime would execute the second clause, not this one. This means that our pattern match has failed, and so we no longer care about the first element of the first argument nor about the second argument, and we have to advance through the input text to look for a new match. Note that we write both the head of the input text and the second argument as the _
"don't care" variable, as they are no longer important to us. We recursively call patternCount/5
, passing the tail of the input text as the first argument and the full Pattern
as the second argument, allowing us to start looking for a new match.
In the first of these three clauses, the second argument is the empty list, which means we've gotten here by successfully matching the full Pattern
, element by element. So we recursively call patternCount/5
passing the full Pattern
as the second argument to start looking for a new match, and we also increment the match count.
Try it! Here's the full revised module:
-module(test).
-export([read_text/1,pattern_count/2,main/0]).
read_text(FileName) ->
{ok,File} = file:read_file(FileName),
unicode:characters_to_list(File).
pattern_count(_Text,[]) -> 0;
pattern_count(Text,Pattern) ->
pattern_count(Text,Pattern,Pattern,length(Pattern),0).
pattern_count(Text,_Pattern,_Pattern,PatternLength,Count)
when length(Text) < PatternLength ->
Count;
pattern_count(Text,[],Pattern,PatternLength,Count) ->
pattern_count(Text,Pattern,Pattern,PatternLength,Count+1);
pattern_count([C|TextTail],[C|PatternTail],Pattern,PatternLength,Count) ->
pattern_count(TextTail,PatternTail,Pattern,PatternLength,Count);
pattern_count([_|TextTail],_,Pattern,PatternLength,Count) ->
pattern_count(TextTail,Pattern,Pattern,PatternLength,Count).
main() ->
pattern_count(read_text("file.txt"),"hello").
A few final recommendations:
Searching through text element by element is slower than necessary. You should have a look at the Boyer-Moore algorithm and other related algorithms to see ways of advancing through text in larger chunks. For example, Boyer-Moore attempts to match at the end of the pattern first, since if that's not a match, it can advance through the text by as much as the full length of the pattern.
You might want to also looking into using Erlang binaries rather than lists, as they are more compact memory-wise and they allow for matching more than just their first elements. For example, if Text
is the input text as a binary and Pattern
is the pattern as a binary, and assuming the size of Text
is equal to or greater than the size of Pattern
, this code attempts to match the whole pattern:
case Text of
<<Pattern:PatternLength/binary, TextTail/binary>> = Text ->
patternCount(TextTail,Pattern,PatternLength,Count+1);
<<_/binary,TextTail/binary>> ->
patternCount(TextTail,Pattern,PatLen,Count)
end.
Note that this code snippet reverts to using patternCount/4
since we no longer need the extra Pattern
argument to work through element by element.
As shown in the full revised module, when calling functions in the same module, you don't need the module prefix. See the simplified main/0
function.
As shown in the full revised module, conventional Erlang style does not use mixed case function names like patternCount
. Most Erlang programmers would use pattern_count
instead.