Ideal functional language for implementing a full text search with .NET

Question

During my studies of computer science I came accross some functional languages like Prolog but now I have only been doing imperative stuff like C#, Ruby JavaScript and Java for the past 10 years. Currently I am creating a full text search engine for an online shop and I have come quite far the "imperative way" already. But having stumbled accross some functional languages like Haskell of Clojure it became clear that the functional paradigm fits so much better and that the imperative way is just not the right tool for this job.

So we have a full text index of some 10 million records. Each record basically contains a word occurrence, along with the id and text position from the record of which it originates.

When the user enters a search string it is parsed into an expression tree. For example the search string "transformer 100 W" results in something like

AND('transformer%', OR(NEAR('100', 'W'), NEAR('100', 'watts'), '100W', '0.1kW'))

There is some extra "intelligence" going on here, but that is of no concern for this question.

The expression tree is then evaluated recursively and results in a couple of sql queries that could return up to 100,000 rows in the form of .NET-DataTables. These are then read into sets or dictionaries and, depending on the predicates, intersections and unions are applied in order to find all results that match the entire search expression. For the NEAR evaluation the position indexes of the found occurrences are compared as well. But this is all done imperatively, with a lot of for-loops.

Additionally there is a ranking function that adds up scores of the found word occurrences. Words that are only found as prefixes or with fuzzy matching (done by the database server) get lower scores than precise matches.

For each resulting item I also need to get a list of all word occurrences that were matched, in order to highlight these words in the result pages.

So roughly the evaluation algorithm is a function like

expression tree, full text index -> 
resulting items that match the expressin tree, 
each with a ranking sum 
and a list of all found word occurrences for this item

I am just giving a rough overview here, but I hope you get enough of a picture.

Now the "real world" constraints:

The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.
Loads of data is read into .NET-DataTables and will then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever...).
Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.
The language (and the compiler) should be mature.

Since I need to stay with .NET, I was looking into Clojure-CLR, F# and Scala for .NET.

I like the concepts of Clojure a lot, but right now I can not assess, whether it would be up to the job. Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more "pure" mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it. I haven't delved into Scala much yet, but it seems to be well established.

Any insights would be welcome!

Not helping you at all here, but ... Prolog is usually not considered a functional language, but rather a logic programming language, or a deductive language. — Malte Schwerhoff, Nov 05 '12 at 14:14
C#, Ruby, JavaScript are functional languages. Prolog is not. I suspect, LINQ (which is also pretty much functional) should be sufficient in your case. — SK-logic, Nov 05 '12 at 14:14
C#, Ruby and JavaScript are NOT functional languages. They are languages that support _some_ functional abstractions, but have little language support for functional programming. — Cubic, Nov 05 '12 at 14:17
@Cubic, it depends on functional language definition. I'd say, having first class functions and lexical closures is enough. My standard test is very simple: if it is possible to implement a fixed point combinator, then it is a functional language. — SK-logic, Nov 05 '12 at 14:41
@SK-logic, so [all of these are functional](http://rosettacode.org/wiki/Y_combinator) (including Java and C)? — huon, Nov 05 '12 at 14:46
@dbaupp - not all the implementations are decent. Java is functional, C++ is functional indeed, but C is not - no lexical closures, therefore, no way to implement a *generic* fixed point combinator. — SK-logic, Nov 05 '12 at 14:50
@SK-logic I'd argue having a standard library that actually makes functional programming feasible (which neither JavaScript, nor C# nor Ruby do) should also be a requirement. Of course the definition of "functional language" is fuzzy, but it should require functional programming being practical in that language. — Cubic, Nov 05 '12 at 15:01
In my experience with Clojure-CLR, it's easy to use C# code from Clojure, but hard to use Clojure code from C#. So if your GUI is C#, it's a lot more difficult to call the Clojure code from there. Things just don't map well. I second the comment that LINQ should be sufficient and the cleanest solution, or PLINQ for parallel processing. Mixing languages can be more trouble than it's worth. — Dax Fohl, Nov 05 '12 at 15:01
@Cubic, there are some pure functional languages out there with no library at all. — SK-logic, Nov 05 '12 at 15:11
Hey guys, you might consider opening a separate thread on the functional-language discussion. Here it basically makes the answers to my question hard to discuss. Thank you Dax for sharing your insight and for staying to the point! — Majnu, Nov 05 '12 at 15:26

score 15 · Answer 1 · edited May 23 '17 at 12:18

The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.

Loads of data is read into .NET-DataTables and will then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever...).

F# should be a superior choice. Being a first-class language in Visual Studio, F#'s interoperability with C# is quite good.

Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.

Assuming that you start by a functional-first and immutable implementation, it should be easy to parallelize your app. Moreover, asynchronous workflow is a bless for IO-bound applications like yours.

The language (and the compiler) should be mature.

I don't compare F# to Clojure and Scala on JVM, but F# is much more mature than Clojure CLR and Scala on .NET. In choosing F#, you're sure to have long-term commitment from Microsoft and help from ever-growing F# community.

When the user enters a search string it is parsed into an expression tree.

You can represent expression trees using discriminated unions. With the introduction of query expressions in F# 3.0, you are able to translate your logics into SQL queries easily. You can even push it further by defining a similar query language for your domain.

Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more "pure" mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it.

F# 3.0 introduces type providers to allow users access non-structured data in a type-safe way; you may want to look at this "F# 3.0 - Information Rich Programming" video for more insights. If you would like to use F# as a programming language for data mining, I have asked a related question and got pretty good responses here.

That said, your first feelings about F# may not be correct. From my experience, you can always stay as close to the functional and immutable side as you want. Given that you already have an interesting application, I suggest to get your hands dirty to know whether F# is the language for your purpose.

UPDATE:

Here is an F# prototype which demonstrates the idea:

/// You start by modeling your domain with a set of types.
/// FullText is a sequence of Records, which is processed on demand.
type Word = string
and Freq = int
and Record = {Occurrences: (Word * Freq) list; Id: string}
and FullText = Record seq

/// Model your expression tree by following the grammar closely.
type Expression =
    | Occur of Word
    | Near of Word * Word
    | And of Expression * Expression 
    | Or of Expression * Expression

/// Find wether a word w occurs in the occurrence list.
let occur w {Occurrences = xs} = xs |> Seq.map fst |> Seq.exists ((=) w)

/// Check whether two words are near each other.
/// Note that type annotation is only needed for the stub implementation.
let near (w1: Word) (w2: Word) (r: Record): bool = failwith "Not implemented yet"

/// Evaluate an expression tree.
/// The code is succinct and clear thanks to pattern matching. 
let rec eval expr r = 
    match expr with
    | Occur w -> occur w r
    | Near(w1, w2) -> near w1 w2 r
    | And(e1, e2) -> eval e1 r && eval e2 r
    | Or(e1, e2) -> eval e1 r || eval e2 r

/// Utility function which returns second element in a 3-tuple
let inline snd3 (_, x, _) = x

/// Get the rank of the record by adding up frequencies on the whole database.
let rank (r: Record) (ft: FullText): Freq = failwith "Not implemented yet"

/// Retrieve all records which match the expression tree.
let retrieve expr fullText =
    fullText |> Seq.filter (eval expr)
             |> Seq.map (fun r -> r, rank r fullText, r.Occurrences)
             |> Seq.sortBy snd3

/// An example query
let query = 
    And (Occur "transformer%", 
         Or (Or (Near ("100", "W"), Near ("100", "watts")), 
             Or (Occur "100W", Occur "0.1kW")))

Hi Pad, thanks for your fast answer. I was a little bit hoping to not have to use a Microsoft-only language, but if it is the right tool for the job, no problem. I'll wait for some pro clojure or scala arguments before I make a decision here though. — Majnu, Nov 05 '12 at 14:59
I should have mentioned that F# is an [open source](http://blogs.msdn.com/b/fsharpteam/archive/2012/09/24/announcing-the-f-3-0-open-source-code-drop.aspx) language. Recently, there has been some discussion in F# community to push F# support on Mono further. So don't worry that you're being tied by a Microsoft-only language :). — pad, Nov 05 '12 at 15:03
hi majnu, this is an interesting comment. I have my own reason for not wanting to be tied to MSFT only as well. (having to get VS require n levels of authorization, our clusters are a mix of linux and other stufff). what are yours ? I would tend to think msft would not loose in pursuing opening langages, as it would broaden their adoption and they would still have the best people working on it. the .net framework seems a very nice piece of technology that should be used way more... — nicolas, Nov 07 '12 at 09:40
In the case of F# the issues mentionned are 'solved' as you have emacs, monodevelop the the IDE as well, and mono for the runtime. i have tried mono for F#2.0 it was working fine. F#3.0 i think is in the pipe — nicolas, Nov 07 '12 at 09:42
@pad: Thank you for the code sample! It gives me a good idea, about how F# would approach the task. I started reading Chris Smith's book and do like the language a lot. In the end I've decided to mark Dax's reply as the answer though, since that is the way I will take for now. I wasn't aware about how far C# could take me on this. — Majnu, Nov 12 '12 at 15:12

score 7 · Accepted Answer · edited May 23 '17 at 10:28

I'm curious why you're not considering LINQ as an option. It seems to satisfy all your criteria. Note I have no experience with Scala, so I can't comment on that.

The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.

Loads of data is read into .NET-DataTables and will then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever...).

Here, LINQ > F# > Clojure-CLR. If everything is already in C#, LINQ is going to be the easiest to integrate. Visual Studio support for things like intellisense and function definition navigation seems much better in a C#-only program. Calling Clojure from C# can be horrendous—in theory it should work OK, but in practice, be prepared to spend weeks figuring out why things aren't working the way you'd expect. It's really designed to be the top level thing; you call C# from Clojure, going the opposite direction really just isn't high on the Clojure-CLR developers' priority list; there's basic support, but you get what you get.

Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.

LINQ ~= F# > Clojure. I've read elsewhere that LINQ's performance can be shown to be slightly better than F# for most idiomatically-written algorithms, but they're close enough that it shouldn't matter. PLINQ makes parallelism easy. Clojure-CLR has mega-slow startup time, and the runtime overhead slows things down too.

The language (and the compiler) should be mature.

LINQ >= F# > Clojure. Not to say F# is immature at all, but Visual Studio support lags behind a bit, and there's much more production code in the world (and a lot more stack overflow answers) based on LINQ than F#.

Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more "pure" mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it.

None of the languages are pure pure like Haskell, but in terms of how difficult it makes it to write non-pure code, I'd rank it as: LINQ > Clojure > F# > Scala. LINQ can only be made impure by calling impure methods. Clojure has refs and atoms, F# anything can be designated mutable, and Scala (per my understanding) really is just Java with functional features bolted on.

The functional feature that F# and Scala both have going for them though is language support for pattern matching. Where in C# you'd need either some kind of inheritance hierarchy or chains of b?x:y operators to do things functionally (or if/else if you're fine with non-functional approach), pattern matching makes conditional operation on different variations of raw data types much more succinct. This maybe could be useful in your calculation of exact vs prefix vs fuzzy match rankings, however a b?x:y chain var alg = x.match == exact ? alg1 : x.match == prefix ? alg2 : alg3 in C# would be perfectly legible in this simple case—it's when the matching gets much more complex that language integrated pattern matching becomes more valuable.

Interestingly, I think the one aspect of your toolkit where F# would prove more useful than LINQ isn't the querying, which the name of LINQ itself should indicate it can handle, but the parsing of your search string into an expression tree. This is one area where functional languages and pattern matching really excels, and add in tools like FsLex and FsYacc can give you a big head start.

All that said, I think the decision comes down to where you're wanting to go. If you just want to clean up your search algorithms and be done with it, I'd advise the LINQ approach. But if you're wanting to piece-by-piece get into a more functional-oriented style for the whole program (and your company is willing to pay for the time you're committing to it), then do maybe look at the F# option. Either way I'd do the LINQ option first, as that will likely be more straightforward to you, and help guide your F# to be more functionally idiomatic once you start down that path.

Simplistically, here's what you'd want, just fill in your functions for your Near and Equal fetchers, and your GetRank and GetStrings functions, and use the below

static IEnumerable<Record> FetchRecords(this Tree tree) {
    return tree.Op == "OR"    ? tree.Args.SelectMany(FetchRecords).Distinct() :
           tree.Op == "AND"   ? tree.Args.Select(FetchRecords).Aggregate((intersect, current) => intersect.Intersect(current)) :
           tree.Op == "NEAR"  ? FetchValsNear(tree.Args[0].Op, tree.Args[1].Op) :
                                FetchValsEqual(tree.Op);
}

static IEnumerable<Record> FetchValsEqual(string s) {
    throw new NotImplementedException();
}

static IEnumerable<Record> FetchValsNear(string s1, string s2) {
    throw new NotImplementedException();
}

static IEnumerable<Tuple<Record, double, string[]>> OrderByRank(this IEnumerable<Record> vals) {
    return from val in vals
           let rank = GetRank(val)
           orderby rank
           let strings = GetStringsIn(val)
           select Tuple.Create(val, rank, strings);
}

static string[] GetStringsIn(Record val) {
    throw new NotImplementedException();
}

static double GetRank(Record val) {
    throw new NotImplementedException();
}

class Tree {
    public string Op;
    public Tree[] Args;
}

struct Record {/*your record here--use struct so Distinct and Intersect above work naturally (or use class and override Equals)*/}

like this:

foreach (var tuple in myTree.FetchRecords().AsParallel().OrderByRank().Take(30)) {
    // add to datagrid or whatever
}

This gives you both simple parallelizability, and laziness so that the GetStringsIn function is only executed on the records you take (in this case the top 30). (Note the AND selector can be simplified using any of the IntersectAll examples here).

Hi Dax, first of all thank you for your generous answer! I would indeed like to transform the current algorithm from a looping, state changing monster into functional, declarative code that much more states what the outcome should look like, rather than what all needs to be done to get there. Not only for beauty's sake but also for maintainability and extensibility. I have used LINQ for folding and mapping collections, and thus avoiding self written loops. I will, however, need to look into LINQ a little deeper at first, in order to understand how I could use it here. Will get back to you! — Majnu, Nov 05 '12 at 19:07
The biggest drawback of using LINQ is *extensibility*. As you said yourself, C# lacks of first-class functions, pattern matching, a powerful type system, etc. As the application evolves, you will be bitten by the limitations of the host language. To me, using real functional programming languages also means paradigm shift; they force me to think in the *declarative* way, which I can easily deviate from in C#. — pad, Nov 05 '12 at 19:25
@pad, I agree 100%, you can't write a whole application in LINQ. But given his requirements, if time and money are finite, and he just wants to clean up some query/aggregation code, LINQ will do the job just as well, and will be far more pragmatic in my mind. With greater time and scope, F# will certainly be more of an experience, but will be a steeper learning curve and may or may not be a better fit for the particular job. Anyway just wanted to add the option to his list of considerations. — Dax Fohl, Nov 05 '12 at 20:37
Hi Dax, thanks for the awsome code samples! Yes it is imperative, but way more elegant than what I have right now. I will need to see whether this approach would be able to cover all of my present (and some future) requirements and will get back to you on that. Maybe I need a day or two since I am busy with some other stuff right now... — Majnu, Nov 06 '12 at 21:15
@pad: I don't know whether this is a litte much to ask, but would you be willing to show roughly how you would implement this code in F#? Just to get an idea. — Majnu, Nov 06 '12 at 21:17
@majnu: I added an F# example. Hope it is clear enough for you to grasp the idea. — pad, Nov 06 '12 at 23:03
@majnu, I think you're misunderstanding something—not all C# code is necessarily imperative, and not all F# code is necessarily declarative. Both C# and F# are hybrid languages capable of writing imperative and declarative code. LINQ and b?x:y chains are declarative in C#, and anything that pulls from a database or throws exceptions will be imperative in F#. So the two solutions given actually use an equal mixture of declarative and imperative codes; F# just makes it a little more idiomatic. — Dax Fohl, Nov 07 '12 at 02:10
@Dax: Sorry for the late reply, my son needed to go to hospital, but is all well now. I read a lot on LINQ an PLINQ in the mean time and decided that this indeed will be the right way to go for now. It was quite an eye opener for me to see what all can be done with lambda expressions! One main consideration for me is also that I feel I can control performance issues better this way, since I really know what the program is doing and when it is doing what. With F# that does not seem obvious to me (yet). Thanks again for your input! — Majnu, Nov 12 '12 at 15:04

Ideal functional language for implementing a full text search with .NET

2 Answers2