Haskell complete text file indexer

Question

After seeing an earlier post about text file indexing, I became curious, since I'm currently learning Haskell (which isn't going well), whether the same goal can be achieved easily in Haskell: How do I create a Java string from the contents of a file?

The objective of the class is to read a file, alphabetically order the words within the file and display the line numbers these words appear on.

Could a Haskell function be built using a similar approach?

Piotr Miś · Answer 1 · 2014-05-07T18:15:50.440

Here is a function that will do the heavy lifting you need:

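import Data.List (sortBy)
import Data.Ord (comparing)

-- Pair every word with its line number, then sort the pairs by word.
sortWords :: String -> [(Int,String)]
sortWords contents = sortBy (comparing snd) ys
  where xs = zip [1..] (lines contents)                               -- number the lines
        ys = concatMap (\(n,line) -> zip (repeat n) (words line)) xs  -- tag each word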

The only thing left to do is to write some simple IO code to read a file and nicely print results. If you run this function on the following input:

Here are some test
lines to see if this works...
What: about? punctuation!

You will get this:

(1,"Here")
(3,"What:")
(3,"about?")
(1,"are")
(2,"if")
(2,"lines")
(3,"punctuation!")
(2,"see")
(1,"some")
(1,"test")
(2,"this")
(2,"to")
(2,"works...")
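
The IO code that the answer leaves to the reader could look roughly like the sketch below. It repeats sortWords so that it compiles on its own. The helper formatEntry, the output layout, and reading the file path from the command line are assumptions made for illustration, not part of the original answer.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)
import System.Environment (getArgs)

-- Same logic as the answer's function, repeated so this sketch is self-contained.
sortWords :: String -> [(Int, String)]
sortWords contents = sortBy (comparing snd) ys
  where
    xs = zip [1 ..] (lines contents)
    ys = concatMap (\(n, line) -> zip (repeat n) (words line)) xs

-- Hypothetical helper: render one entry as "word (line N)".
formatEntry :: (Int, String) -> String
formatEntry (n, w) = w ++ " (line " ++ show n ++ ")"

-- Assumption: the file to index is given as the single command-line argument.
main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> readFile path >>= mapM_ (putStrLn . formatEntry) . sortWords
    _      -> putStrLn "usage: indexer FILE"
```

Punctuation is still treated as part of a word here, exactly as in the output above.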

score 6 · Answer 2 · answered May 07 '14 at 16:07

Here's a Haskell solution:

import Data.List (sortBy)
import Data.Function (on)

numberWordsSorted :: String -> [(String,Int)]
numberWordsSorted  = sortBy (compare `on` fst)
                   . concat 
                   . zipWith (\n xs -> [(x,n)|x<- xs]) [1..]   -- add line numbers
                   . map words
                   . lines

main :: IO ()
main = fmap numberWordsSorted (readFile "example.txt") >>= mapM_ print

If you run it on example.txt with contents

the quick brown fox
jumps over the lazy dog

you get

("brown",1)
("dog",2)
("fox",1)
("jumps",2)
("lazy",2)
("over",2)
("quick",1)
("the",1)
("the",2)

You should use Data.Map instead of pairs if you'd rather see [1,2] than two lines of output.
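
A minimal sketch of that Data.Map suggestion, assuming the containers package is available (it ships with GHC). The name indexWords is made up for this example and does not come from the answer; it maps each word to the list of line numbers it appears on.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical name: build a map from each word to the lines it occurs on.
-- fromListWith calls its function as (f new old), so flip (++) appends the
-- newer line number after the older ones and keeps each list ascending.
indexWords :: String -> Map.Map String [Int]
indexWords contents =
  Map.fromListWith (flip (++))
    [ (w, [n]) | (n, line) <- zip [1 ..] (lines contents), w <- words line ]

-- Map.toList returns the entries in ascending key order, so no explicit sort is needed.
main :: IO ()
main = readFile "example.txt" >>= mapM_ print . Map.toList . indexWords
```

On the example.txt above this prints one line per distinct word, ending with ("the",[1,2]). A word repeated on the same line gets that line number listed twice; collecting into a Data.Set instead of a list would remove such duplicates.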

AndrewC

That's great, but I am also getting lots of other details of the file in my output, such as the font and document sizes. Could that be because the file is '.rtf'? – dizzytri99er May 08 '14 at 17:31
@dizzytri99er It's definitely because it's rtf - open it in notepad to see the data. You could open it in a word processor and save it as text to get rid of the formatting (easy, quick), or find an rtf library to parse it. – AndrewC May 08 '14 at 20:55
Yeah, you were right. Only gotta remove the stop words now (such as 'and') :( – dizzytri99er May 08 '14 at 21:17
