8

I try to traverse the directory tree. A naive depth-first traversal seems not to produce the data in a lazy fashion and runs out of memory. I next tried a breadth first approach, which shows the same problem - it uses all the memory available and then crashes.

the code I have is:

getFilePathBreadtFirst :: FilePath -> IO [FilePath]
getFilePathBreadtFirst fp = do
  fileinfo <- getInfo fp 
  res :: [FilePath]  <- if isReadableDirectory fileinfo
          then do
                children  <- getChildren fp 
                lower    <-  mapM getFilePathBreadtFirst children  
                return (children ++  concat lower)
           else return [fp]        -- should only return the files? 
  return res 

getChildren :: FilePath -> IO [FilePath]
getChildren path = do 
          names <- getUsefulContents path
          let namesfull = map (path </>) names
          return namesfull

testBF fn = do  -- crashes for /home/frank, does not go to swap 
  fps <- getFilePathBreadtFirst fn
  putStrLn $ unlines fps

I think all the code is either linear or tail recursive, and I would expect that the listing of filenames starts immediately, but in fact it does not. Where is the error in my code and my thinking? Where have I lost lazy evaluation?

user855443
  • 2,596
  • 3
  • 25
  • 37

2 Answers2

7

I will use three separate tricks to solve your question.

  • Trick 1: Use the pipes library to stream file names concurrent with traversing the tree.
  • Trick 2: Use the StateT (Seq FilePath) transformer to achieve a breadth-first traversal.
  • Trick 3: Use the MaybeT transformer to avoid manual recursion when writing the loop and exit.

The following code combines these three tricks in one monad transformer stack.

import Control.Monad
import Control.Monad.Trans
import Control.Monad.Trans.Maybe
import Control.Monad.State.Lazy
import Control.Pipe
import Data.Sequence
import System.FilePath.Posix
import System.Directory

loop :: (Monad m) => MaybeT m a -> m ()
loop = liftM (maybe () id) . runMaybeT . forever

quit :: (Monad m) => MaybeT m a
quit = mzero

getUsefulContents :: FilePath -> IO [FilePath]
getUsefulContents path
  = fmap (filter (`notElem` [".", ".."])) $ getDirectoryContents path

permissible :: FilePath -> IO Bool
permissible file
  = fmap (\p -> readable p && searchable p) $ getPermissions file

traverseTree :: FilePath -> Producer FilePath IO ()
traverseTree path = (`evalStateT` empty) $ loop $ do
    -- All code past this point uses the following monad transformer stack:
    -- MaybeT (StateT (Seq FilePath) (Producer FilePath IO)) ()
    let liftState = lift
        liftPipe  = lift . lift
        liftIO    = lift . lift . lift
    liftState $ modify (|> path)
    forever $ do
        x <- liftState $ gets viewl
        case x of
            EmptyL    -> quit
            file :< s -> do
                liftState $ put s
                liftPipe $ yield file
                p <- liftIO $ doesDirectoryExist file
                when p $ do
                    names <- liftIO $ getUsefulContents file
                    -- allowedNames <- filterM permissible names
                    let namesfull = map (path </>) names
                    liftState $ forM_ namesfull $ \name -> modify (|> name)

This creates a generator of breadth-first file names that can be consumed concurrent with the tree traversal. You consume the values using:

printer :: (Show a) => Consumer a IO r
printer = forever $ do
    a <- await
    lift $ print a

>>> runPipe $ printer <+< traverseTree path
<Prints file names as it traverses the tree>

You can even choose to not demand all the values:

-- Demand only 'n' elements
take' :: (Monad m) => Int -> Pipe a a m ()
take' n = replicateM_ n $ do
    a <- await
    yield a

>> runPipe $ printer <+< take' 3 <+< traverseTree path
<Prints only three files>

More importantly, that last example will only traverse the tree as much as necessary to generate the three files and then it will stop. This prevents wastefully traversing the entire tree when all you wanted was 3 results!

To learn more about the pipes library trick, consult the pipes tutorial at Control.Pipes.Tutorial.

To learn more about the loop trick, read this blog post.

I couldn't find a good link for the queue trick for breadth first traversal, but I know it's out there somewhere. If somebody else knows a good link for this, just edit my answer to add it.

Gabriella Gonzalez
  • 34,863
  • 3
  • 77
  • 135
  • thank you for your code. it is a great help to understand pipes. i was reading about conduits and was planning to use it, but expected that i should first have a simple lazy solution for only the tree traversal. i tried it and it works, but it does not recurse down the tree and i do not understand where it would recurse in your code. the missing code is filtering out "." and ".." from the list of dirs getUsefulContents path = do names <- getDirectoryContents path return (filter (`notElem` [".", ".."]) names) – user855443 Sep 27 '12 at 06:14
  • on deeper inspection i see the (hidden) recursion in the last line with the liftstate, where the new filenames are added to the "todo" list. i did not see this, because the code does not produce the full filepath for the added files. the value of path is the original start value and not set each time to the current file name -> replace path with file, then it works. to completely work, one has to check permissions on the directory, which i do with getInfo :: FilePath -> IO Info which i took from real world haskell chapter 9. – user855443 Sep 27 '12 at 07:02
  • it runs into difficulties, when it encounters links and i have to add a test to filter out links as well. it works and uses all my 4 cores! there is still a memory leak, as usage grows very slowly till it runs out of memory. can you see where? your help is much appreciated, it was exactly what i needed to have a good practical example of how to use pipes when traversing a tree! – user855443 Sep 27 '12 at 07:04
  • I added your changes and I observe the memory leak, too. I believe the issue is due to `forever` leaking space according to [this GHC bug](http://hackage.haskell.org/trac/ghc/ticket/5205), which will be fixed in an upcoming release, but I'm not 100% sure yet. I will try again later to see if I can prove if that is the case. – Gabriella Gonzalez Sep 27 '12 at 17:04
0

i have separated the managemetn of the pipe and the tree traversal. here first the code for the pipe (essentially the code of gonzales - thank you!):

traverseTree :: FilePath -> Producer FilePath IO ()
-- ^ traverse a tree in breadth first fashion using an external doBF function 
traverseTree path = (`evalStateT` empty) $ loop $ do
-- All code past this point uses the following monad transformer stack:
-- MaybeT (StateT (Seq FilePath) (Producer FilePath IO)) ()
let liftState = lift
    liftPipe  = lift . lift
    liftIO    = lift . lift . lift
liftState $ modify (|> path)
forever $ do
    x <- liftState $ gets viewl
    case x of
        EmptyL    -> quit
        file :< s -> do
            (yieldval, nextInputs) <- liftIO $ doBF file 
            liftState $ put s
            liftPipe $ yield yieldval
            liftState $ forM_ nextInputs $ \name -> modify (|> name)

next the code for the tree traversal:

doBF :: FilePath -> IO (FilePath, [FilePath])
doBF file = do 
    finfo <- getInfo file
    let p =  isReadableDirectoryNotLink finfo
    namesRes <- if p then do
        names :: [String] <- liftIO $ getUsefulContents file
        let namesSorted = sort names
        let namesfull = map (file </>) namesSorted
        return namesfull
        else return []          
    return (file, namesRes) 

I hope to replace doBF with a similar function to traverse depth first. i assumed that i could make traverseTree more general and not only for FilePath ~ String, but i do not see in which class the empty function on sequences is. could be generally useful.

user855443
  • 2,596
  • 3
  • 25
  • 37