3

Suppose I have the following data type which maps my database schema.

data Object = Object
    { classification :: Text
    , country :: Text
    , numberOfParts :: Int
    -- Lots of other fields
    }

I want to extract statistics for all objects in the database. For example, I want to extract the frequency of every field in the Person data constructor. So I would have the following function :

-- In the return type, the key of the Map is the field name. 
-- Each map value represents possible values with frequency 
-- (ex. "classification" -> [("table", 10), ("chair", 3), ("computer", 2)])
generateStats :: [Object] -> Map Text [(Text, Integer)]

This function would calculate the frequency of every field, so I would have to call id object, classification object, country object, etc. If the datatype has 50 fields, I would have to call 50 functions to access those fields.

Is there a way to generalize this ?

Can it be generalized to any data constructor ?

Is there a more elegant way to solve this type of problem ?

kolam
  • 731
  • 4
  • 17
  • Generics or ScrapYourBoilerplate might help here. I'm not terribly familiar with those to suggest a solution, though. – chi Jan 27 '20 at 21:35
  • 2
    What's `generateStats` supposed to do for the non-`Text` fields like `numberOfParts`? Do you want to skip them? – K. A. Buhr Jan 27 '20 at 22:03
  • @K.A.Buhr In my initial implementation, I convert every value to Text – kolam Jan 28 '20 at 17:05

2 Answers2

3

This sort of problem can be solved with generics. Usually, the syb package (Data.Generics or Data.Data or SYB or "scrap your boilerplate" generics) is the easiest to use, so it's worth trying it first and moving on to more complicated libraries only if you can't get it to work for a particular task.

Here, syb provides a straightforward way of retrieving the list of field names from a record constructor. If you derive a Data instance for some Object:

{-# LANGUAGE DeriveDataTypeable #-}
import Data.Generics
import Data.Text (Text)
import qualified Data.Text as Text
data Object = Object
    { classification :: Text
    , country :: Text
    , numberOfParts :: Int
    } deriving (Data)

then you can fetch the field names at runtime with the following function:

-- Get field names (empty list if not record constructor)
getnames :: Data object => object -> [Text]
getnames = map Text.pack . constrFields . toConstr

like so:

λ> :set -XOverloadedStrings
λ> getnames $ Object "prime" "Canada" 5
["classification","country","numberOfParts"]

You can fetch field values as Text at runtime using a generic query gmapQ and writing a generic helper function toText that converts field values of various types to Text:

-- Get field values as Text.
getfields :: Data object => object -> [Text]
getfields = gmapQ toText

The toText function has type:

toText :: (Data a) => a -> Text

and needs to be prepared to handle any possible field encountered. A limitation of Data.Data generics is that you can only handle a fixed set of explicit types with a default value for "the rest". Here, we handle Text, String, Int, and Double types and throw an error with unknown for "the rest":

{-# LANGUAGE TypeApplications #-}

toText = mkQ unknown           -- make a query with default value "unknown"
                id             -- handle:          id     :: Text -> Text
         `extQ` Text.pack      -- extend to:       pack   :: String -> Text
         `extQ` tshow @Int     -- extend to:       tshow  :: Int -> Text
         `extQ` tshow @Double  -- extend to:       tshow  :: Double -> Text
  where tshow :: (Show a) => a -> Text
        tshow = Text.pack . show
        unknown = error "unsupported type"

If you wanted to handle all types with a Show (or some other) instance, then syb won't do the job. (If you tried dropping the type application above and writing `extQ` tshow to handle all Show cases, you'd get an error.) Instead, you'd need need to upgrade to syb-with-class or some other generics library to handle this.

With all that in place, getting a list of key/value pairs from any object is straightword:

getpairs :: Data object => object -> [(Text,Text)]
getpairs = zip <$> getnames <*> getfields

This works on Objects:

λ> concatMap getpairs [Object "prime" "Canada" 5, Object "substandard" "Fakeistan" 100]
[("classification","prime"),("country","Canada"),("numberOfParts","5")
,("classification","substandard"),("country","Fakeistan"),("numberOfParts","100")]

or anything else with a Data instance. Sum types and record-less constructors should work okay. With the type:

data OtherObject = Foo { foo :: String, factor :: Double }
                 | Bar { bar :: Int }
                 | NotARecord Int Int Int
                 deriving (Data)

we get:

λ> getpairs $ Foo "exchange" 0.75
[("foo","exchange"),("factor","0.75")]
λ> getpairs $ Bar 42
[("bar","42")]
λ> getpairs $ NotARecord 1 2 3
[]

Here's a complete code example:

{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

import Data.Generics
import Data.Text (Text)
import qualified Data.Text as Text

data Object = Object
    { classification :: Text
    , country :: Text
    , numberOfParts :: Int
    } deriving (Data)

data OtherObject = Foo { foo :: String, factor :: Double }
                 | Bar { bar :: Int }
                 | NotARecord Int Int Int
                 deriving (Data)

-- Get field names (empty list if not record constructor)
getnames :: Data object => object -> [Text]
getnames = map Text.pack . constrFields . toConstr

-- Get field vales as Text.
getfields :: Data object => object -> [Text]
getfields = gmapQ toText

-- Generic function to convert one field.
toText :: (Data a) => a -> Text
toText = mkQ unknown           -- make a query with default value "unknown"
                id             -- handle:          id     :: Text -> Text
         `extQ` Text.pack      -- extend to:       pack   :: String -> Text
         `extQ` tshow @Int     -- extend to:       tshow  :: Int -> Text
         `extQ` tshow @Double  -- extend to:       tshow  :: Double -> Text
  where tshow :: (Show a) => a -> Text
        tshow = Text.pack . show
        unknown = error "unsupported type"

-- Get field name/value pairs from any `Data` object.
getpairs :: Data object => object -> [(Text,Text)]
getpairs = zip <$> getnames <*> getfields

main :: IO ()
main = mapM_ print $
  [ getpairs $ Object "prime" "Canada" 5
  , getpairs $ Foo "exchange" 0.75
  , getpairs $ Bar 42
  , getpairs $ NotARecord 1 2 3
  ]
K. A. Buhr
  • 45,621
  • 3
  • 45
  • 71
2

This solution depends on the generics machinery from generics-sop and the streaming sinks from foldl.

Some required pragmas and imports:

{-# LANGUAGE DeriveGeneric,DeriveAnyClass,ScopedTypeVariables,FlexibleContexts,
             GADTs,TypeApplications,OverloadedStrings,StandaloneDeriving, TypeOperators #-}
module Main (main) where

import qualified GHC.Generics as GHC
import Generics.SOP (All,And,IsProductType,productTypeFrom,
                     DatatypeInfo(..),datatypeInfo,
                     ConstructorInfo(..),FieldInfo(..),FieldName,                         
                     projections, I(..), K(..),type (-.->)(Fn),type (:.:)(Comp),
                     Generic,HasDatatypeInfo)
import Generics.SOP.NP -- All the *_NP functions come form here
import Generics.SOP.Dict (Dict(..),zipAll)
import qualified Control.Foldl as L
import Data.Proxy
import Data.Text (Text)
import qualified Data.Map.Strict as Map

Datatypes and functions to calculate histograms, which aren't tied to any concrete record:

newtype Histogram a = Histogram (Map.Map a Int) deriving Show

-- Hides the exact type of the key behind an existential
data SomeHistogram = forall a. (Ord a, Show a) => SomeHistogram (Histogram a) 
deriving instance Show SomeHistogram

-- Streaming sink for a single field
histogram :: Ord a => L.Fold a (Histogram a)
histogram = (L.Fold step Map.empty Histogram) 
  where
    step m a = Map.insertWith (+) a 1 m

-- For any record with Generics.SOP.Generic instance, 
-- create a streaming sink that accepts record values and
-- returns a list of histograms, one for each field
recordHistogram :: forall r xs . (IsProductType r xs, All Ord xs, All Show xs)
                => L.Fold r [SomeHistogram]
recordHistogram = 
    let productOfFolds = 
            cliftA_NP 
                (Proxy @Ord)
                (\(Fn proj) -> 
                     Comp (L.premap (\o -> let np = productTypeFrom @r @xs o
                                               I r = proj (K np)
                                            in r) 
                                    histogram))
                (projections @xs)
        foldToProduct = sequence'_NP productOfFolds -- pull the Fold outward
        -- convince GHC that we have a combination of Ord and Show for all fields
        ordAndShow = zipAll (Dict @(All Ord) @xs) (Dict @(All Show) @xs)
        foldToList = case ordAndShow of -- collapse result of Fold into a list
            Dict -> collapse_NP . cliftA_NP (Proxy @(Ord `And` Show)) (K . SomeHistogram) 
                    <$> 
                    foldToProduct
     in foldToList

In case we want a list of field names to zip with the list of histograms:

fieldNamesOf :: forall r xs. (IsProductType r xs, HasDatatypeInfo r) 
             => Proxy r 
             -> [FieldName] 
fieldNamesOf _ =
    case datatypeInfo (Proxy @r) of
        ADT _ _ ((Record _ fields) :* Nil) _ -> 
            collapse_NP (liftA_NP (\(FieldInfo i) -> K i) fields)
        _ -> error "hey, not a record!"

Putting it all to work with Object:

data Object = Object
    { classification :: Text
    , country :: Text
    , numberOfParts :: Int
    } deriving (GHC.Generic,Generic,HasDatatypeInfo) 
-- Generic and HasDatatypeInfo from generics-sop

main :: IO ()
main = print $ L.fold recordHistogram [Object "foo" "Spain" 4, Object "bar" "France" 4]

This solution has two potential problems:

  • Internally, recordHistogram uses n-ary products from generics-sop. Constructing and traversing these products might incur in some overhead.
  • There might be some space leak in the streaming sink (the Fold) returned by recordHistogram. Some extra strictness might be necessary.
danidiaz
  • 26,936
  • 4
  • 45
  • 95
  • 1
    You don't actually need to work with `Dict` here. You can change the type signature for `recordHistogram` to require ``All (Ord `And` Show) xs`` and then pass ``Proxy @(Ord `And` Show)`` in `productOfFolds` as well. – kosmikus Feb 21 '20 at 10:50