I want to represent PDF files for parsing and printing and am struggling to find good types for this.
PDF files contain values, which can be text, names (identifiers), dictionaries mapping names to values, and many others that I left out of these examples. I started out with something like this:
data Value = Text String | Name String | Dictionary [(String, Value)]
instance Show Value where
show (Text text) = "(" ++ text ++ ")"
show (Name name) = "/" ++ name
show (Dictionary entries) = "<<" ++ unlines (showEntry <$> entries) ++ ">>" where
showEntry (key, value) = show (Name key) ++ " " ++ show value
Unfortunately, it would be easy for showEntry
to accidentally use just show key
or even show (Text key)
. The type system does not help with picking the right implementation. The fact that dictionary keys are names is not captured by their type, which is just String
.
One could solve this by modeling keys as values:
data Value = Text String | Name String | Dictionary [(Value, Value)]
instance Show Value where
show (Text text) = "(" ++ text ++ ")"
show (Name name) = "/" ++ name
show (Dictionary entries) = "<<" ++ unlines (showEntry <$> entries) ++ ">>" where
showEntry (key, value) = show key ++ " " ++ show value
This way, showEntry
gets a key
of type Value
and thus automatically uses the correct implementation. However, this is arguably even worse, as now invalid Dictionary
values with keys that are not names can be represented.
My next idea was to use separate types. Just like names are used in dictionaries, so are text and dictionaries used in other data structures, so they should also get their own types:
data Text = Text String
data Name = Name String
data Dictionary = Dictionary [(Name, Value)]
data Value = TextValue Text | NameValue Name | DictionaryValue Dictionary
instance Show Text where show (Text text) = "(" ++ text ++ ")"
instance Show Name where show (Name name) = "/" ++ name
instance Show Dictionary where
show (Dictionary entries) = "<<" ++ unlines (showEntry <$> entries) ++ ">>" where
showEntry (key, value) = show key ++ " " ++ show value
instance Show Value where
show (TextValue text) = show text
show (NameValue name) = show name
show (DictionaryValue dictionary) = show dictionary
The types now accurately represent the structure of the data and all is well. Unfortunately, this feels very ugly and redundant. To construct values, one now needs twice as many constructors:
DictionaryValue (Dictionary [(Name "foo", TextValue (Text "bar")), (Name "test", NameValue (Name "baz"))])
It feels like the tagged unions in ADTs just get in the way here, since the tags are unneeded, with the types already uniquely determining which case is being chosen.
Is this the best we can do or is there a nicer way for this scenario?
I imagine this type of problem comes up all the time in parsing formats that allow for nested values (like arithmetic expressions, XML, JSON, DSLs, etc.). What is the canonical/usual representation people use for this?