1. Unboxed types are the primitives in Haskell. For example, Int is defined (in GHC) as data Int = GHC.Types.I# GHC.Prim.Int#. The trailing # symbol indicates a primitive (this is only a convention). Primitives don't really exist in Haskell: you can't define additional primitives, and when they appear in code, the compiler is responsible for translating them into 'real' function calls (functions can be primitives too) and datatypes.
Yes, newtype does not add an extra layer of 'boxing'. But a newtype cannot contain a primitive: newtype Int2 = Int2 Int# is invalid, while data Int2 = Int2 Int# is fine.
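As a minimal compilable sketch of that distinction (Int2, addInt2, and toInt are invented names for illustration), GHC accepts an unboxed field in a data declaration but rejects it in a newtype:

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int#, I#, (+#))

-- A data constructor may contain an unboxed primitive field:
data Int2 = Int2 Int#

-- A newtype may not, because a newtype must wrap a lifted (boxed)
-- type; GHC rejects the following with a compile error:
-- newtype Bad = Bad Int#

-- Arithmetic on the unboxed field uses primitive operations like (+#).
addInt2 :: Int2 -> Int2 -> Int2
addInt2 (Int2 a) (Int2 b) = Int2 (a +# b)

-- Convert back to an ordinary boxed Int via the I# constructor.
toInt :: Int2 -> Int
toInt (Int2 n) = I# n

five :: Int
five = toInt (addInt2 (Int2 2#) (Int2 3#))

main :: IO ()
main = print five
```

Note that unboxed literals like 2# and the Int# type require the MagicHash extension.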
The main difference between primitive and boxed types, in the context of the question you linked, is how they are represented in memory. A primitive type means there are no pointers to follow: a pointer to an Int# must point to the value of the number itself, whereas a pointer to an Int may point to a thunk, which points to another thunk, and so on. Note that this means that primitives are always strict. If you believe this will be an issue, use the UNPACK pragma, which removes any 'intermediate' boxing. That is, data D = D (Int, Int) is stored as a pointer (D) to a pointer (the tuple) to a block of memory containing two pointers (the Ints), each of which points to an actual Int#. However, data D = D {-# UNPACK #-} !(Int, Int) is stored as a pointer (D) directly to two Ints, thereby removing one level of boxing. Note the !: it indicates that the field is strict, and it is required for UNPACK.
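The two layouts above can be written out as a compilable sketch (Boxed, Unpacked, and the sum helpers are invented names; the layout descriptions are in the comments):

```haskell
module Main where

-- Lazy, boxed field: a pointer to a (possibly thunked) tuple,
-- which in turn holds pointers to two boxed Ints.
data Boxed = Boxed (Int, Int)

-- Strict, unpacked field: the ! forces the tuple at construction,
-- and UNPACK stores its two Int fields directly in the constructor,
-- removing the tuple indirection.
data Unpacked = Unpacked {-# UNPACK #-} !(Int, Int)

sumBoxed :: Boxed -> Int
sumBoxed (Boxed (a, b)) = a + b

sumUnpacked :: Unpacked -> Int
sumUnpacked (Unpacked (a, b)) = a + b

main :: IO ()
main = do
  print (sumBoxed (Boxed (1, 2)))       -- tuple may still be a thunk until here
  print (sumUnpacked (Unpacked (3, 4))) -- tuple was forced at construction
```

Without the !, GHC would reject the UNPACK pragma, since a lazy field may hold a thunk and so cannot be flattened.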
2. Any data which is going to be passed to polymorphic functions should be kept packed, since unpacked data passed to a polymorphic function will be repacked anyway (introducing unnecessary overhead). The reasoning behind keeping long-lived data packed is that it is more likely to flow through an intermediate datatype or function which will require repacking, while this is easier to control with short-lived data which is only passed to a few functions before being garbage collected.
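The repacking overhead can be sketched as follows (P and reboxed are invented names; the cost described in the comments is what GHC typically does here, not something observable from the code itself):

```haskell
module Main where

-- A strict, unpacked Int field: the machine integer is stored
-- directly in the P constructor, with no pointer to a boxed Int.
data P = P {-# UNPACK #-} !Int

-- id is polymorphic (id :: a -> a), so it can only work with boxed,
-- lifted values. To call it, GHC must allocate a fresh boxed Int
-- around the unpacked field -- this is the repacking overhead.
reboxed :: P -> Int
reboxed (P x) = id x

main :: IO ()
main = print (reboxed (P 7))
```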
3. In 99% of cases, you won't have issues with garbage collector pauses, and in general there is nothing you can do to guarantee the GC will never pause. The only suggestion I have: don't try to reinvent the wheel. There are libraries designed for high-performance computation on large amounts of data (repa, vector, etc.). If you try to implement it yourself, chances are they did it better!