7 September 2016

Both Haskell and Purescript have a notion of "records," which are types that have a fixed set of named fields. The two languages treat them very differently, each with some advantages and disadvantages. It's no secret that doing records right is hard, and GHC is in the process of improving its record system. In this post I'm going to talk about my experience with records in each language, the problems I've encountered, as well as the partial solutions that I've found.

Haskell Records

In Haskell, records are a style of data constructor. Record constructors and non-record constructors can be mixed in the same type. So for example, you could define a type like

data Foo a = Bar a | Baz { qux :: a, quux :: String }

Let me emphasize the terminology here. Foo is a type constructor with one type argument (a), Bar is a non-record data constructor with one argument of type a, and Baz is a record data constructor with two arguments (qux of type a and quux of type String). While this is possible, I don't think I've ever wanted to define a type like this in a real project. The types I do want fall into two cases:

Named tuples

data Stuff = Stuff
    { field1 :: Integer
    , field2 :: String
    , field3 :: Double
    }

Which fields are present depends on the tag

data HowMuchStuff
    = ALittleStuff
        { bottom :: Integer }
    | SomeStuff
        { bottom :: Integer
        , middle :: String
        }
    | MoreStuff
        { bottom :: Integer
        , middle :: String
        , top :: Double
        }

You can use a record constructor either with named fields or with the arguments in order like a non-record constructor. Additionally, Haskell creates functions to retrieve the field's value. There is also special syntax for updating a field of a record.

-- These are the same value
x = Stuff { field1 = 5, field2 = "Hello", field3 = 1.5 }
x' = Stuff 5 "Hello" 1.5

-- Accessor functions
-- field1 :: Stuff -> Integer
-- field2 :: Stuff -> String
-- field3 :: Stuff -> Double
field1 x -- 5
field2 x -- "Hello"
field3 x -- 1.5

-- Updating a record
x { field1 = 3 } -- Stuff { field1 = 3, field2 = "Hello", field3 = 1.5 }

This ends up being mostly okay for the "named tuple" style types, but once you've been around Haskell for a while and define a type with multiple record constructors with different fields, you notice a glaring downside:

-- Haskell defines functions
-- bottom :: HowMuchStuff -> Integer
-- middle :: HowMuchStuff -> String
-- top :: HowMuchStuff -> Double

bottom (ALittleStuff 5) -- 5
bottom (SomeStuff 3 "Hello") -- 3
middle (MoreStuff 1 "World" 2.3) -- "World"
top (ALittleStuff 4) -- Runtime error!

One of the biggest benefits of using Haskell is that the type system often prevents you from running code that does nonsensical like trying to extract a field that doesn't exist. But when you define a type with multiple record constructors, suddenly the compiler is forcing you to have these functions where matching types isn't sufficient to avoid runtime errors.

There are also some other annoyances with Haskell's record system. First, you can't define multiple types that have the same field name (although GHC 8.0 has taken the first step toward allowing this). Second, updating a field inside multiple layers of record nesting is quite nasty:

data A = A { fieldA :: B }
data B = B { fieldB :: C }
data C = C { fieldC :: Integer }

x = A { fieldA = B { fieldB = C { fieldC = 5 } } }
y = let b = fieldA x
        c = fieldB b
    in A { fieldA = b { fieldB = c { fieldC = 7 } } }

For now, let's leave Haskell for a little bit and talk about how Purescript handles these ideas.

Purescript

Rather than associating records with data constructors, Purescript has a notion of a record type. The type { foo :: Int, bar :: String } is a record type with exactly two fields: foo of type Int and bar of type String. So while you can write types similarly to Haskell, the meaning is actually quite different.

newtype Foo = Foo
    { foo :: Int
    , bar :: String
    }

You might notice that I wrote newtype here instead of data. In Haskell that would be impossible because the Foo data constructor takes two arguments with types Int and String. However, in Purescript, Foo is actually taking a single argument with type { foo :: Int, bar :: String }, and therefore can be defined as a newtype.

Purescript allows record access with dot syntax similar to other languages: x.foo gives the foo field of the value x. This raises a question: What is the type of the function \x -> x.foo? x should be able to be any record type with a foo field. To resolve this, Purescript has the notion of extensible records.

The type of \x -> x.foo is written as ( foo :: a | r ) -> a. To match a type against the pattern ( foo :: a | r ), Purescript checks that the type is a record type and that it has a field of name foo, and then binds a to the type of that field and r to the record type with the rest of the fields.

This lets you write data types mostly how you want, with no naming restrictions and so on. However, there is a pretty significant downside when you have nested data types that are each newtype wrappers around records. Let's imagine we have these type definitions:

newtype A = A { fieldA :: B }
newtype B = B { fieldB :: C }
newtype C = C { fieldC :: Int }

In Haskell, accessing the Int from a given value of type A isn't too bad, because we have accessor functions fieldA :: A -> B, fieldB :: B -> C, fieldC :: C -> Int. In Purescript, however, we end up needing to unwrap the newtype with a case statement at each level:

x = A { fieldA: B { fieldB: C { fieldC: 5 } } }
x.fieldA.fieldB.fieldC -- Invalid because `x` is of type `A` which is not a record type
                       -- (It is a newtype wrapper around a record type).
-- To access the nested field, pattern match at each level:
case x of
    A a -> case a.fieldA of
        B b -> Case b.fieldB of
            C c -> c.fieldC

In contrast to Haskell, both accesses and updates are a pain in this setting!

Lenses

If you read my post from two weeks ago, it probably is no surprise that lenses solve a lot of these problems. Unfortunately, lenses have a reputation of being hard to understand. The main important type is type LensLike f s t a b = (a -> f b) -> s -> f t. It is not immediately obvious why this type is helpful at all or what it even means. This series of posts does a great job of elucidating the answer to that question.

As I learned more about lenses, it became clear to me that lenses really capture what it means for a type to be embedded in another type, which includes record fields as a special case. Imagine a world where instead of accessor functions, Haskell produced lenses for the fields of record types. Suddenly, with a few small functions, updating nested fields becomes much easier:

data A = A { fieldA :: B }
data B = B { fieldB :: C }
data C = C { fieldC :: Integer }

-- fieldA :: Lens' A B
-- fieldB :: Lens' B C
-- fieldC :: Lens' C Integer
-- view :: Lens' s a -> s -> a
-- set :: Lens' s a -> s -> a -> s

x = A { fieldA = B { fieldB = C { fieldC = 5 } } }
view (fieldA . fieldB . fieldC) x -- 5
set (fieldA . fieldB . fieldC) x 7
    -- A { fieldA = B { fieldB = C { fieldC = 7 } } }

This is a massive ergonomic improvement over what we had before. The extremely comprehensive lens package provides template Haskell functions to automatically write these lenses for us, but unfortunately the actual field names need to be different from the lens names to avoid conflict with the accessor functions. Typically, field names get prefixed with an underscore, and then the lens library strips those when writing the lens definitions.

Lenses also provide a solution to the problem we had with Purescript. For defining a lens it only really matters that one type is embedded in another, and not the exact structure of how that embedding happens. Therefore, we could have a lens library that produces lenses for newtyped records that unwraps and rewraps the values for us. The code would then look almost identical to the Haskell code above:

newtype A = A { fieldA :: B }
newtype B = B { fieldB :: C }
newtype C = C { fieldC :: Int }

-- fieldA :: Lens' A B
-- fieldB :: Lens' B C
-- fieldC :: Lens' C Int
-- view :: Lens' s a -> s -> a
-- set :: Lens' s a -> s -> a -> s

x = A { fieldA: B { fieldB: C { fieldC: 5 } } }
view (fieldA . fieldB . fieldC) x -- 5
set (fieldA . fieldB . fieldC) x 7
    -- A { fieldA: B { fieldB: C { fieldC: 7 } } }

This is still a little bit unsatisfying, because if we had multiple types with the same field name, we'd like to be able to overload the name with a lens that can take either and provides the appropriate types. One solution to this is the Field typeclass present in the record package. I won't get into the details of exactly why the typeclass is defined how it is, but it works quite well in practice.

Traversals

However, lenses only solve our issues for the named-tuple style of types. A lens represents a field that is always present, while in the multiple-constructor case sometimes fields are missing. In that case, the appropriate concept is the related idea of a traversal. I don't want to get bogged down in the details of what exactly is a traversal, but I do want to write the types so that you can see the similarity to lenses:

type Lens      s t a b = forall f. Functor     f => (a -> f b) -> s -> f t
type Traversal s t a b = forall f. Applicative f => (a -> f b) -> s -> f t

Where lenses correspond to a single copies of a type embedded in a larger type, traversals correspond to some number of embedded copies. It could be zero, one, two, five, or fifty, and you'd be able to define a traversal. In the case of multiple constructors with different fields, we only care about traversals with either zero or one embedded copy.

The record library's typeclass seemed to me to be a big step toward the ideal solution for records. It allows for the same field name to be used by multiple types while still providing a path for type inference, but unfortunately the original version of the type class assumes that the field will be represented by a bona fide lens. Since sometimes I really want a traversal, I wrote record-lens which adds a parameter which will always be either Functor or Applicative, and provides template Haskell that determines what the instance should be and writes it for you.

Now that GHC 8.0 has the DuplicateRecordFields extension, record-lens becomes quite useful. You can see it in action in the Splendor types. With DuplicateRecordFields and record-lens, we get almost all the way to where I want to be. Unfortunately, the real field names still need to be prefixed with an underscore, and the accessor functions are still there (although it's easy enough to never use them). These underscores quickly leak into other parts of the project if you want to use automatically generated instances for things like JSON encoding and decoding.

The ideal world

So where do we go from here? In my ideal world, lenses would be a fundamental part of Haskell and Purescript, rather than libraries. If lenses and traversals replaced accessor functions, then we wouldn't have any need for these functions that cause runtime errors, and updating nested records becomes painless.

If you want to allow the same name to refer to multiple lenses, such as for different record types with the same field name, there is always going to be some amount of complexity to allow for type checking to happen appropriately. I have not personally run into any situations where the typeclass in record-lens does not meet my requirements, but I would not be surprised if others find that it doesn't work for them.

As I've found, Haskell libraries can get really close to what I consider the ideal world right now, with the main remaining issue being that the field names need to be prefixed with an underscore. One simple way to get even closer would be to have a GHC extension that simply avoids defining the accessor functions. Since field names and values occur in entirely distinct places in Haskell syntax, there would be no ambiguity here, and I could use template Haskell to write the lenses and traversals that I really want.