23 February 2023

At my last job, I wrote a lot of backend code and had plenty of time to think about storage APIs or “data layers.”

Commonly we think of the data layer in an application as a resource or object store, along with CRUD methods (Create, Read, Update, Delete) to interact with it. Unfortunately, this is one of those things that feels like a neat pattern, but isn’t. When you dig into it, a lot of unspecified details come to the surface:

  • What are we creating? In some cases it is a whole object, but often there are parts that are added by the creation process, such as a DB-generated ID or creation time.
  • What does update really mean? Replacing an object wholesale is possible, but again, frequently we want to only update parts of an object (so the DB can reconcile non-conflicting changes), or provide conditions like “only update if it hasn’t been changed since I read it.”
    • Should it be possible to update an object without starting with a retrieved copy? E.g. “set field F to value X” doesn’t require a read first.
  • For reading, what is the input? We need our stored objects to have a unique derivable “key.” Frequently this is a DB-generated row number or random ID, but we can have unique keys be part of the data itself.
  • What are the error conditions? What happens if we update an object that’s been deleted; should it be recreated? What happens if we create an object with a key that already exists? Or delete a key which doesn’t exist? Shall we have mustExist, mustNotExist, allowMissing, and/or allowExists modifiers? Shall we represent clashes in the same error channel as DB errors, or a different one?
  • Maybe we shouldn’t have create or update, and instead have a “set”/“ensure”?
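
To make the questions about error conditions concrete, here is a minimal sketch of one possible answer, using an in-memory dict in place of a real database. The names Expect, Clash, MemoryStore, put and allow_missing are all hypothetical, invented for this illustration. It folds create and update into a single "set" with an optional precondition, and reports clashes through their own exception type rather than mixing them with backend failures.

```python
from enum import Enum


class Expect(Enum):
    """Hypothetical precondition on whether the key is already present."""
    ANY = "any"                        # "set"/"ensure": write regardless
    MUST_EXIST = "must_exist"          # behaves like a classic update
    MUST_NOT_EXIST = "must_not_exist"  # behaves like a classic create


class Clash(Exception):
    """A precondition failed. Deliberately not a backend/DB error."""


class MemoryStore:
    """Toy stand-in for a database: a dict from key to value."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value, expect=Expect.ANY):
        exists = key in self._rows
        if expect is Expect.MUST_EXIST and not exists:
            raise Clash(f"{key!r} does not exist")
        if expect is Expect.MUST_NOT_EXIST and exists:
            raise Clash(f"{key!r} already exists")
        self._rows[key] = value

    def delete(self, key, allow_missing=False):
        if key not in self._rows and not allow_missing:
            raise Clash(f"{key!r} does not exist")
        # pop with a default so the allow_missing case is a no-op
        self._rows.pop(key, None)
```

With this shape, "update an object that's been deleted" is a Clash under MUST_EXIST and a silent recreate under ANY, so the caller picks the behaviour instead of the store guessing.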

The idea of a clean separation between app layer and data layer starts to dissolve. For example, which layer should handle assigning creation times? Is there even a correct answer to that question? Different storage backends or databases might work so differently that trying to cover them in a single abstraction just doesn’t make much sense.

At my last job, I started to move towards having no layer separation at all, and just wrote the DB code directly in the app logic. I think for smaller applications this saves headaches compared to maintaining an arbitrary separation.

What if we tried to capture some of these details in the data API? What does a general algebra of CRUD actually look like? We might have something like this:

  • Two types, MyObject and StoredMyObject, with StoredMyObject as a subtype of MyObject.
  • A function getKey which returns a key type Key. Depending on the kind of data, getKey is defined either on MyObject or StoredMyObject; i.e. in general, stored objects always have keys, but only some kinds of non-stored objects have keys.
  • create: (MyObject) -> StoredMyObject.
  • read: (Key) -> StoredMyObject.
  • delete: (Key).
  • Updating has various approaches:
    • replace: (Key, MyObject) -> StoredMyObject.
      • Sometimes this will be equivalent to delete-then-create, but not generally.
      • (Key, MyObject) could (should?) be just (MyObject) if getKey is defined on MyObject.
    • If we want patching, we need a type MyObjectPatch so we can define patch: (Key, MyObjectPatch) -> StoredMyObject.
      • We might also have some function derive: (StoredMyObject, ...) -> MyObjectPatch if we need an existing object to create patches. (... is the description for some kind of change operation.) The familiar case of retrieving an object, changing a field, then saving it is then actually something like this:

        x = read(k)
        xp = derive(x, ...)
        x = patch(k, xp)
    • For update conditions, replace and patch might need to accept an extra Condition argument.
  • We might also have a function core: (StoredMyObject) -> MyObject which returns the blueprint for recreating a stored object.
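
The outline above can be sketched in code. This is only one possible reading, with several assumptions of mine: keys are DB-generated integers (so getKey is defined on StoredMyObject only), a patch is a plain dict of field names to new values, and the Condition is an expected version number. The fields name, email and version, the if_version parameter, and the ConditionFailed exception are all invented for the example, and a dict stands in for the database.

```python
import itertools
from dataclasses import dataclass, replace as dc_replace
from datetime import datetime, timezone
from typing import Optional

Key = int              # assumption: DB-generated integer keys
MyObjectPatch = dict   # assumption: field name -> new value


@dataclass(frozen=True)
class MyObject:
    name: str
    email: str


@dataclass(frozen=True)
class StoredMyObject(MyObject):  # subtype: adds what the store generates
    id: Key
    created_at: datetime
    version: int                 # bumped on every write


def get_key(obj: StoredMyObject) -> Key:
    return obj.id


def core(obj: StoredMyObject) -> MyObject:
    """Blueprint for recreating a stored object."""
    return MyObject(name=obj.name, email=obj.email)


class ConditionFailed(Exception):
    pass


class Store:
    def __init__(self):
        self._rows = {}
        self._ids = itertools.count(1)

    def create(self, obj: MyObject) -> StoredMyObject:
        stored = StoredMyObject(
            name=obj.name, email=obj.email, id=next(self._ids),
            created_at=datetime.now(timezone.utc), version=1)
        self._rows[stored.id] = stored
        return stored

    def read(self, key: Key) -> StoredMyObject:
        return self._rows[key]

    def delete(self, key: Key) -> None:
        del self._rows[key]

    def replace(self, key: Key, obj: MyObject,
                if_version: Optional[int] = None) -> StoredMyObject:
        # Keeps id and created_at, so this is not delete-then-create.
        changes = {"name": obj.name, "email": obj.email}
        return self._write(key, changes, if_version)

    def patch(self, key: Key, changes: MyObjectPatch,
              if_version: Optional[int] = None) -> StoredMyObject:
        return self._write(key, changes, if_version)

    def _write(self, key, changes, if_version):
        current = self._rows[key]
        if if_version is not None and current.version != if_version:
            raise ConditionFailed(f"expected version {if_version}")
        updated = dc_replace(current, **changes, version=current.version + 1)
        self._rows[key] = updated
        return updated
```

Here "only update if it hasn't been changed since I read it" becomes patch(k, changes, if_version=x.version), where x is the copy that was read earlier.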

Note that a key-value store doesn’t fit the above schema. Key-value stores have a read which does (Key) -> Value; we would need to augment it so it looks like read: (Key) -> (Key, Value).
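
As a small illustration of that augmentation, here is a sketch of a wrapper around any mapping whose read hands back the key together with the value. KeyedStore is a hypothetical name, and a plain dict stands in for the key-value store.

```python
from typing import Generic, Hashable, Mapping, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class KeyedStore(Generic[K, V]):
    """Wraps a plain mapping so that read returns (key, value)."""

    def __init__(self, backing: Mapping[K, V]):
        self._backing = backing

    def read(self, key: K) -> Tuple[K, V]:
        # The pair plays the role of the stored object: it carries its key.
        return key, self._backing[key]


def get_key(stored: Tuple[K, V]) -> K:
    return stored[0]
```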

Essentially, this is an outline of how you might design a Data Mapper (versus an Active Record).

Overall, given how much I read about software, I was surprised to feel like I was figuring this out myself. Surely saving and loading state with databases is something the vast majority of software does? Yet I feel like I’ve heard more about how to avoid state with e.g. functional programming than I have heard how to deal with state well. There is the great talk Are We There Yet by Rich Hickey, which is about modelling change and time within a program. And the Ruby Object Mapper seems like a comprehensive set of data mapper building blocks. Is there any other material out there on this kind of thing?