CRUD is fake; or, what should a data mapper look like?

23 February 2023

At my last job, I wrote a lot of backend code and had plenty of time to think about storage APIs or “data layers.”

Commonly we think of the data layer in an application as a resource or object store, along with CRUD methods (Create, Read, Update, Delete) to interact with it. Unfortunately, this is one of those things that feels like a neat pattern, but isn’t. When you dig into it, a lot of unspecified details come to the surface:

What are we creating? In some cases it is a whole object, but often there are parts that are added by the creation process, such as a DB-generated ID or creation time.
What does update really mean? Replacing an object wholesale is possible, but again, frequently we want to only update parts of an object (so the DB can reconcile non-conflicting changes), or provide conditions like “only update if it hasn’t been changed since I read it.”
- Should it be possible to update an object without starting with a retrieved copy? E.g. “set field F to value X” doesn’t require a read first.
For reading, what is the input? We need our stored objects to have a unique derivable “key.” Frequently this is a DB-generated row number or random ID, but we can have unique keys be part of the data itself.
What are the error conditions? What happens if we update an object that’s been deleted; should it be recreated? What happens if we create an object with a key that already exists? Or delete a key which doesn’t exist? Shall we have mustExist, mustNotExist, allowMissing, and/or allowExists modifiers? Shall we represent clashes in the same error channel as DB errors, or a different one?
Maybe we shouldn’t have create or update, and instead have a “set”/“ensure”?
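To make the last two questions concrete, here is a minimal sketch of a single "set" operation with existence preconditions, using an in-memory dict in place of a real database. The names Existence, Clash, Store and allow_missing are hypothetical, invented for this illustration. Clashes get their own exception type, which is one possible answer to the "same error channel or a different one?" question, not the only one.

```python
from enum import Enum


class Existence(Enum):
    """Hypothetical precondition on whether the key is already stored."""
    ANY = "any"                        # plain "set": create or overwrite
    MUST_EXIST = "must_exist"          # behaves like a classic "update"
    MUST_NOT_EXIST = "must_not_exist"  # behaves like a classic "create"


class Clash(Exception):
    """A precondition failed; kept separate from backend/DB errors."""


class Store:
    def __init__(self):
        self._rows = {}  # in-memory stand-in for a database table

    def set(self, key, value, existence=Existence.ANY):
        present = key in self._rows
        if existence is Existence.MUST_EXIST and not present:
            raise Clash(f"{key!r} does not exist")
        if existence is Existence.MUST_NOT_EXIST and present:
            raise Clash(f"{key!r} already exists")
        self._rows[key] = value

    def delete(self, key, allow_missing=False):
        if key not in self._rows and not allow_missing:
            raise Clash(f"{key!r} does not exist")
        self._rows.pop(key, None)


store = Store()
store.set("a", 1, Existence.MUST_NOT_EXIST)  # acts as create
store.set("a", 2, Existence.MUST_EXIST)      # acts as update
store.set("b", 3)                            # acts as set/ensure
store.delete("missing", allow_missing=True)  # silently does nothing
```

With this shape, create and update become special cases of set, and the caller has to state which failure modes they care about.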

The idea of a clean separation between app layer and data layer starts to dissolve. For example, which layer should handle assigning creation times? Is there even a correct answer to that question? Different storage backends or databases might work so differently that trying to cover them in a single abstraction just doesn’t make much sense.

At my last job, I started to move towards having no layer separation at all, and just wrote the DB code directly in the app logic. I think for smaller applications this saves headaches compared to maintaining an arbitrary separation.

What if we tried to capture some of these details in the data API? What does a general algebra of CRUD actually look like? We might have something like this:

Two types, MyObject and StoredMyObject, with StoredMyObject as a subtype of MyObject.
A function getKey which returns a key type Key. Depending on the kind of data, getKey is defined either on MyObject or StoredMyObject; i.e. in general, stored objects always have keys, but only some kinds of non-stored objects have keys.
create: (MyObject) -> StoredMyObject.
read: (Key) -> StoredMyObject.
delete: (Key).
Updating has various approaches:
- replace: (Key, MyObject) -> StoredMyObject.
  - Sometimes this will be equivalent to delete-then-create, but not generally.
  - (Key, MyObject) could (should?) be just (MyObject) if getKey is defined on MyObject.
- If we want patching, we need a type MyObjectPatch so we can define patch: (Key, MyObjectPatch) -> StoredMyObject.
  - We might also have some function derive: (StoredMyObject, ...) -> MyObjectPatch if we need an existing object to create patches. (... is the description for some kind of change operation.) The familiar case of retrieving an object, changing a field, then saving it is then actually something like this:
    x = read(k)
    xp = derive(x, ...)
    x = patch(k, xp)
- For update conditions, replace and patch might need to accept an extra Condition argument.
We might also have a function core: (StoredMyObject) -> MyObject which returns the blueprint for recreating a stored object.
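The outline above can be sketched in Python roughly as follows. This is one possible reading under stated assumptions, not a definitive design: the fields name and colour, the integer id key, the version counter used as the Condition, and the in-memory dict backend are all invented for the example.

```python
import itertools
from dataclasses import dataclass, replace as dc_replace
from typing import Optional


@dataclass(frozen=True)
class MyObject:
    name: str
    colour: str


@dataclass(frozen=True)
class StoredMyObject(MyObject):  # subtype: adds what the store assigns
    id: int
    version: int  # assumption: a counter bumped on every write


@dataclass(frozen=True)
class MyObjectPatch:
    # None means "leave unchanged", so this sketch cannot patch a field to None.
    name: Optional[str] = None
    colour: Optional[str] = None


class Conflict(Exception):
    """The Condition (here: an expected version) did not hold."""


def get_key(obj: StoredMyObject) -> int:
    return obj.id


def core(obj: StoredMyObject) -> MyObject:
    return MyObject(obj.name, obj.colour)


def derive(obj: StoredMyObject, **wanted) -> MyObjectPatch:
    # Keep only the fields whose desired value differs from the stored one.
    return MyObjectPatch(**{f: v for f, v in wanted.items() if getattr(obj, f) != v})


class MyObjectStore:
    def __init__(self):
        self._rows = {}
        self._ids = itertools.count(1)

    def _current(self, key, if_version):
        current = self._rows[key]  # KeyError if the key is missing
        if if_version is not None and current.version != if_version:
            raise Conflict(f"expected version {if_version}, found {current.version}")
        return current

    def create(self, obj: MyObject) -> StoredMyObject:
        stored = StoredMyObject(obj.name, obj.colour, id=next(self._ids), version=1)
        self._rows[stored.id] = stored
        return stored

    def read(self, key: int) -> StoredMyObject:
        return self._rows[key]

    def delete(self, key: int) -> None:
        del self._rows[key]

    def replace(self, key, obj, if_version=None):
        current = self._current(key, if_version)
        stored = StoredMyObject(obj.name, obj.colour, id=key, version=current.version + 1)
        self._rows[key] = stored
        return stored

    def patch(self, key, p, if_version=None):
        current = self._current(key, if_version)
        changes = {f: v for f, v in vars(p).items() if v is not None}
        stored = dc_replace(current, version=current.version + 1, **changes)
        self._rows[key] = stored
        return stored


# Read, derive a patch, then save it only if nobody wrote in between.
store = MyObjectStore()
k = get_key(store.create(MyObject("widget", "red")))
x = store.read(k)
xp = derive(x, colour="blue")
x = store.patch(k, xp, if_version=x.version)
```

Here getKey is defined only on StoredMyObject because the key is store-assigned. If the key were part of the data itself, get_key could move to MyObject and replace could drop its key argument.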

Note that a key-value store doesn’t fit the above schema. Key-value stores have a read which does (Key) -> Value; we would need to augment it so it looks like read: (Key) -> (Key, Value).
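As a small illustration of that augmentation, here is a sketch of a wrapper that turns a plain (Key) -> Value lookup into (Key) -> (Key, Value). The class name KeyedReader is hypothetical, and a Python dict stands in for the key-value store.

```python
from typing import Generic, Hashable, Mapping, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class KeyedReader(Generic[K, V]):
    """Wraps a key-value mapping so reads return the key with the value."""

    def __init__(self, kv: Mapping[K, V]):
        self._kv = kv

    def read(self, key: K) -> Tuple[K, V]:
        # The (key, value) pair plays the role of the stored object:
        # the key is derivable from what read returns.
        return key, self._kv[key]


reader = KeyedReader({"user:1": "alice"})
stored = reader.read("user:1")
```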

Essentially, this is an outline of how you might design a Data Mapper (versus an Active Record).

Overall, given how much I read about software, I was surprised to feel like I was figuring this out myself. Surely saving and loading state with databases is something the vast majority of software does? Yet I feel like I’ve heard more about how to avoid state with e.g. functional programming than I have heard about how to deal with state well. There is the great talk “Are We There Yet?” by Rich Hickey, which is about modelling change and time within a program. And the Ruby Object Mapper seems like a comprehensive set of data mapper building blocks. Is there any other material out there on this kind of thing?