At my last job, I wrote a lot of backend code and had plenty of time to think about storage APIs or “data layers.”
Commonly we think of the data layer in an application as a resource or object store, along with CRUD methods (Create, Read, Update, Delete) to interact with it. Unfortunately, this is one of those things that feels like a neat pattern, but isn’t. When you dig into it, a lot of unspecified details come to the surface:
- What are we creating? In some cases it is a whole object, but often there are parts that are added by the creation process, such as a DB-generated ID or a creation time.
- What does update really mean? Replacing an object wholesale is possible, but again, frequently we want to only update parts of an object (so the DB can reconcile non-conflicting changes), or provide conditions like “only update if it hasn’t been changed since I read it.”
- Should it be possible to update an object without starting with a retrieved copy? E.g. “set field F to value X” doesn’t require a read first.
- For reading, what is the input? We need our stored objects to have a unique derivable “key.” Frequently this is a DB-generated row number or random ID, but we can have unique keys be part of the data itself.
- What are the error conditions? What happens if we update an object that’s been deleted; should it be recreated? What happens if we create an object with a key that already exists? Or delete a key which doesn’t exist? Shall we have `mustExist`, `mustNotExist`, `allowMissing`, and/or `allowExists` modifiers? Shall we represent clashes in the same error channel as DB errors, or a different one?
- Maybe we shouldn’t have create or update, and instead have a “set”/“ensure”?
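To make the last two questions concrete, here is a minimal sketch of a single “set” operation with existence modifiers. Everything in it is an assumption made for illustration: the `Existence` enum, the `Clash` exception, and the use of a plain dict as a stand-in for the database. It picks one possible answer (clashes get their own exception type, separate from backend errors) rather than claiming that answer is the right one.

```python
from enum import Enum


class Existence(Enum):
    """Hypothetical modifier: what the caller expects about the key."""
    ANY = "any"                        # allowMissing and allowExists
    MUST_EXIST = "must_exist"          # update-like
    MUST_NOT_EXIST = "must_not_exist"  # create-like


class Clash(Exception):
    """The caller's expectation about the key was violated.

    Kept distinct from backend failures so the two can be handled separately.
    """


def set_value(store: dict, key: str, value: object,
              expect: Existence = Existence.ANY) -> None:
    """One "set" operation covering create, update, and upsert."""
    exists = key in store
    if expect is Existence.MUST_EXIST and not exists:
        raise Clash(f"{key!r} does not exist")
    if expect is Existence.MUST_NOT_EXIST and exists:
        raise Clash(f"{key!r} already exists")
    store[key] = value


store: dict = {}
set_value(store, "a", 1, Existence.MUST_NOT_EXIST)  # behaves like create
set_value(store, "a", 2, Existence.MUST_EXIST)      # behaves like update
set_value(store, "b", 3)                            # behaves like upsert
```

With this shape, “create” and “update” stop being separate methods and become two settings of the same call, which is one way to read the “set”/“ensure” idea.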
The idea of a clean separation between app layer and data layer starts to dissolve. For example, which layer should handle assigning creation times? Is there even a correct answer to that question? Different storage backends or databases might work so differently that trying to cover them in a single abstraction just doesn’t make much sense.
At my last job, I started to move towards having no layer separation at all, and just wrote the DB code directly in the app logic. I think for smaller applications this saves headaches compared to maintaining an arbitrary separation.
What if we tried to capture some of these details in the data API? What does a general algebra of CRUD actually look like? We might have something like this:
- Two types, `MyObject` and `StoredMyObject`, with `StoredMyObject` as a subtype of `MyObject`.
- A function `getKey` which returns a key type `Key`. Depending on the kind of data, `getKey` is defined either on `MyObject` or `StoredMyObject`; i.e. in general, stored objects always have keys, but only some kinds of non-stored objects have keys.
- `create: (MyObject) -> StoredMyObject`.
- `read: (Key) -> StoredMyObject`.
- `delete: (Key)`.
- Updating has various approaches:
  - `replace: (Key, MyObject) -> StoredMyObject`.
    - Sometimes this will be equivalent to delete-then-create, but not generally.
    - `(Key, MyObject)` could (should?) be just `(MyObject)` if `getKey` is defined on `MyObject`.
  - If we want patching, we need a type `MyObjectPatch` so we can define `patch: (Key, MyObjectPatch) -> StoredMyObject`.
    - We might also have some function `derive: (StoredMyObject, ...) -> MyObjectPatch` if we need an existing object to create patches. (`...` is the description for some kind of change operation.) The familiar case of retrieving an object, changing a field, then saving it is then actually something like this: `x = read(k)`, then `xp = derive(x, ...)`, then `x = patch(k, xp)`.
  - For update conditions, `replace` and `patch` might need to accept an extra `Condition` argument.
- We might also have a function `core: (StoredMyObject) -> MyObject` which returns the blueprint for recreating a stored object.
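The list above can be sketched roughly as follows, using an in-memory dict as a stand-in for a real database. The field names (`name`, `email`), the integer keys, the `Store` class, and the choice of a version counter as the `Condition` are all assumptions made for illustration; a real backend could model any of them differently. Here `getKey` is defined on `StoredMyObject` only, since keys are generated on creation.

```python
import itertools
from dataclasses import dataclass, replace as dc_replace
from typing import Dict, Optional

Key = int  # assumption: keys are store-generated integers


@dataclass(frozen=True)
class MyObject:
    name: str   # hypothetical fields, for illustration only
    email: str


@dataclass(frozen=True)
class StoredMyObject(MyObject):
    id: Key       # added by the creation process
    version: int  # bumped on every write, used as the update Condition


@dataclass(frozen=True)
class MyObjectPatch:
    name: Optional[str] = None   # None means "leave this field unchanged"
    email: Optional[str] = None


class ConditionFailed(Exception):
    """The object changed since the caller last read it."""


def get_key(obj: StoredMyObject) -> Key:
    return obj.id


def core(obj: StoredMyObject) -> MyObject:
    """Return the blueprint needed to recreate a stored object."""
    return MyObject(name=obj.name, email=obj.email)


def derive(obj: StoredMyObject, **changes: str) -> MyObjectPatch:
    """Build a patch holding only the fields that differ from obj."""
    return MyObjectPatch(**{f: v for f, v in changes.items() if getattr(obj, f) != v})


class Store:
    def __init__(self) -> None:
        self._rows: Dict[Key, StoredMyObject] = {}
        self._ids = itertools.count(1)

    def create(self, obj: MyObject) -> StoredMyObject:
        stored = StoredMyObject(name=obj.name, email=obj.email, id=next(self._ids), version=1)
        self._rows[stored.id] = stored
        return stored

    def read(self, key: Key) -> StoredMyObject:
        return self._rows[key]  # raises KeyError if the key is missing

    def delete(self, key: Key) -> None:
        del self._rows[key]  # raises KeyError if the key is missing

    def _checked(self, key: Key, if_version: Optional[int]) -> StoredMyObject:
        current = self.read(key)
        if if_version is not None and current.version != if_version:
            raise ConditionFailed(f"expected version {if_version}, found {current.version}")
        return current

    def replace(self, key: Key, obj: MyObject, if_version: Optional[int] = None) -> StoredMyObject:
        current = self._checked(key, if_version)
        stored = StoredMyObject(name=obj.name, email=obj.email, id=key, version=current.version + 1)
        self._rows[key] = stored
        return stored

    def patch(self, key: Key, p: MyObjectPatch, if_version: Optional[int] = None) -> StoredMyObject:
        current = self._checked(key, if_version)
        changes = {f: v for f, v in vars(p).items() if v is not None}
        stored = dc_replace(current, version=current.version + 1, **changes)
        self._rows[key] = stored
        return stored


store = Store()
k = get_key(store.create(MyObject(name="Ada", email="ada@example.com")))
x = store.read(k)
xp = derive(x, email="ada@example.org")
x = store.patch(k, xp, if_version=x.version)
```

The last three lines are the read, derive, patch sequence from the list. Passing `if_version` is one way to say “only update if it hasn’t been changed since I read it”; leaving it out gives an unconditional write.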
Note that a key-value store doesn’t fit the above schema. Key-value stores have a read which does `(Key) -> Value`; we would need to augment it so it looks like `read: (Key) -> (Key, Value)`.
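As a small illustration of that augmentation, a thin wrapper can pair each value with the key it was read under, so that the result behaves like a stored object that knows its own key. The `KeyedStore` name and the dict backend are hypothetical, chosen only to keep the sketch self-contained.

```python
from typing import Dict, Generic, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class KeyedStore(Generic[K, V]):
    """Thin wrapper over a dict-like key-value store (hypothetical name)."""

    def __init__(self, backend: Dict[K, V]) -> None:
        self._backend = backend

    def read(self, key: K) -> Tuple[K, V]:
        # A plain key-value read is (Key) -> Value. Returning the key as well
        # makes the result play the role of a stored object with a key.
        return key, self._backend[key]


def get_key(stored: Tuple[K, V]) -> K:
    """getKey for the augmented result: the key is the first element."""
    return stored[0]


kv = KeyedStore({"user:1": {"name": "Ada"}})
stored = kv.read("user:1")
```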
Essentially, this is an outline of how you might design a Data Mapper (versus an Active Record).
Overall, given how much I read about software, I was surprised to feel like I was figuring this out myself. Surely saving and loading state with databases is something the vast majority of software does? Yet I feel like I’ve heard more about how to avoid state, e.g. with functional programming, than about how to deal with state well. There is the great talk “Are We There Yet?” by Rich Hickey, which is about modelling change and time within a program. And the Ruby Object Mapper seems like a comprehensive set of data mapper building blocks. Is there any other material out there on this kind of thing?