This is the story of my way of describing database access in an object-oriented way. It is quite straightforward, but it matches poorly with the articles I have read. This document is mainly meant to order my thoughts on the subject. It is also meant to start some discussions.
Let's start simple. There's only one table, and we only read from it. There are a lot of tables like this in the real world, like the employees table of a company. Our application uses the central employees table, but is not allowed to modify it.
My first thought on building an object model for this is creating two classes: a clsEmployees collection and a clsEmployee object. The clsEmployees collection does the database access and is a factory that generates clsEmployee objects.
Our application asks the clsEmployees collection for a clsEmployee object without worrying how the collection got it in the first place. That's the collection's own responsibility. It is tempting to pass an id to clsEmployee's constructor, but then clsEmployee would have to ask clsEmployees for its own data. This would mean that clsEmployee has to know clsEmployees somehow, which is not necessary and therefore not preferable. In languages that do not require you to handle the memory allocation of objects yourself, be warned that such bidirectional links must be broken before your application can shut down.
With less trouble, clsEmployees can pass clsEmployee's constructor its data, or an object containing its data, like the Fields collection of a Recordset.
It would look like this in UML. The constructors are called "construct" in the images. The collection has a constructor as well, so that it can be tested with mock data in a temporary database. It is normally constructed at the start of the application.
The item method is a factory method that looks up the data in the DatabaseTable and returns a brand new clsEmployee object. I would like to limit the scope of clsEmployee's constructor to the collection only, but as far as I know only Eiffel gives me that possibility.
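The classes so far can be sketched in Python as follows. The names and the dict standing in for the database table are hypothetical; the point is that the collection owns the table access, and its item factory method hands the member's constructor its data, so clsEmployee never needs to know clsEmployees.

```python
class clsEmployee:
    def __init__(self, fields):
        # Constructed from its data, not from an id.
        self._fields = dict(fields)

    def id(self):
        return self._fields["id"]

    def name(self):
        return self._fields["name"]


class clsEmployees:
    def __init__(self, table):
        # The constructor takes the table, so the collection can be
        # tested with mock data in a temporary database.
        self._table = table

    def item(self, employee_id):
        # Factory method: look the record up, build a brand new object.
        return clsEmployee(self._table[employee_id])


employees = clsEmployees({1: {"id": 1, "name": "Alice"}})
assert employees.item(1).name() == "Alice"
```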
Because generating a clsEmployee object is the responsibility of the clsEmployees collection alone, we can introduce an optimization: lazy computation. Lazy computation means that something is not computed until you ask for it and, if possible, only once. This means that the collection stores the generated objects internally to save database access time.
The item method is still a factory method, but instead of creating an object right away, it first checks if the object is already present in the storage. If it is found, the existing object is returned. If not, a new object is created from the database, stored in the collection's own storage and passed as the return value.
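A minimal sketch of such a lazy item method, again with hypothetical names and a dict standing in for the database: the collection keeps the objects it has built in its own storage and only touches the table on the first request for a given id.

```python
class clsEmployee:
    def __init__(self, fields):
        self._fields = dict(fields)

    def name(self):
        return self._fields["name"]


class clsEmployees:
    def __init__(self, table):
        self._table = table
        self._storage = {}   # cache of already generated members

    def item(self, employee_id):
        # Lazy factory: return the cached object if we built it before.
        if employee_id not in self._storage:
            self._storage[employee_id] = clsEmployee(self._table[employee_id])
        return self._storage[employee_id]


employees = clsEmployees({1: {"id": 1, "name": "Alice"}})
# Two requests for the same id yield the very same object.
assert employees.item(1) is employees.item(1)
```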
This approach has a few advantages and drawbacks. First, the object will never change after creation, whatever happens in the database. For a central employee table, this is not necessarily a drawback.
An advantage is that two references to the same employee actually refer to the same clsEmployee object. A disadvantage is that the system's memory might fill up with objects that will never be used again.
It is up to the developer to judge how often the objects will be used and if cleanup actions are necessary.
This was all fairly straightforward. Note that I introduced two classes for my table: a collection class and a member class. The database handling is the responsibility of the collection, not of the member objects. This is something I rarely encounter in examples found on the net. In those examples, objects are often responsible for all their data traffic. As a "bridge" to the next sections, I'll give a few other reasons why a member object is not necessarily responsible for database access.
First, an object can never create itself. Object creation is something that is started before the object itself exists. Like, no robot can build itself. It can build copies of itself, or repair itself, but not build itself. And it is not something I would want either. If the object is stored in a database somehow, I would like to create the object from the database. So I would ask the database (represented by the collection class) to create the object for me.
Object deletion follows the same story. I cannot give an object the command to "get lost" if I'm holding it. And I am holding it when I invoke a method. I should tell someone else to forget about it: the collection class.
There is only one thing that is the responsibility of the object itself, and that is its own data. Now there's a tricky one. The object's own data is linked to the database, isn't it? So the object is responsible for the data in the database! Well, it is responsible for the actual data, but not for the database handling. That is still the responsibility of the collection. So if an object gets modified by the application and wants its database representation to be updated, it should ask its collection (which represents the database) to do the update. This means that objects that can be modified should know their "parent", the collection.
Let's consider another example. We have a clsIdeas collection, containing clsIdea objects. Although there is an employeeId field in the database, we will not consider any link with the employees table for now; we will look into that later. To create a clsIdea object, we ask the clsIdeas collection to create one for us, by invoking the add method of the clsIdeas collection. Note that the ID of the idea is generated by either the database or the collection.
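A sketch of such an add method, assuming (as the text allows) that the collection rather than the database generates the id. The names and the dict-based table are hypothetical.

```python
class clsIdea:
    def __init__(self, fields):
        self._fields = dict(fields)

    def id(self):
        return self._fields["id"]

    def text(self):
        return self._fields["text"]


class clsIdeas:
    def __init__(self, table):
        self._table = table          # stand-in for the database table
        self._storage = {}           # lazy cache of member objects
        self._next_id = max(table, default=0) + 1

    def add(self, text, employee_id):
        # The collection generates the id and writes the record first,
        # then builds the member object from that record.
        idea_id = self._next_id
        self._next_id += 1
        record = {"id": idea_id, "text": text, "employeeId": employee_id}
        self._table[idea_id] = record
        self._storage[idea_id] = clsIdea(record)
        return self._storage[idea_id]


ideas = clsIdeas({})
idea = ideas.add("Free coffee", employee_id=1)
assert idea.id() == 1
```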
Another approach is seen in Microsoft's Data Access Objects library: some objects are created by a create method of a collection, but they are not added to that collection immediately. You have to use the collection's add method to make the created object persistent. An advantage of this is that you can store a "finished" object: you can set properties and fill collections contained by the object (for example, a TableDef object contains a Fields and an Indexes collection) before you add it to the database. However, each member object has to know its parent collection and whether it was already stored or not.
Deleting seems even simpler than adding: just let the collection provide a delete method. This method then deletes the record from the database and from the collection's own storage. Once the collection's delete method has been called, no member object encapsulating the same record can be created anymore.
There is one little problem, however. When the collection removes the object reference from its own storage, other pieces of the application might still have a reference to it. The object is then removed from the collection, but it still exists in other parts of the application! These references now point to an object that is not deleted, but obsolete. So, before the object is deleted by its collection, it should be marked as obsolete for whoever else is holding a reference to it. There are three ways to let the other parts of the application know that the object is no longer valid:
- raise an "onDeletion" event,
- implement an isDeleted method to make a check possible,
- throw an exception or raise an error when a method is called after the object is marked obsolete (but not the id method and the isDeleted method, as you still need those).
Combining them is even better: anyone listening to events is informed immediately, prudent code can check first and the rest will have to learn "the hard way."
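The combination can be sketched as follows. A list of callbacks stands in for the onDeletion event, and all names are hypothetical.

```python
class DeletedObjectError(RuntimeError):
    pass


class clsEmployee:
    def __init__(self, fields):
        self._fields = dict(fields)
        self._deleted = False
        self._on_deletion = []       # listeners for the onDeletion event

    def subscribe_on_deletion(self, callback):
        self._on_deletion.append(callback)

    def mark_deleted(self):
        self._deleted = True
        for callback in self._on_deletion:
            callback(self)

    def is_deleted(self):
        return self._deleted         # always allowed, like id()

    def id(self):
        return self._fields["id"]    # also still allowed

    def name(self):
        if self._deleted:            # "the hard way"
            raise DeletedObjectError("employee was deleted")
        return self._fields["name"]


class clsEmployees:
    def __init__(self, table):
        self._table = table
        self._storage = {}

    def item(self, employee_id):
        if employee_id not in self._storage:
            self._storage[employee_id] = clsEmployee(self._table[employee_id])
        return self._storage[employee_id]

    def delete(self, employee_id):
        # Mark the member obsolete before forgetting about it.
        if employee_id in self._storage:
            self._storage.pop(employee_id).mark_deleted()
        del self._table[employee_id]


notified = []
employees = clsEmployees({1: {"id": 1, "name": "Alice"}})
alice = employees.item(1)
alice.subscribe_on_deletion(lambda obj: notified.append(obj.id()))
employees.delete(1)
assert alice.is_deleted() and notified == [1]
```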
The onDeletion event can be implemented on both the collection and the member classes: objects holding a single reference receive the event from the member object, while objects with a collection of deletable member objects probably know their collection.
This whole business of deletion events only arises if you have references outside of the table wrappers (the lazy collection classes). The need to raise events can in many cases be overcome by using partial collections, which we will discuss later.
As mentioned before, a member object must know its parent collection to be able to update its properties in the database. When an update is necessary, the member object will ask its collection to update the database. But when is it necessary? You probably don't want an update each time a property (one database field) is changed. If your application is guaranteed to be the only one accessing the database record, it might be sufficient to do all updates at the end of your application session. This is probably not the case if your database has referential integrity checks and your application uses more related tables.
All in all, you probably cannot avoid implementing an update method on the member object to trigger the update manually. Even if your update is triggered "manually", it is good to check whether an update is really necessary. It is not necessary merely because a property was set, but only if the value has actually changed. In that case, a private isDirty or hasChanged property is set. Your update method checks this property and only does the update when it is true.
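The bookkeeping can be sketched like this, with a hypothetical collection standing in for the database. The setter only marks the object dirty when the value really differs, and update() only writes when the flag is set.

```python
class clsEmployees:
    def __init__(self, table):
        self._table = table
        self.writes = 0              # for illustration: count the updates

    def update(self, fields):
        self._table[fields["id"]] = dict(fields)
        self.writes += 1


class clsEmployee:
    def __init__(self, fields, parent):
        self._fields = dict(fields)
        self._parent = parent        # the collection, for updates
        self._has_changed = False

    def set_name(self, name):
        if name != self._fields["name"]:
            self._fields["name"] = name
            self._has_changed = True

    def update(self):
        # Only hit the database when something actually changed.
        if self._has_changed:
            self._parent.update(self._fields)
            self._has_changed = False


table = {1: {"id": 1, "name": "Alice"}}
employees = clsEmployees(table)
employee = clsEmployee(table[1], employees)
employee.set_name("Alice")   # same value: nothing is marked dirty
employee.update()
assert employees.writes == 0
employee.set_name("Bob")
employee.update()
assert employees.writes == 1
```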
In a relational database, relations between tables are enforced with keys. The object ID that we encountered earlier for identifying the object is often a primary key. If records from other tables have a relation to this table, those records have a field containing the primary key value of the related record in the first table.
In our object model, we now have a choice: our methods can return an object, or just its id. The id is like a voucher: it's worthless in itself, but you can always exchange it for something real. Using just the ID instead of the object itself is called a "weak reference"; passing objects instead of just IDs is called a "strong reference".
Weak references are simple to program, but are not as clear as strong references. The drawback of using strong references is that they create more references in your object model. The main advantage is a better encapsulation of the database. More on this subject is discussed in the section about layers in an application.
From the above database schema, you might derive a clsIdea object like the one on the left (weak reference), or the one on the right. I deliberately do not show what exactly is stored in the object. If you use strong references, you can store the ID internally, or the object. You can store the ID internally only if your object knows how to reach the clsEmployees collection.
Please note that storing the object does not mean that you use more memory. If you store an object, you generally only store a pointer to a memory location. Especially if you use lazy collections, every reference to a particular object actually refers to the same object. The references themselves do not consume much memory.
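Both choices can be sketched side by side. Note that "weak reference" is used here in the article's sense (an id used as a voucher), not in the sense of Python's weakref module; all names are hypothetical.

```python
class clsEmployee:
    def __init__(self, fields):
        self._fields = dict(fields)

    def name(self):
        return self._fields["name"]


class clsEmployees:
    def __init__(self, table):
        self._table = table
        self._storage = {}

    def item(self, employee_id):
        if employee_id not in self._storage:
            self._storage[employee_id] = clsEmployee(self._table[employee_id])
        return self._storage[employee_id]


class clsIdeaWeak:
    """Stores only the id; must know how to reach the employees collection."""

    def __init__(self, employee_id, employees):
        self._employee_id = employee_id
        self._employees = employees

    def employee(self):
        # The voucher is exchanged for the real object here.
        return self._employees.item(self._employee_id)


class clsIdeaStrong:
    """Stores the object itself; really just another pointer to it."""

    def __init__(self, employee):
        self._employee = employee

    def employee(self):
        return self._employee


employees = clsEmployees({1: {"id": 1, "name": "Alice"}})
weak = clsIdeaWeak(1, employees)
strong = clsIdeaStrong(employees.item(1))
# Thanks to the lazy collection, both resolve to the same object.
assert weak.employee() is strong.employee()
```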
(or: member objects containing member objects from other collections)
Sometimes an object has a collection as one of its members. Our clsEmployee object, for instance, might have an ideas method, returning a collection of ideas for that one employee. That method cannot return a clsIdeas collection, as that is the collection holding all ideas of all employees, not just this one's.
This means we need an intermediate class, say clsIdeasForEmployee. The intermediate class has an item method that takes two parameters: the employee ID and the idea number. It also needs a countForEmployee method, so we can determine how many ideas there are for a specific employee. These "subcollections" often work on an iteration basis: we know which employee to look for in the clsEmployees collection, but we don't know beforehand which ideas he or she has.
This implies we have to change the lazy behaviour of the clsIdeas collection. Upon a request for data about an employee, we now need to read the entire group of ideas for that employee; otherwise it is not possible to calculate the count. In fact, the count will generally be requested first, to see if there is any data at all.
For the intermediate class, clsIdeasForEmployee, you have a choice: either delegate all calls to item and count to clsIdeas.item and clsIdeas.countForEmployee, or store the subcollection in the clsIdeasForEmployee object itself, without further delegation once the subcollection has been read from the database. If ideas can be added to an employee, delegation is probably the easiest way.
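The delegating variant can be sketched as follows: clsIdeasForEmployee forwards everything to clsIdeas, which reads a whole group of ideas per employee at once, as described above. Names and the dict table are hypothetical.

```python
class clsIdea:
    def __init__(self, fields):
        self._fields = dict(fields)

    def text(self):
        return self._fields["text"]


class clsIdeas:
    def __init__(self, table):
        self._table = table               # records keyed by idea id
        self._groups = {}                 # employee id -> list of clsIdea

    def _load_group(self, employee_id):
        # Still lazy, but per employee: read the whole group in one go.
        if employee_id not in self._groups:
            self._groups[employee_id] = [
                clsIdea(record) for record in self._table.values()
                if record["employeeId"] == employee_id
            ]
        return self._groups[employee_id]

    def countForEmployee(self, employee_id):
        return len(self._load_group(employee_id))

    def item(self, employee_id, idea_number):
        return self._load_group(employee_id)[idea_number]


class clsIdeasForEmployee:
    def __init__(self, employee_id, ideas):
        self._employee_id = employee_id
        self._ideas = ideas               # delegate everything to this

    def count(self):
        return self._ideas.countForEmployee(self._employee_id)

    def item(self, idea_number):
        return self._ideas.item(self._employee_id, idea_number)


table = {
    1: {"id": 1, "employeeId": 1, "text": "Free coffee"},
    2: {"id": 2, "employeeId": 1, "text": "Casual Fridays"},
    3: {"id": 3, "employeeId": 2, "text": "Standing desks"},
}
mine = clsIdeasForEmployee(1, clsIdeas(table))
assert mine.count() == 2
```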
Looking at the UML scheme, you can see that the clsEmployees collection has to know the clsIdeas collection in order to pass it to the clsEmployee objects, which use it to create a clsIdeasForEmployee subcollection. In general, collections of "containing" objects should know all collections of the "contained" objects. Some constructors can get quite a large number of parameters this way. Later on in this document, we will see another way of implementing subcollections.
Until now, our little application was the only entity that modified or retrieved data in the database. But other applications (or even other instances of the same application) could do this just as well. Here we have a problem: the database generally cannot give a signal that the data we hold has changed, so we have to look for changes ourselves. In general, this calls for a refresh method in our member object. This method asks its collection to supply the current state of the object in the database. In addition, we need a refreshment strategy. In other words, we need to consider when to call the refresh method:
- automatically when a getter is called,
- manually,
- after a time-out
The first option seems ideal, but has a few drawbacks. First, it consults the database each time a field is queried, which happens more often than a record being queried as a whole. It also opens the possibility that the object changes while its properties are being queried. It then looks like you queried one object state, but you have in fact queried several states that can combine into an inconsistent state (a set of property values the object never had and never could have had).
To overcome this, you could make the refresh method public and call it whenever you want to start reading the state. You then know that the object remains consistent between two refresh calls.
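A sketch of that manual strategy: the getters read from a private snapshot that only changes when refresh() is called. The dict stands in for the shared database record (which, in the full design, would be reached via the collection).

```python
class clsEmployee:
    def __init__(self, employee_id, table):
        self._id = employee_id
        self._table = table          # stand-in for asking the collection
        self.refresh()

    def refresh(self):
        # Take one consistent snapshot of the record.
        self._fields = dict(self._table[self._id])

    def name(self):
        return self._fields["name"]


table = {1: {"id": 1, "name": "Alice"}}
employee = clsEmployee(1, table)
table[1]["name"] = "Bob"             # another application changes the record
assert employee.name() == "Alice"    # snapshot stays consistent
employee.refresh()
assert employee.name() == "Bob"
```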
And I wouldn't be me if I didn't try to find some in-between strategy to get some of both. The last option is something that I considered for an application once, but probably is the worst of the three. The idea was that a user could never accomplish the querying of the object in two states within, say, half a second. In my case, the user had to scan the same barcode two times and press a button in between. I was quite certain I could do the barcode check well within half a second while the time between two barcode checks was almost guaranteed to be more than half a second. The object would then automatically refresh when I called a getter and the last refresh was more than half a second ago. If you ever want to implement this, think of what can go wrong when the timing is bad...
If there is no cross-over, that is, if no member object is part of more than one collection, you can use partial collections. You then have a separate storage for each collection, but the building mechanism can be shared. The advantage is that looking up members is now faster, as the storages themselves are smaller (your data is better organized). On the other hand, the application gets slightly bigger. One of the beauties of this mechanism is the separation of the generation of the members from their storage (which is still lazy).
So in our example, each clsEmployee object now has a partial clsIdeasForEmployee collection, which stores the clsIdea objects in its own storage, but gets them from a central clsIdeaMaker object. It is the clsIdeaMaker that converts the data in the database to an actual object.
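This split can be sketched as follows: every partial collection keeps its own lazy storage, while one shared clsIdeaMaker is the only place that turns records into objects. The names follow the text; the dict database and the helper ids_for_employee are assumptions of this example.

```python
class clsIdea:
    def __init__(self, fields):
        self._fields = dict(fields)

    def text(self):
        return self._fields["text"]


class clsIdeaMaker:
    def __init__(self, table):
        self._table = table
        self.built = 0                # for illustration: objects built

    def make(self, idea_id):
        # The one shared place where records become objects.
        self.built += 1
        return clsIdea(self._table[idea_id])

    def ids_for_employee(self, employee_id):
        return [i for i, r in self._table.items()
                if r["employeeId"] == employee_id]


class clsIdeasForEmployee:
    def __init__(self, employee_id, maker):
        self._ids = maker.ids_for_employee(employee_id)
        self._maker = maker
        self._storage = {}            # this partial collection's own cache

    def count(self):
        return len(self._ids)

    def item(self, idea_number):
        # Storage is still lazy: build on first request, then cache.
        if idea_number not in self._storage:
            self._storage[idea_number] = self._maker.make(self._ids[idea_number])
        return self._storage[idea_number]


table = {
    1: {"id": 1, "employeeId": 1, "text": "Free coffee"},
    2: {"id": 2, "employeeId": 2, "text": "Standing desks"},
}
maker = clsIdeaMaker(table)
mine = clsIdeasForEmployee(1, maker)
assert mine.item(0) is mine.item(0)   # built once, cached locally
assert maker.built == 1
```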
In general, not every value that could be stored in a database is allowed to be stored. It is very tempting at first to put every restriction in the persistent object itself: after all, it is just the one, restricted, object. But there are advantages to splitting it into a dumb data object and a chaperone.
The dumb data objects accept any data that matches the appropriate method parameter types; their sole responsibility is encapsulating records.
The chaperones are now responsible for the validity of their state.
One of the major advantages is testability: database actions can be tested without first having to build an entire object structure as a valid environment, and the chaperones can be tested without the need to set up a mock database. In fact, you can use the self-shunt pattern to test the chaperone objects.
The chaperone objects can now have a "mind of their own": sending warning messages, asking for user confirmation, etc. This behaviour is then no longer coupled with storing data. Data-storing objects must generally be reliable and having a mind of their own makes it more difficult to implement and test persistence. As mentioned before, testing the combination of storage and "pig-headedness" can be quite complicated.
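The split can be sketched like this. The rule that a name must not be empty is a made-up example restriction; the names are hypothetical.

```python
class clsEmployeeData:
    """Dumb data object: its sole job is encapsulating the record."""

    def __init__(self, fields):
        self._fields = dict(fields)

    def set_name(self, name):
        self._fields["name"] = name       # no questions asked

    def name(self):
        return self._fields["name"]


class clsEmployeeChaperone:
    """Responsible for the validity of the state, nothing else."""

    def __init__(self, data):
        self._data = data

    def set_name(self, name):
        if not name.strip():
            raise ValueError("an employee name must not be empty")
        self._data.set_name(name)

    def name(self):
        return self._data.name()


employee = clsEmployeeChaperone(clsEmployeeData({"id": 1, "name": "Alice"}))
employee.set_name("Bob")
assert employee.name() == "Bob"
```

Because the chaperone holds a plain data object, persistence can be tested with clsEmployeeData alone, and the validation rules can be tested without any database at all.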
We already saw that persistent objects can have their database id as a property. Objects with id properties are not a nice example of encapsulation. However, we must know the id value if we want to link the object to a record in the database. Therefore, some objects will have an id property, whether you like it or not. Especially in the 'dumb data' layer, this is not a code smell. Outside of this layer, you must ask yourself why you need the id there. If the id is something that is entered or shown in the user interface layer, you will not have much of a choice either. If your data objects are in a separate package or project, you can limit the visibility to that package.