Wrapping tables and records

This is the story of my way of describing database access in an object-oriented way. It is quite straightforward, but it matches poorly with the articles I have read. This document is mainly meant to order my thoughts on the subject. It is also meant to start some discussions.
Let's start simple. There's only one table, and we only read from it. There are a lot of tables like this in the real world, like the employees table of a company. Our application uses the central employees table, but is not allowed to modify it.

Factories

My first thought of building an object model for this, is creating two classes: a clsEmployees collection and a clsEmployee object. The clsEmployees collection does the database access and is a factory that generates clsEmployee objects.
Our application asks the clsEmployees collection for a clsEmployee object without worrying how the collection got it in the first place. That's the collection's own responsibility. It is tempting to pass an id to clsEmployee's constructor, but then clsEmployee would have to ask clsEmployees for its own data. This would mean that clsEmployee has to know clsEmployees somehow, which is not necessary and therefore not preferable. In languages that do not require you to handle the memory allocation of objects yourself, be warned that such bidirectional links must be broken before your application can shut down cleanly.
With less trouble, clsEmployees can pass clsEmployee's constructor its data or an object containing its data, like the Fields collection of a Recordset.
It would look like this in UML. The constructors are called "construct" in the images. The collection has a constructor as well, so that it can be tested with mock data in a temporary database. It is normally constructed at the start of the application.
The item method is a factory method that looks up the data in the DatabaseTable and returns a brand new clsEmployee object. I would like to limit the scope of clsEmployee's constructor to the collection only, but as far as I know only Eiffel gives me that possibility.
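To make this concrete, here is a minimal sketch in Python (the original presumably targets a VB-style environment; the class names Employee and Employees and the dict-as-table are illustrative assumptions). The collection receives its table at construction, so it can be fed mock data in tests, and its item method builds members from data rather than handing them an id to look up themselves:

```python
class Employee:
    """Member object; receives its data, not a way to fetch it."""
    def __init__(self, row):
        self._id = row["id"]
        self._name = row["name"]

    @property
    def name(self):
        return self._name

class Employees:
    """Collection/factory; alone responsible for database access."""
    def __init__(self, table):
        self._table = table  # e.g. a dict of rows keyed by id

    def item(self, employee_id):
        row = self._table[employee_id]   # the only database access
        return Employee(row)             # construct from the data

# usage with mock data, as the text suggests for testing
mock_table = {1: {"id": 1, "name": "Alice"}}
employees = Employees(mock_table)
```

Note that Employee only ever sees its own row; it has no link back to the collection.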

Laziness

Because generating a clsEmployee object is the responsibility of the clsEmployees collection alone, we can introduce an optimization: lazy computation. Lazy computation means that something is not computed until you ask for it and, if possible, once only. This means that the collection stores the generated objects internally to save database access time.
The item method is still a factory method, but instead of creating an object right away, it first checks if it is already present in the storage. If it is found, the existing object is returned. If not, a new object is created from the database, stored in the collection's own storage and passed as the return value.
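A sketch of the lazy variant, in Python with the same illustrative names; the internal storage is assumed to be a simple dict keyed by id:

```python
class Employee:
    """Member object built once from a database row."""
    def __init__(self, row):
        self.id = row["id"]
        self.name = row["name"]

class Employees:
    """Lazy collection: creates each member at most once."""
    def __init__(self, table):
        self._table = table
        self._storage = {}            # lazily filled internal storage

    def item(self, employee_id):
        if employee_id not in self._storage:
            # not seen before: create from the database and store
            self._storage[employee_id] = Employee(self._table[employee_id])
        return self._storage[employee_id]
```

Two calls to item with the same id now return the very same object.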
This approach has a few advantages and drawbacks. First, the object will never change after creation, whatever happens in the database. For a central employee table, this is not necessarily a drawback.
An advantage is that two object references representing the same clsEmployee object actually refer to the same object. A disadvantage is that the system's memory might fill up with objects that will never be used anymore.
It is up to the developer to judge how often the objects will be used and if cleanup actions are necessary.

This was all fairly straightforward. Note that I introduced two classes for my table: a collection class and a member class. The database handling is the responsibility of the collection, not of the member objects. This is something I rarely encounter in examples found on the net. In those examples, objects are often responsible for all their data traffic. As a "bridge" to the next sections, I'll give a few other reasons why a member object is not necessarily responsible for database access.
First, an object can never create itself. Object creation is something that is started before the object itself exists. Like, no robot can build itself. It can build copies of itself, or repair itself, but not build itself. And it is not something I would want either. If the object is stored in a database somehow, I would like to create the object from the database. So I would ask the database (represented by the collection class) to create the object for me.
Object deletion follows the same story. I cannot give an object the command to "get lost" if I'm holding it. And I am holding it when I invoke a method. I should tell someone else to forget about it: the collection class.
There is only one thing that is the responsibility of the object itself, and that is its own data. Now there's a tricky one. The object's own data is linked to the database, isn't it? So, the object is responsible for the data in the database! Well, it is responsible for the actual data, but not for the database handling. That is still the responsibility of the collection. So if an object gets modified by the application and wants its database representation to be updated, it should ask its collection (which represents the database) to do the update. This means that objects that can be modified should know their "parent", the collection.

Adding records

Let's consider another example. We have a clsIdeas collection, containing clsIdea objects. Although there is an employeeId field in the database, we will not consider any link with the employees table for now. We will look into that later. To create a clsIdea object, we ask the clsIdeas collection to create one for us, by invoking the add method of the clsIdeas collection. Note that the ID of the idea is generated by either the database or the collection.
Another approach is seen in Microsoft's Data Access Objects library: Some objects are created by a create method of a collection, but they are not added to that collection immediately. You have to use the collection's add method to make the created object persistent. An advantage of this is that you can store a "finished" object: you can set properties and fill collections contained by the object (for example, a TableDef object contains a Fields and an Indexes collection) before you add it to the database. However, each member object has to know its parent collection and whether it was already stored or not.
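A sketch of the first approach, where add creates and persists in one step; the names are illustrative, and here it is assumed the collection generates the id itself (the database could do this just as well):

```python
import itertools

class Idea:
    """Member object wrapping one record."""
    def __init__(self, row):
        self.id = row["id"]
        self.text = row["text"]

class Ideas:
    """Collection: adding persists the record and returns the member."""
    def __init__(self, table):
        self._table = table
        self._storage = {}
        # the collection generates ids (the database could do this too)
        self._next_id = itertools.count(max(table, default=0) + 1)

    def add(self, text):
        new_id = next(self._next_id)
        row = {"id": new_id, "text": text}
        self._table[new_id] = row        # persist first
        member = Idea(row)
        self._storage[new_id] = member   # then keep in the lazy storage
        return member

ideas = Ideas({})
first = ideas.add("free coffee on Fridays")
```

In the DAO-style create-then-add variant, add would instead take a finished member object and persist it at that point.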

Deleting records

Deleting seems even simpler than adding: just let the collection provide a delete method. This method then deletes the record from the database and from the collection's own storage. Once the collection's delete method has been called, no member object encapsulating the same record can be created anymore.
There is one little problem, however. When the collection removes the object reference from its own storage, other parts of the application might still hold a reference to it. The object is then removed from the collection, but it still exists elsewhere in the application! These references now point to an object that is not deleted, but merely obsolete. So, before the object is deleted by its collection, it should be marked as obsolete for whoever else is holding a reference to it. There are two ways to let the other parts of the application know that the object is no longer valid: raising an onDeletion event, or marking the object with an isDeleted property that careful code checks before use. Combining them is even better: anyone listening to events is informed immediately, prudent code can check first and the rest will have to learn "the hard way."
The onDeletion event can be implemented on both the collection and the member classes: objects holding a single reference receive the event from the member object, while objects with a collection of deletable member objects probably know their collection.
This whole business of deletion events only matters if you have references outside of the table wrappers (the lazy collection classes). The need to raise events can in many cases be overcome by using partial collections, which we will discuss later.
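A sketch combining the two notification mechanisms, with illustrative names (is_deleted as the "obsolete" marker, on_deletion registering event listeners):

```python
class Idea:
    """Member object that can warn its holders about deletion."""
    def __init__(self, row):
        self.id = row["id"]
        self.is_deleted = False          # the "obsolete" marker
        self._listeners = []

    def on_deletion(self, callback):
        """Register a callback for the onDeletion event."""
        self._listeners.append(callback)

    def mark_deleted(self):
        self.is_deleted = True
        for callback in self._listeners:
            callback(self)

class Ideas:
    """Collection: deletes the record, then warns remaining holders."""
    def __init__(self, table):
        self._table = table
        self._storage = {}

    def item(self, idea_id):
        if idea_id not in self._storage:
            self._storage[idea_id] = Idea(self._table[idea_id])
        return self._storage[idea_id]

    def delete(self, idea_id):
        member = self._storage.pop(idea_id, None)
        del self._table[idea_id]         # remove the record itself
        if member is not None:
            member.mark_deleted()        # raise the onDeletion event
```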

Modifying record properties

As mentioned before, a member object must know its parent collection to be able to update its properties in the database. When an update is necessary, the member object will ask its collection to update the database. But when is it necessary? You probably don't want an update each time a property (one database field) is changed. If your application is guaranteed to be the only instance accessing the database record, it might be sufficient to do all updates at the end of your application session. This is probably not the case if your database has referential integrity checks and your application uses more related tables.
All in all, you probably cannot get around implementing an update method on the member object, so that the update can be triggered manually.
Even if your update is triggered "manually", it is good to check whether an update is really necessary. It is not necessary merely because a property was set, but only if the value has actually changed. In that case, a private isDirty or hasChanged flag is set. Your update method checks this flag and only performs the update when it is true.
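A sketch of the dirty-flag mechanism in Python; the member knows its parent collection only for writing back, and all names are illustrative:

```python
class Employee:
    """Member object with a dirty flag; knows its parent for updates."""
    def __init__(self, row, parent):
        self._row = dict(row)
        self._parent = parent            # the collection, for writing back
        self._is_dirty = False

    @property
    def name(self):
        return self._row["name"]

    @name.setter
    def name(self, value):
        if value != self._row["name"]:   # setting is not changing
            self._row["name"] = value
            self._is_dirty = True

    def update(self):
        if self._is_dirty:               # skip needless database traffic
            self._parent.write(self._row)
            self._is_dirty = False

class Employees:
    def __init__(self, table):
        self._table = table

    def item(self, employee_id):
        return Employee(self._table[employee_id], self)

    def write(self, row):
        self._table[row["id"]] = dict(row)
```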

"Strong" and "weak" references

In a relational database, relations between tables are enforced with keys. The object ID we have been using to identify objects is often a primary key. If records in other tables have a relation to this table, those records have a field holding the primary key value of the related record in the first table.
In our object model, we now have a choice. Our methods can return an object or just its id. The id is like a voucher: it's worthless as it is, but you can always exchange it for something real. Using just the ID instead of the object itself is called a "weak reference". Passing the objects instead of just IDs is called a "strong reference".
Weak references are simple to program, but are not as clear as strong references. The drawback of using strong references is that it creates more references in your object model. The main advantage is a better encapsulation of the database. More on this subject is discussed in the section about layers in an application.
From the above database schema, you might derive a clsIdea object like the one on the left (weak reference), or the one on the right. I deliberately do not show what is stored in the object exactly. If you use strong references, you can store the ID internally, or the object. You can store the ID internally only if your object knows how to reach the clsEmployees collection.
Please note that storing the object does not mean that you use more memory. If you store an object, you generally only store a pointer to a memory location. Especially if you use lazy collections, any reference to a particular object actually points to the same object. The references themselves do not consume much memory.
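A sketch contrasting the two reference styles; IdeaWeak and IdeaStrong are invented names for illustration. The strong variant here stores the id internally plus a way to reach the collection, so it still benefits from the lazy collection's object sharing:

```python
class Employee:
    def __init__(self, row):
        self.id = row["id"]
        self.name = row["name"]

class Employees:
    """Lazy collection, as before."""
    def __init__(self, table):
        self._table = table
        self._storage = {}

    def item(self, employee_id):
        if employee_id not in self._storage:
            self._storage[employee_id] = Employee(self._table[employee_id])
        return self._storage[employee_id]

class IdeaWeak:
    """Weak reference: stores only the id, the 'voucher'."""
    def __init__(self, row):
        self.employee_id = row["employee_id"]

class IdeaStrong:
    """Strong reference: exchanges the voucher via the collection."""
    def __init__(self, row, employees):
        self._employee_id = row["employee_id"]
        self._employees = employees      # needed to resolve the id

    @property
    def employee(self):
        return self._employees.item(self._employee_id)
```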

Relations between tables

(or: member objects containing member objects from other collections)

Sometimes an object has a collection as one of its members. Our clsEmployee object for instance, might have an ideas method, returning a collection of ideas for that one employee. That method cannot return a clsIdeas collection, as that was the collection holding all ideas for all employees, not just this one.
This means we need an intermediate class, say clsIdeasForEmployee. The intermediate class has an item method that takes two parameters: the employee ID and the idea number. It also needs a countForEmployee method, so we can determine how many ideas there are for a specific employee. The "subcollections" often work on an iteration basis: we know which employee to look for in the clsEmployees collection, but we don't know beforehand which ideas he or she has.
This implies we have to change the lazy behaviour of the clsIdeas collection. Upon a request for data about an employee, we now need to read the entire group of ideas for that employee. Otherwise it is not possible to calculate the count. In fact, the count will generally be called first, to see if there is any data at all.
For the intermediate class, clsIdeasForEmployee, you have the choice between delegating all calls to item and count to clsIdeas.item and clsIdeas.countForEmployee and storing the subcollection in the clsIdeasForEmployee object itself, without delegation after the subcollection has been read from the database. If ideas can be added to an employee, delegation is probably the easiest way.
Looking at the UML scheme, you can see that the clsEmployees collection has to know the clsIdeas collection for passing it to the clsEmployee objects, which use it to create a clsIdeasForEmployee subcollection. In general, collections of "containing" objects should know all collections of the "contained" objects. Some constructors can get quite a large number of parameters this way. Later on in this document, we will see another way of implementing subcollections.
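A sketch of the delegating variant; for brevity the central collection returns raw rows rather than wrapped objects, and all names are illustrative:

```python
class Ideas:
    """Central collection over all ideas of all employees."""
    def __init__(self, rows):
        self._rows = rows                # each row has an employee_id field

    def for_employee(self, employee_id):
        return [r for r in self._rows if r["employee_id"] == employee_id]

    def count_for_employee(self, employee_id):
        return len(self.for_employee(employee_id))

class IdeasForEmployee:
    """Intermediate subcollection: delegates to the central collection."""
    def __init__(self, ideas, employee_id):
        self._ideas = ideas
        self._employee_id = employee_id

    def count(self):
        return self._ideas.count_for_employee(self._employee_id)

    def item(self, index):
        return self._ideas.for_employee(self._employee_id)[index]

class Employee:
    """Gets the central ideas collection from its own collection."""
    def __init__(self, row, ideas):
        self.id = row["id"]
        self._ideas = ideas

    def ideas(self):
        return IdeasForEmployee(self._ideas, self.id)
```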

Changes from the database

Until now, our little application was the only entity that modified or retrieved data in the database. But other applications (or even other instances of the same application) could do this just as well. Here we have a problem. The database generally cannot give a signal that the data we hold has changed, so we have to look for changes ourselves. In general, this calls for a refresh method in our member object. This method asks its collection to supply the current state of the object in the database. In addition, we need a refreshment strategy. In other words, we need to consider when to call the refresh method: automatically on every property getter, manually whenever the application wants a fresh state, or automatically but at most once per time interval. The first option seems ideal, but has a few drawbacks. First, it consults the database each time a field is queried, which is more often than each time a record is queried. It also means that the object can change while its properties are being queried. It then looks like you queried one object state, but you have in fact queried several states that may combine into an inconsistent state (a set of property values the object never had and never could have had).
To overcome this, you could make the refresh method public and call it whenever you want to start reading the state. You know the object remains consistent between two refresh calls.
And I wouldn't be me if I didn't try to find some in-between strategy to get some of both. The last option is something that I considered for an application once, but probably is the worst of the three. The idea was that a user could never accomplish the querying of the object in two states within, say, half a second. In my case, the user had to scan the same barcode two times and press a button in between. I was quite certain I could do the barcode check well within half a second while the time between two barcode checks was almost guaranteed to be more than half a second. The object would then automatically refresh when I called a getter and the last refresh was more than half a second ago. If you ever want to implement this, think of what can go wrong when the timing is bad...
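A sketch of the second and third strategies combined; the half-second interval and all names are illustrative, and the timing caveat from the text applies in full:

```python
import time

class Employee:
    """Member with an explicit refresh and an optional throttled one."""
    def __init__(self, row, parent, min_interval=0.5):
        self._row = dict(row)
        self._parent = parent
        self._min_interval = min_interval
        self._last_refresh = time.monotonic()

    def refresh(self):
        """Second strategy: the application refreshes explicitly."""
        self._row = dict(self._parent.read(self._row["id"]))
        self._last_refresh = time.monotonic()

    def _maybe_refresh(self):
        # third strategy: auto-refresh, but at most once per interval
        if time.monotonic() - self._last_refresh > self._min_interval:
            self.refresh()

    @property
    def name(self):
        self._maybe_refresh()
        return self._row["name"]

class Employees:
    def __init__(self, table):
        self._table = table

    def read(self, employee_id):
        return self._table[employee_id]

    def item(self, employee_id):
        return Employee(self._table[employee_id], self)
```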

Partial collections

If there is no cross-over, that is, no member object is part of more than one collection, you can use partial collections. You then have a separate storage for each collection, but the building mechanism can be shared. The advantage is that looking the members up is now faster, as the storages themselves are smaller (your data is better organized). On the other hand, the application gets slightly bigger. One of the beauties of this mechanism is the separation of the generation of the members and their storage (which is still lazy).
So in our example, each clsEmployee object now has a partial clsIdeasForEmployee collection, that stores the clsIdea objects in its own storage, but gets them from a central clsIdeaMaker object. It's the clsIdeaMaker that converts the data in the database to an actual object.
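A sketch of a partial collection sharing a central maker; names are illustrative, and each partial collection is assumed to know which ids belong to its employee:

```python
class Idea:
    def __init__(self, row):
        self.id = row["id"]
        self.text = row["text"]

class IdeaMaker:
    """Shared builder: turns database rows into objects, stores nothing."""
    def __init__(self, table):
        self._table = table

    def make(self, idea_id):
        return Idea(self._table[idea_id])

class IdeasForEmployee:
    """Partial collection: its own lazy storage, a shared maker."""
    def __init__(self, maker, idea_ids):
        self._maker = maker
        self._idea_ids = idea_ids        # the ids of one employee's ideas
        self._storage = {}

    def item(self, idea_id):
        if idea_id not in self._storage:
            self._storage[idea_id] = self._maker.make(idea_id)
        return self._storage[idea_id]
```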

"Dumb data" layers

In general, not every value that could be stored in a database is allowed to be stored. It is very tempting at first to put every restriction in the persistent object itself: you then have just the one, restricted, object. But there are advantages to splitting it into dumb data objects and chaperones.
The dumb data objects accept any data that matches the appropriate method parameter types, and their sole responsibility is encapsulating records.
The chaperones are now responsible for the validity of their state.
One of the major advantages is testability: database actions can be tested without the need to build an entire object structure as a valid environment first; the chaperones can be tested without the need to set up a mock database. In fact, you can use the self-shunt pattern to test the chaperone objects.
The chaperone objects can now have a "mind of their own": sending warning messages, asking for user confirmation, etc. This behaviour is then no longer coupled with storing data. Data-storing objects must generally be reliable and having a mind of their own makes it more difficult to implement and test persistence. As mentioned before, testing the combination of storage and "pig-headedness" can be quite complicated.
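A sketch of the split, with an invented non-empty-name rule standing in for a real business restriction:

```python
class EmployeeRecord:
    """Dumb data object: accepts anything, only wraps the record."""
    def __init__(self, row):
        self.row = dict(row)

class EmployeeChaperone:
    """Chaperone: guards the validity of the record's state."""
    def __init__(self, record):
        self._record = record

    def set_name(self, value):
        if not value.strip():            # the (invented) business rule
            raise ValueError("name must not be empty")
        self._record.row["name"] = value
```

The record can be tested against a database without any validation logic in the way, and the chaperone can be tested with a plain record and no database at all.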

ID properties

We already saw that persistent objects can have their database id as a property. Objects with id properties are not a nice example of encapsulation. However, we must know the id value if we want to link the object to a record in the database. Therefore, some objects will have an id property, whether you like it or not. Especially in the 'dumb data' layer, this is not a code smell. Outside of this layer, you must ask yourself why you need the id there. If the id is something that is entered or shown in the user interface layer, you will not have much of a choice either. If your data objects are in a separate package or project, you can limit the visibility to that package.