There seems to be a widening split between mainframe and PC programmers in the design of user interfaces and databases. At the root of it is the not so new concept of "object oriented" programming. This month I take a new look at the contrasts and similarities between object oriented programming and relational databases.
In the large systems world, the state of the art of database technology is SQL. In turn, SQL is based on the concept of a relational database. This has been the dominant technology now for anywhere from five years to twelve or more years, depending upon how you view it.
All relational concepts are based on work done by Dr Ted Codd published in the early seventies. SQL is by no means a perfect implementation of relational concepts, but it has reached the top of the tree. You can run IBM's preferred database on anything from OS/2, OS/400 and VSE through VM to MVS. The underlying database may vary, but it is always reached through SQL.
But in the PC world, the latest talk is of Object Oriented techniques. We can develop better, reusable code in less time using Object Oriented languages, say PC pundits. The display metaphors of Windows and the Macintosh are well suited to object orientation. Some of the database suppliers are talking of object oriented databases.
How can we reconcile using SQL, an old technology, in a world where junior programmers tell us that "object oriented" is the solution to development backlogs?
Dr Bruce Lindsay, a researcher with IBM into relational technology, and a long standing member of the SQL and DB2 development teams, gave his views on this to a group from GUIDE, the IBM user group.
His basic thesis is that "Object Oriented" programming involves processing data item by item (when applied to database access), and so is inherently a Third Generation Language methodology; which we chose to put behind us as old fashioned ten years ago!
To understand this argument, we need to remove some of the mystique of both relational and object oriented technologies, by translating some of the special terms used to talk about them.
Mostly, it means using a different vocabulary from what we are used to. For example, in "Object Oriented", we could say that a program works because instances of objects send messages to their methods. In normal programming terms, we could say that fields from a database record are processed by procedure calls to subroutines that process these data elements.
"Object Oriented" | Relational |
---|---|
message | procedure call |
method | subroutine body |
object | row or tuple (record) |
instance variable | column (field) |
The basic terminology given above is the key to understanding Object Oriented programming ("OO" or "OOP") for us older programmers.
Object Oriented adds in a few more special concepts to our vocabulary. These concepts are put forward as the main advantages of the Object Oriented approach.
When you first study OOP, a new language comes with it. Terms like "objects" and "classes" come out, along with "encapsulation", and "inheritance".
The underlying concept is that of "types", or objects. All data in the system is an instance of a type, or "class". A new type can be built up from the definitions of other types, by describing how it differs from it's parent types. So new classes can be less general examples of the parent types, building up into a hierarchy of types.
Subtypes are more specialised than their parent(s); in fact sub- types form a superset of their parents, incorporating variations on the methods (subroutine calls, remember?) used to access types. One interesting property of a subtype is that you can substitute a subtype for any instance of it's parent types. This is especially important, as it is one of the ways by which Object Oriented programming allows code reuse. I'll keep coming back to examples of code reuse.
A number of jargon terms are used to describe objects. One such is encapsulation, which is another of the foundations of this new school. In encapsulation, the behaviour, or implementation, of a type is kept clearly separated from it's use. So the code written to define behaviour of a type is part of the object definition.
One of the words that I dropped just above was "inheritance"; this mechanism is probably the property of OOP that contributes the most to reducing programming effort, so it is important to understand it properly.
A type works exactly like it's parent unless you tell it otherwise. So all methods (subroutines) are borrowed from the parent, unless you explicitly redefine them. An object can have multiple parent types (multiple inheritance); when identically named methods come from both parents, the programmer must be resolve the conflict. How exactly this is done varies from implementation to implementation.
Those are the main features of OOP. Any object oriented programming language will offer inheritance and encapsulation. We can see how these features in particular allow a rigid framework of the definition of data, which allows programmers to easily reuse code. Now that we have the basic terminology clear, I would like to look a little deeper to reveal a few restrictions of this technique.
Object Oriented Programming is Navigational. This is the main distinction between Third and Fourth generation languages; so all Objected Oriented Programming Languages are third generation. Although data is disguised as objects, you must explicitly define all handing of data.
Object relationships must be managed by programs or by methods; explicitly, not implicitly. So, for example, you can't have an implicit display routine, where the request to locate a record (or "tuple") will result in the relevant data items being displayed on the screen, or in a report. Instead, the programmer must explicitly write code to move data from the object to the screen.
This restriction also means that there is no automatic handling of collections of objects (one to many relationships). In addition, there is no referential integrity; that means that we can't use query optimisation at the database level (which is important for the language designers); the programmer must do optimisations each time in his application, instead of being able to allow the system to do it for him.
Because of the way that object definitions are concealed from the programmer, and the way that methods are implemented as sub- routines, OOPS implement something known as an Object Identifier. This is a hidden datafield that gives each object a unique id, allocated by the language mechanism.
This conflicts with the relational "view" concept, as views are derived and encapsulated from the underlying data, according to the query. In the relational model, data is independent from the physical implementation of the data structures. A view will allow you to select only the data items that are relevant to a particular application; unrequired data is concealed and protected from the application.
A "message" (procedure call) always has to go to a specific object (in other words, the object id is a mandatory parameter to the call). This is an obstacle to encapsulation of data, as methods must always have full access to each object. So OO doesn't allow us to have logical view of data very easily.
Having broken down some of the jargon of the Object Oriented programmers, it is only fair to see how we can match this off against standard Relational model descriptions of information.
Relational models of data came about because we wanted a way of managing data that reflected the way we thought about our data.
A database is a repository of information that can share it's information between users and programs that need the data. The information held is "persistent". That means it is consistent and repeatable. An employee's address should be the same from where ever you need to access it, and however it is updated. To do this, information has to be stored in a form that is independent of the programs that use it (can you now see how we might be diverging from the Object Oriented model?).
As information is valuable, a database must be protected. Only authorised programs and users should have access the individual elements of data. The information should be protected so it is recoverable in the event of accidents, and (most important of all) has "integrity". This reflects back on "consistency"; a database may use many different internal mechanisms to optimise storage. If an employee's address changes, then any references to that address should also be changed, all at the same time.
Part of the Relational model is the "Data Model", a way of describing information held in a Relational database independently of how the data is stored. It uses it's own special jargon to describe data, and allows us to use a special language, called the Data Manipulation Language (or DML) to access information.
In Object Orient terminology, we could refer to the "tuples" and "relations" as constructors, and DML operations such as "select", "join" and "project" as methods.
The DML will work in an identical way no matter how the organisation of the data changes, based on association of related data. This is not possible in the Object Oriented model, with it's insistence on associating language constructs and operations with the physical structure of the data.
For example, a view that takes information from several different physical structures (objects), or derived data items, that are calculated from other data items, are not possible in the current Object Oriented languages.
The Relational model includes a number of provisions to protect data integrity within the Data Definition Language and the DML. These are extensions to computer languages required for relational access, such as NULL values, data ranges, distinct values, rules and triggers and so on.
Most of these things could be included in the Object Oriented model by a few extensions. Some of the newer ideas are already part-way there. For example, the concept of composite, or complex, objects. This is where an object is structured to contain other objects or data formats.
We could do this in older database models, like the CODASYL model, in a limited way, by copying the data items to each new location. In Object Oriented models, the component objects are held together by passing references (pointers) to the objects. In the Relational model, we pass the value of an object, to preserve integrity; or we pass a description (an SQL statement) of the data required.
This matches very neatly the trend towards late binding of data and operators in Object Oriented languages.
The clash between the Object oriented and Relational models is one between database navigation and query based data access. An Object Oriented language must explicitly search for and process each of a set or records, with explicit code to manage data integrity. A Relational language automatically understands and conceals the architectural framework of the database.
Objects are fixed, rigid structures. A Relational query is very flexible, and can adapt to changes in the underlying physical structure, or spontaneous changes in the nature of a user's request.
In concept, it would be possible to revise the relational model to benefit from objects. Adding types to SQL, for instance, would dramatically increase the power of the language. That way, we could specify that a particular field was "money" or "date", instead of having to force money into a generalised "number" field. This would mean that the field definition would match what the data really was. This would allow the SQL language designers to select operators at run-time, so that programmer defined operations could be set up to work on "money" or "time" values. This would also allow much better optimisation of queries, as knowledge about the data could be applied to the search algorithms. Of course, in practice, many SQL real-world implementations _do_ allow type definitions, simply because it makes everybody's life much easier, even though it may not conform to "pure" relational theory.
Some of the concepts of object oriented programming have an immediate application. In particular, the concept of types and inheritance, combined with composite objects, contribute to programming efficiency in any language. However, when applied to database design, relational technology has more to offer. The ability to embed small, non-procedural queries in programs, and the concept of a logical view into an abstract database are vital. Object oriented programming does not address the fundamental problem of database; that of integrity. Referential integrity absolutely requires that all data is passed by value; the use of objects forces data passing by reference. This is acceptable within a program, but only just; in dealing with a database, the lack of referential integrity is unworkable.
Object Oriented programming requires you to use special programming languages, such as C++, Objective C or Smalltalk. These languages are inherently Third Generation, or procedural. The new gurus of Object Oriented programming have ignored the productivity benefits gained by Fourth Generation language design. The relational model gives you an independent, non- procedural language (SQL), which can be used with any other language. SQL procedures can be embedded in any normal language, Fourth Generation or Third Generation, including Object Oriented languages. This has the advantage of preserving your investment in existing code. Millions of lines of COBOL and other languages are used in large application suites are still running happily today. These can be _reused_ in gradual upgrades and integrated with SQL calls, gaining the advantages of relational databases, instead of having to suffer a complete rewrite in an objected oriented language.
The new technology is good; we can't doubt that. But we can get better results faster today by making use of existing code.