Monday, December 1, 2008

Storage Toolbox

The Storage Toolbox concept is very simple: we need a toolbox to handle data preservation. By data preservation we mean persistence (storing in a hard-copy location) and/or prevalence (storing in a soft-copy location).

Data preservation is a common issue in enterprise systems. Historically, data was preserved on hard disk, i.e. in files. Files were not transactional, and that led to the need for database management systems (DBMS). Nowadays systems interact with a DBMS by means of SQL statements. The interaction is mediated by special software known as a driver. Different technologies exist for those drivers. In Java, JDBC drivers are used. Other technologies include ODBC drivers and ADO drivers.

Interacting with a DBMS via SQL is not that simple. Each DBMS vendor incorporates special SQL syntax. These differences in syntax create different SQL dialects. This is not good for the application programmer. If, on the other hand, you use only standard SQL syntax, your program will not benefit from the DBMS's performance-boosting features.

Dependency on DBMS SQL dialects and features (other than dialects, like procedures and triggers) really implies short-lived applications. If you need to change to another DBMS product, you cannot do it without the cost of reimplementing the DBMS communication layer (supposing you have one to begin with; if you do not, the cost will be even higher).

The goal of the Storage Toolbox is to provide an abstraction layer that isolates all these problems from the application programmer and the application code.

Data Storage

The Storage Toolbox does not assume the data is stored in a DBMS. Today DBMS are commodities, and modern development techniques hunger for more loosely coupled data repositories. Two forces drive this choice: 1) unit-testing-oriented programming demands that the storage features be pluggable so they can be replaced by stubs or mocks; 2) more RAM available at a lower cost enables applications to perform better if the data objects are kept in memory for longer periods of time. This newer approach of transactional data repositories in RAM is called prevalence, and it is now possible and competitive with the DBMS option. At the end of the day the data still needs to be stored in hard-copy repositories, but not destroying and recreating objects from DBMS data all the time improves overall application performance.

In-memory data stores are also very useful for testing and development, as they allow you to delay the database design until after you have defined and tested the data model. They also provide an intuitive option for an object cache.

A more modern approach to data models takes advantage of object-oriented techniques and isolates the developer from table and column searching issues, providing object search and editing techniques instead.

The MiddleHeaven Storage Toolbox uses the concept of a logical DataStorage managed by a StoreKeeper. Store keepers are mediators for the real data preservation mechanics, whether a DBMS, an XML file, a prevalent system or a common in-memory List. The MiddleHeaven Storage Toolbox provides total abstraction of the data preservation features.

Storing and Retrieving

For every database system (managed or not) there is always a set of basic operations you need. You need to be able to add new data to the storage and to retrieve that data. Further, you commonly need to update data already in storage. Finally, you need to be able to delete data from the storage (even though deletion operations are considered harmful, as they may amount to information destruction).

Java being an object-oriented language, all data needs to be placed in objects, and we find two classifications of those objects: objects that contain only plain data (like text, dates, logic values, numeric values) and objects that contain aggregations of the first kind. The first are classified as primitives (as they are not composed) and the second are named data aggregates. In Java, data aggregates are often objects with a bean-like contract (a bunch of attributes with no associated behaviour other than reading and writing those attributes).

In an object-oriented application it is natural to use beans and primitives to represent data. Going further, it is natural to think of these beans as the data representations of our application entities. The MiddleHeaven Storage Toolbox thus chooses to manipulate these beans and primitives directly and not to expose the real storage structures (tables, maps, etc.) to the programmer.

When retrieving data from a data storage, it is often possible to retrieve the same data for different purposes. Normally a query-based approach is used. DBMS use SQL statements to build those queries, but from an object-oriented and Java perspective SQL is a poor approach. SQL is not OO and it is really used more like a protocol than as a query object (even though it is one).

The MiddleHeaven Storage Toolbox chooses to use object-oriented structures, named Criteria, to provide this query object feature. SQL or any other "protocol"-like query language (like XPath or XQuery, for XML) is not used by the programmer. Instead, those languages are used by the StoreKeeper in order to communicate with the underlying data storage. What this really means is that you program your application to talk to a DataStorage and that's it. If you later decide to change the keeper (i.e. the underlying storage technology), you simply can.
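A minimal sketch of that separation may help. The types `SketchStoreKeeper`, `InMemorySketchKeeper` and `SketchDataStorage` are hypothetical and only illustrate the idea; they are not the actual MiddleHeaven interfaces. The application talks to the storage only, so the keeper behind it can be swapped, for example for an in-memory one in unit tests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical keeper contract; the real StoreKeeper interface is not shown in this article.
interface SketchStoreKeeper {
    void insert(Object entity);
    void remove(Object entity);
    List<Object> findAll(Class<?> type);
}

// Keeper backed by a plain in-memory List, handy as a stub in unit tests.
class InMemorySketchKeeper implements SketchStoreKeeper {
    private final List<Object> items = new ArrayList<Object>();

    public void insert(Object entity) { items.add(entity); }

    public void remove(Object entity) { items.remove(entity); }

    public List<Object> findAll(Class<?> type) {
        List<Object> result = new ArrayList<Object>();
        for (Object item : items) {
            if (type.isInstance(item)) {
                result.add(item);
            }
        }
        return result;
    }
}

// The application only sees this class; it never learns which keeper is behind it.
class SketchDataStorage {
    private final SketchStoreKeeper keeper;

    SketchDataStorage(SketchStoreKeeper keeper) { this.keeper = keeper; }

    <T> T store(T entity) {
        keeper.insert(entity);
        return entity;
    }

    void remove(Object entity) { keeper.remove(entity); }

    <T> List<T> all(Class<T> type) {
        List<T> typed = new ArrayList<T>();
        for (Object o : keeper.findAll(type)) {
            typed.add(type.cast(o));
        }
        return typed;
    }
}
```

Replacing `InMemorySketchKeeper` with a keeper that talks to a DBMS or an XML file would not change any code that uses `SketchDataStorage`.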

Something about entities

What is an entity? An entity is something whose instances have identity. Identity is an intrinsic property that is different for each instance of the entity (i.e. each instance has its own identity). Think of a person's identity. Which property would you use as identity? The right answer is: none. None of the properties that characterize a person (height, eye color, fingerprints, DNA) is the person's identity. Some, like DNA and fingerprints, are closely related to it, but they are only related; they are not the identity of the person itself. So identity is something abstract. Mathematics is almost solely based on the concept of identity. When you write 2 = 3 it means "the identity of 2 is the same as the identity of 3". (This is false because identity is, by definition, different for each number.)

In Java all objects have identity. Each object is different from every other object. You can assert and compare identity with the == operator (= is the assignment operator, not the identity operator). However, this JVM-enforced identity is too strong for enterprise application purposes. You need an identity that you can define and compare. This means you must allow two or more objects that are not the same object to still represent the same instance of the entity, i.e. to share a common identity.

Throughout history, several pieces of information about an entity have been used to harness or simulate identity. Names and identification numbers are the most common. However, today the best property you can use is a property with no meaning, i.e. a "virtual" property that you create specially to decide on the equality of identity. Normally an integer number suffices, even though sometimes you need a Universally Unique Identifier (UUID).

The MiddleHeaven Storage Toolbox uses the Identity type as an abstraction for the identity property. Implementations can then choose a suitable implementation of Identity according to the entity at hand.
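To make the difference between JVM identity and a defined identity concrete, here is a small sketch. The names `SketchIdentity`, `LongSketchIdentity`, `UuidSketchIdentity` and `PersonSketch` are assumptions made for illustration; the contract of the real Identity type is not shown in this article.

```java
import java.util.UUID;

// Hypothetical marker type standing in for the toolbox's Identity abstraction.
interface SketchIdentity { }

// Identity backed by a meaningless integer key.
final class LongSketchIdentity implements SketchIdentity {
    private final long value;

    LongSketchIdentity(long value) { this.value = value; }

    @Override public boolean equals(Object other) {
        return other instanceof LongSketchIdentity
                && ((LongSketchIdentity) other).value == value;
    }

    @Override public int hashCode() { return Long.hashCode(value); }
}

// Identity backed by a UUID, for cases where an integer is not enough.
final class UuidSketchIdentity implements SketchIdentity {
    private final UUID value;

    UuidSketchIdentity(UUID value) { this.value = value; }

    @Override public boolean equals(Object other) {
        return other instanceof UuidSketchIdentity
                && ((UuidSketchIdentity) other).value.equals(value);
    }

    @Override public int hashCode() { return value.hashCode(); }
}

// Entity whose equality depends only on its identity, never on its data.
class PersonSketch {
    private final SketchIdentity identity;
    private final String name;

    PersonSketch(SketchIdentity identity, String name) {
        this.identity = identity;
        this.name = name;
    }

    String getName() { return name; }

    @Override public boolean equals(Object other) {
        return other instanceof PersonSketch
                && ((PersonSketch) other).identity.equals(identity);
    }

    @Override public int hashCode() { return identity.hashCode(); }
}
```

Two `PersonSketch` objects created with the same identity value are different objects for the == operator, yet `equals` treats them as the same entity instance even when their names differ.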

Model

Illustration 1: Storage Toolbox main types

The MiddleHeaven Storage Toolbox implements an agnostic domain store. This means it is not limited to DBMS queries but can be used with other technologies that can serve as databases, e.g. XML.

The main types are the DataStorage, Criteria and Query interfaces. DataStorage allows interaction with the data storage. Criteria objects are implementations of the Query Object pattern and can be used to specify complex queries. The DataStorage converts them into Query objects that represent the query results. Nothing is said about the moment at which the query is really performed on the data storage. By design, the query should only be performed when one of the Query methods is invoked.

Extra information about how the query should behave can be passed as a second parameter. These hints inform the data storage how the data will be read, which allows for optimization using patterns like Fast Lane Reader or Flyweight.

Criteria objects can be built by the programmer directly, but the CriteriaBuilder class provides a fluent interface for this task. Also, using the CriteriaBuilder you end up with code that closely resembles an SQL query, being easy to read and change, but with the advantage of strong typing. Criteria objects are created by invoking the search method on the CriteriaBuilder. You can use a static import to further simplify the query, as shown here:

    Criteria someCriteria = search(Subject.class)
        .and("name").not().eq("Jack")
        .orderBy("name").asc()
        .all();

Code 1: CriteriaBuilder example
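How such a fluent interface can collect a query object is sketched below. This is not the MiddleHeaven CriteriaBuilder: `SketchCriteria` and `SketchCriteriaBuilder` are hypothetical, the sketch only supports equality constraints on String fields, and it evaluates the criteria against rows held as in-memory maps rather than against beans.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical query object: equality constraints plus an optional ordering field.
class SketchCriteria {
    private static final class Constraint {
        final String field;
        final boolean negated;
        final String value;

        Constraint(String field, boolean negated, String value) {
            this.field = field;
            this.negated = negated;
            this.value = value;
        }
    }

    private final List<Constraint> constraints = new ArrayList<>();
    private String orderField;

    void addEquals(String field, boolean negated, String value) {
        constraints.add(new Constraint(field, negated, value));
    }

    void setOrderField(String field) { orderField = field; }

    // Evaluates the criteria in memory; a keeper for a DBMS would generate SQL instead.
    List<Map<String, String>> apply(List<Map<String, String>> rows) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> row : rows) {
            boolean keep = true;
            for (Constraint c : constraints) {
                boolean equal = c.value.equals(row.get(c.field));
                if (equal == c.negated) {
                    keep = false;
                }
            }
            if (keep) {
                result.add(row);
            }
        }
        if (orderField != null) {
            // Assumes every remaining row has a value for the ordering field.
            result.sort((a, b) -> a.get(orderField).compareTo(b.get(orderField)));
        }
        return result;
    }
}

// Each method returns the builder itself, which is what makes the calls chainable.
class SketchCriteriaBuilder {
    private final SketchCriteria criteria = new SketchCriteria();
    private String currentField;
    private boolean negated;

    static SketchCriteriaBuilder search() { return new SketchCriteriaBuilder(); }

    SketchCriteriaBuilder and(String field) {
        currentField = field;
        negated = false;
        return this;
    }

    SketchCriteriaBuilder not() { negated = !negated; return this; }

    SketchCriteriaBuilder eq(String value) {
        criteria.addEquals(currentField, negated, value);
        return this;
    }

    SketchCriteriaBuilder orderBy(String field) {
        criteria.setOrderField(field);
        return this;
    }

    // Ascending is the only order this sketch supports, so this is a no-op.
    SketchCriteriaBuilder asc() { return this; }

    SketchCriteria all() { return criteria; }
}
```

The resulting object carries no SQL at all, which is what lets a keeper translate it into whatever query language its storage understands.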

StoreKeeper

All DataStorage operations are really delegated to a StoreKeeper. The StoreKeeper is responsible for actually changing or retrieving the data in the real data storage. MiddleHeaven currently implements the DataBaseStorageKeeper for access to a DBMS via JDBC, an XMLStorageKeeper and an InMemoryStorageKeeper. At this point these store keepers are being explored in order to obtain a model for the keepers that is agnostic enough across different data preservation APIs.

The goal of the DataBaseStorageKeeper is to be able to communicate with any DBMS. In order to accommodate several different dialects, a DatabaseDialect type was introduced. It performs all the SQL/JDBC-related operations, including the generation of SQL statements. DatabaseDialect encapsulates the creation of commands for most of the standard SQL operations, including creating tables and reading and changing the database model. The DataBaseStorageKeeper obtains the commands from the dialect and then performs the operations in a DBMS-independent way. Out of the box, at the time of writing, the MiddleHeaven Storage Toolbox supports the PostgreSQL 8.3, HSQL 1.8 and SQL Server 2005 dialects. DataBaseStorageKeeper uses a javax.sql.DataSource from which it obtains a java.sql.Connection.
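The dialect idea can be sketched with a single operation that vendors spell differently: limiting the number of rows returned. The interface `SketchDialect`, its method and the `subject` table name are assumptions made for this example; the real DatabaseDialect API is not shown in this article.

```java
// Hypothetical dialect contract with a single command-building method.
interface SketchDialect {
    // Builds a SELECT that returns at most maxRows rows from the given table.
    String selectFirstRows(String table, int maxRows);
}

// PostgreSQL limits rows with a trailing LIMIT clause.
class PostgreSqlSketchDialect implements SketchDialect {
    public String selectFirstRows(String table, int maxRows) {
        return "SELECT * FROM " + table + " LIMIT " + maxRows;
    }
}

// SQL Server limits rows with TOP right after SELECT.
class SqlServerSketchDialect implements SketchDialect {
    public String selectFirstRows(String table, int maxRows) {
        return "SELECT TOP (" + maxRows + ") * FROM " + table;
    }
}

// Keeper-side code asks the dialect for the command text and stays vendor neutral.
class SketchCommandFactory {
    private final SketchDialect dialect;

    SketchCommandFactory(SketchDialect dialect) { this.dialect = dialect; }

    String firstSubjects(int maxRows) {
        // "subject" is a made-up table name used only for this illustration.
        return dialect.selectFirstRows("subject", maxRows);
    }
}
```

Supporting another DBMS then means adding one more dialect class, while `SketchCommandFactory` and everything above it stay untouched.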

None of the StorageKeepers performs any transactional control. However, if the underlying storage is not transactional, a keeper may help by supporting integration with the Transactional Toolbox.

Storable and StorableEntityModel

A StoreKeeper handles collections of Storables. Each object passed to the DataStorage is converted to a Storable before being passed to the keeper. Storables allow control of the persistence properties of the object, beyond the data provided by the object itself. The fields and values of the storable are abstracted by a StorableFieldModel. The set of StorableFieldModels for all fields forms the StorableEntityModel for the entity. StorableEntityModel is an agnostic abstraction of the entity from the keeper's point of view. For now, a simple implementation based on the StorableDomainModel is provided. The rationale is that the persistence model for the entity is conceptually decoupled from the entity model itself. The goal is to provide the means to implement complex multiplicity relations between entities and the underlying data structures (tables, files, etc.), thus not relying on a simple one-to-one multiplicity.

The model also acts as a factory for entity instances. This is essential to mapping and loading new data objects from the underlying data.

Under the hood

MiddleHeaven's Storage Toolbox is an agnostic Domain Store pattern implementation. Some APIs, like JPA and Hibernate, already support this pattern. Conceptually it is possible to implement a StoreKeeper that uses those APIs; however, they are extremely focused on SQL and DBMS, making direct use of concepts like table, primary key and automatic key generation. DataStorages also have key (identity) generation, but it is totally decoupled from the DBMS. For example, when the DBMS natively supports sequences, the keeper can return an encapsulation of that in a Sequence object. When it does not, the keeper can simulate the sequence by other means.
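One way to picture the decoupled key generation is the sketch below. `SketchSequence` and the classes around it are hypothetical names, not the toolbox's Sequence type. The native variant wraps a supplier that, in a JDBC keeper, might run a vendor query such as PostgreSQL's `SELECT nextval('subject_seq')`, while the simulated variant counts in memory.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical abstraction: callers only ever ask for the next key.
interface SketchSequence {
    long next();
}

// Simulated sequence for storages without native sequence support.
class SimulatedSketchSequence implements SketchSequence {
    private final AtomicLong counter;

    SimulatedSketchSequence(long start) { counter = new AtomicLong(start); }

    public long next() { return counter.getAndIncrement(); }
}

// Encapsulates a native source of values, for example a query sent to the DBMS.
class NativeSketchSequence implements SketchSequence {
    private final LongSupplier source;

    NativeSketchSequence(LongSupplier source) { this.source = source; }

    public long next() { return source.getAsLong(); }
}

// The keeper decides which variant to hand out; the caller cannot tell the difference.
class SketchKeySource {
    static SketchSequence sequenceFor(LongSupplier nativeSource) {
        if (nativeSource != null) {
            return new NativeSketchSequence(nativeSource);
        }
        return new SimulatedSketchSequence(1L);
    }
}
```

A simulated sequence like this one loses its position when the process stops, so a real keeper would have to persist the counter somewhere; the sketch leaves that out.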

On the other hand, MiddleHeaven's Storage Toolbox lacks many of the optimizations performed by Hibernate or JPA, like generational caching. It is designed to be able to provide the same functionality by decorating (Decorator pattern) data storages with other data storages, enabling the application to use a cached version of any underlying data storage. This is still a work in progress.
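Since this part is still a work in progress, the following is only a sketch of what decorating one storage with another could look like. `SketchLookupStorage`, `SlowSketchStorage` and `CachingSketchStorage` are made-up names, and the lookup-by-id method is an assumption chosen to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage contract reduced to a single lookup operation.
interface SketchLookupStorage {
    Object findById(long id);
}

// Stands in for an expensive underlying storage and counts how often it is hit.
class SlowSketchStorage implements SketchLookupStorage {
    int reads = 0;

    public Object findById(long id) {
        reads++;
        return "entity-" + id;
    }
}

// Decorator: same interface as the storage it wraps, with a cache in front of it.
class CachingSketchStorage implements SketchLookupStorage {
    private final SketchLookupStorage delegate;
    private final Map<Long, Object> cache = new HashMap<>();

    CachingSketchStorage(SketchLookupStorage delegate) { this.delegate = delegate; }

    public Object findById(long id) {
        Object cached = cache.get(id);
        if (cached == null) {
            // Cache miss: ask the wrapped storage once and remember the answer.
            cached = delegate.findById(id);
            cache.put(id, cached);
        }
        return cached;
    }
}
```

Because the decorator implements the same interface as the storage it wraps, application code can receive either one without knowing which it got.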

The trade-off for this toolbox was not to depend on any other API, as none of those available is agnostic enough. This implies a greater effort to implement and test the toolbox, but provides greater flexibility.

Storable is an internal type used to control persistence state. Any object passed to the DataStorage is converted into a Storable. This is achieved by means of bytecode manipulation, allowing the original object to be replaced by an object that extends the original class and implements the Storable interface. This is one of the reasons why the store method returns an object of the same type. This object is not the same object passed to the method; it is now a managed object. If this object is further used and changed by the application, those alterations are recorded. When the object is again passed to the store method, the data storage can identify the changes and act accordingly. If, for example, no changes were made, the method will simply return. Also, any object returned by the Query interface is a managed object, for the same reasons.
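The managed-object behaviour can be illustrated without bytecode manipulation by writing by hand the kind of subclass that would otherwise be generated. Everything in this sketch is hypothetical: `SketchStorable`, `SubjectBean`, `ManagedSubjectBean` and `SketchTrackingStorage` are illustration names, and the real toolbox creates the subclass at runtime rather than in source code.

```java
import java.util.Objects;

// Hypothetical stand-in for the internal Storable contract.
interface SketchStorable {
    boolean isDirty();
    void markClean();
}

// A plain bean as the application would write it.
class SubjectBean {
    private String name;

    public String getName() { return name; }

    public void setName(String name) { this.name = name; }
}

// Handwritten version of the subclass that bytecode manipulation would produce.
class ManagedSubjectBean extends SubjectBean implements SketchStorable {
    private boolean dirty;

    @Override public void setName(String name) {
        // Record the alteration only when the value really changes.
        if (!Objects.equals(name, getName())) {
            dirty = true;
        }
        super.setName(name);
    }

    public boolean isDirty() { return dirty; }

    public void markClean() { dirty = false; }
}

// Storage that returns managed objects and skips writes when nothing changed.
class SketchTrackingStorage {
    int writes;

    SubjectBean store(SubjectBean bean) {
        if (bean instanceof SketchStorable) {
            SketchStorable managed = (SketchStorable) bean;
            if (managed.isDirty()) {
                writes++;
                managed.markClean();
            }
            return bean;
        }
        // First contact: copy the data into a managed instance of the same type.
        ManagedSubjectBean copy = new ManagedSubjectBean();
        copy.setName(bean.getName());
        copy.markClean();
        writes++;
        return copy;
    }
}
```

Passing the returned object back to `store` without touching it causes no write, while changing its name first causes exactly one, which mirrors the behaviour described above.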

Limitations

As the MiddleHeaven Storage Toolbox is based on entity objects, there is less room to work with the storage's native data elements themselves. This means you are not supposed to work with tables, columns, rows or XML directly. For that kind of interaction you will use other technologies and toolboxes (some of which are used internally by the implementation).

A second limitation is that, for now, only a domain-driven data storage is provided. Even though the structure does not limit this design, this is the most useful implementation nowadays. Support for an ad hoc data storage could be implemented in the future according to demand.