Low-level Infrastructure Components

While the primary focus of the OERA OSI is service components to implement OERA architectures, any framework also has a need for a library of basic, low level components on which to base other code. Thus, the OERA OSI will include a library of such low level components.

Collection Classes for OO Relationships

In Object-Oriented programming one object interacts with another object via a relationship, i.e., a direct link between the two objects. When this relationship is one to one, then one object will simply have a reference to the other object. But, if the relationship is one to many, then there is a need for an intermediary structure to hold the references for the many end of the relationship. No intermediary structure is shown in a UML diagram, since the notation on the relationship indicates the nature of the relationship, but when implementing the model in OO code, generic infrastructure is required to manage this relationship. These generic infrastructure classes are often called “collection” classes.

The 1 end (“parent”) of the 1 to many relationship has a reference to the collection and the collection contains references to the objects at the many end (“child”) of the relationship. By navigating the methods of the collection class, the parent can access the members of the collection, i.e., can traverse the relationship.

In my 2006 implementation of collection classes ( http://www.oehive.org/CollectionClasses ) I used the Java model for collection and map classes and followed it quite closely, except for the differences which I believed were indicated by the choice of temp-tables as the basis of implementation. The current revision was originally motivated by the desire to add modern error handling facilities and to switch from an internal iterator to a separate iterator, i.e., to enable multiple iterators on the same collection. In the process of considering a later version of the Java libraries and whether or not the functionality of the original collection classes should be extended, several ideas developed for possible alternatives to the use of temp-tables, at least for some types of collections.

Further consideration made us realize that the Java collection hierarchy covers much more than just collections for object relations, e.g., queues, so it was decided to step back from the Java implementation and concentrate first on the needs for defining relations. This, and a general desire to make these core objects as small and light as possible has led to a proposed structure which has little relationship to the Java hierarchy. In particular, the initial release will focus strictly on the needs for implementing relations. The various other possible collection and map type functionality will be considered at a later time.

This documentation is being made available prior to development of code and testing in order to allow for feedback and comment, so that there is opportunity to influence the development.

Core Concepts - Order

The most common form of relationship collection is ordered only by the order in which elements are added to the collection. The parent using the collection may have added the elements in some special order, but most commonly the order is of no significance because any operation performed on the collection will be performed on all members of the collection and it is thus order is of no significance.

It is possible to conceive of a set in which there was no predictable order (called a Bag in Java parlance), but it is unlikely that this provides any better utility than order by addition and so is not included in the current work. In particular, it is unclear how one insures that each element in a Bag is processed once and only once.

A second type of relationship connection is ordered by identity, i.e., by the identifier for the object (in ABL, this is given by int(ObjRef). Like addition order, one will typically build such a collection completely, make one or more iterations through all the elements doing some processing, and then delete the collection with or without persisting. The major difference with a collection ordered by identity over one ordered by addition is that it is faster and easier to identify whether a particular object is part of a collection. This could be important if the parent needs to enforce no duplicates (see Core Concepts – Duplicates).

A third type of relationship collection is ordered by some other value, typically an attribute of the elements. These relationships are fairly uncommon in traditional 3GL OO, but might seem more “natural” to an ABL programmer because of the analog to tables and keys, but in most relationships, the order is not relevant and the use of a key would be unnecessary overhead. In these collections, values are added as key/value pairs instead of just individual object elements, where the key is the basis for ordering and the value is the object element. In Java terminology, these are known as Map classes rather than Collection classes.

In our hierarchy we are providing two forms of the key/value classes, one which is merely ordered by the key, but where duplicate keys are allowed and the other where the collection is ordered by the key and the key is unique and thus provides a unique identifier for a specific object. Java Map classes only provide the later type.

Core Concepts – Duplicates

In any ordinary object relationships, one would never have duplicates because it simply makes no sense to have the same object twice on one end of a relationship. Some Java Collection objects enforce this rule and others don’t.

In designing these collection classes, we decided to omit checking for duplicates for two reasons. First, in any implementation which does not provide direct indexing by object identity, searching every element already in a collection to see if it matches a proposed new element is expensive overhead. Second, in most cases, the very nature of adding elements to a collection will not lead to duplicates, so incurring the overhead of checking for duplicates is wasted effort. Compare, for example, to populating a temp-table with the lines for a particular order based on the persisted data – would you feel the need to include logic to check to see whether an order line was already in the table or would you assume, rightly, that sequential reading of the persisted data will result in no duplicates.

We are electing to provide a Contains() operator on our collection classes which the parent can use to test for duplicates if a situation arises in which it may not be clear if a particular object is already in the collection.

With key-value sets there is some question about what one means by “no duplicates”. In the Java Map classes, no duplicate keys are allowed based on the idea that each key should uniquely identify an object, but there is no restriction on duplicate objects, i.e., the same object can be pointed to by different keys. This makes little sense in a relationship collection. Consequently, we have provided two types of key/value pair objects, one of which limits to unique keys and the other of which doesn’t. The former is like the Java Map class and the later provides for collections ordered by an attribute, but not uniquely identified by them. Depending on implementation, these would or would not support duplicate objects, but there seems little reason to allow duplicates if it is easy to prevent.

Core Concepts – Model Hierarchy

In constructing a model hierarchy for these collection classes there are a couple of issues. One is that we want to have an interface for Iterator, which is used in many places, and then an interface which defines the base for each of the two basic types of relationship collections – those with simple elements and those with key/value pairs. If we had inheritance of interfaces (rumored to be coming in OE11.0) we would have the base interface for the type inherit from Iterator and that base interface could be implemented by any class in the hierarchy.

But, in the absence of interface inheritance in the current language, we have a problem because we can’t use the base type interface as the type of a parameter in place of any of the concrete classes derived from it since it would then be missing the properties of Iterator. Therefore we have adopted the approach of creating an abstract class which implements the two interfaces, but with all abstract methods and then the concrete classes inherit from the abstract class, providing implementations for the methods.

Also, for each base type, there are multiple possible implementations (see Implementation Technologies) where each implementation would have the same overall signature, but might differ in performance, capacity, memory footprint, or whatever (see Performance Testing). Since it seems possible that one implementation will be preferable in some circumstances and another implementation preferable in others, we are electing to use a naming structure which will allow more than one concrete implementation for any abstract base type. E.g., there is an abstract base type called aIDSet, which is the collection class type for elements ordered by identity. If we created two concrete classes based on this abstract class, one using work-tables and one using temp-tables, we would call them WTIDSet and TTIDSet respectively. Both would have exactly the same signature and could be used interchangeably for the other by identifying the reference as aIDSet, but one or the other might be preferable in any given circumstance depending on requirements.

See the UML Diagrams for specifics of the structures of the two hierarchies.

Implementation Technologies

In the 2006 implementation, all collection classes were implemented using temp-tables based on the conclusion that temp-tables were a natural ABL language feature which provided a superset of capabilities needed for collection management. While this approach was effective and simple, since then concern has been raised about performance issues when the number of temp-tables in a session becomes large and about the excessive memory footprint and possibly instantiation penalty when a temp-table is used for a small and simple collection.

Unfortunately, ABL does not provide language constructs at a low enough level to make collections as simple as they are in 3GLs. That will require PSC implementing collections as a native language feature, which does not currently seem to be on the roadmap. Therefore, in this implementation we are exploring two other language constructs to provide support for collections – arrays and work-tables.

Arrays seem like a natural mechanism to support collections except that in ABL they cannot be dynamically resized once they are instantiated. While there special cases where one knows the upper bound maximum size of a collection in advance, the typical case is that one does not. Therefore, we are experimenting with an approach in which an initial array is sized to either a default size or a size specified by the programmer. If that array becomes full, then a second array is instantiated at a size which is the size of the original plus a default or programmer supplied growth factor. This will continue until a maximum of ten arrays is use. Logically they will be treated as one long array with the offset of any given index being computed as needed. While not completely open ended, like a temp-table, this approach can provide a relatively small initial array with a minimal memory footprint and yet can expand to handle an extremely large number of objects, probably more than can reasonably fit in memory.

Work-tables are a deprecated ABL language feature since, for most purposes their functionality is better covered by temp-tables. However, it seems likely that they have less memory overhead for small collections and for any collection which is just going to be accessed serially, their lack of an index is not a flaw. Currently, the documentation does not indicate that Progress..Lang.Object is a valid datatype for a work-table field. In practice, the compiler accepts such a definition, but a program will freeze if one tries to make an assignment to the field. Inquiries are in process about the possibility of getting full work-table support for Progress.Lang.Objects, but in the meantime a hybrid approach will be used, i.e., a work-table which provides the sort order for identity access with a pointer to an array which contains the actual object.

Performance testing will be used to determine if more than one implementation should be provided for each base type and which implementation will be preferred for the base type. Temp-tables are likely to continue to be used for key/value pair types, although this is only a small percentage of collection instances.

Performance Testing

Most relationship collections will contain only a handful of elements, but it is possible for a relationship to contain a large number of elements. Therefore, we are concerned with both the memory footprint of a class when it contains a small number of elements and with the performance of that class when it contains a large number of elements. In particular, the 2006 implementation used temp-tables throughout. This meant that they could be scaled to arbitrarily large numbers of elements and still perform adequately, but it also meant that a rather large amount of memory was consumed for a small collection.

Therefore, in developing these relationship collections classes we are looking at minimal memory footprint, scaling of memory footprint, and performance in a number of tasks. The tasks we have identified are:

Add N objects to collection;
Clear collection of N objects;
Delete M random objects from a collection of N;
Find M random objects from a collection of N by position (applicable only to List?*);
Find M random objects from a collection of N by identity;
Find M random objects from a collection of N by key;
Iterate through M instances in a collection of N starting with first;
Iterate through M instances in a collection of N starting with last;
Delete M consecutive instances from a collection of N starting with I;
Add M consecutive instances to a collection of N starting with I (only List?*)

Tests will be based on varying values of M, N, and I to identify any possible non-linearities. Testing will begin with smaller values and will be discontinued when results are judged unacceptable. Not all tests apply to all relationship base class types, i.e., one can only test access by keys in a base type which has keys.

It should be recognized that some of these tests will represent extreme stress tests compared to what one expects in practice in an application. For example, deleting 1000 random objects from a set of 10,000 would be quite extraordinary, both because few collections are likely to have even as many as 100 elements and because normal processing will be to fill a collection, process all elements one or more times, and then to clear and delete the collection as a whole. I.e., one typically will never delete even one element until such point as one is clearing the whole collection, which may be a far more efficient operation than deleting elements one by one. Nevertheless, relationship navigation is extremely common in OO programming, so identifying possible inefficiency is highly desirable.

* Note: List is a Java collection class type in which elements can be addressed by position, i.e., Nth. Currently, this does not seem to be a requirement for relationship collections and so is not included in the initial implementation, but it does seem like a useful class in other contexts and so is likely to appear in a later round of development.

Properties and Operators

Java collection classes have been provided with a fairly rich set of properties and operators, presumably based on the idea that these classes can then be used for a richer range of rôles, just as there are many classes in that same hierarchy which are clearly not intended for relationships at all. In the present implementation, the decision was made to focus on the essentials of object relationships with the idea of creating efficient, low impact infrastructure for OO programming and then to consider later whether there are other classes with other functions which could also be created. Therefore, this implementation uses minimal properties and operators.

Note that we are using a leading "a" to indicate an abstract class and a leading "i" to indicate an interface.

aSet
This is the abstract class for the most common collection type, that ordered only by the sequence in which elements are added to it. It has two read-only properties – Size (number of elements) and IsEmpty. It has and four operators:

Add – adds the object in its parameter to the collection;
Remove – removes the object in its parameter from the collection;
Clear – removes all elements from the collection;
Contains – returns a logical indicating whether the object parameter is in the collection; and
Get Iterator - returns an Iterator object on the class.
Note that Contains will be inefficient on a large aSet and aIDSet is preferred if this is a regular need.

aIDSet
This is the abstract class for the simple collection type ordered by object identity. It has all of the same properties and operators as aSet plus one additional operator Get which has an integer argument and which returns the object with that identity.

aAttrKeySet
This is the abstract class which takes key/value pairs in which the key is considered an identifier for the object, i.e., keys must be unique. It has the same two properties as aSet. Its operators are:

Add – parameters are a key/value pair which is added to the collection;
GetValue – returns the object corresponding to the key in the parameter;
RemoveKey – removes the key/value entry specified by the key in the parameter;
RemoveValue – removes the key/value entry corresponding to the object in the parameter;
Clear – removes all entries
Contains – returns a logical indicating whether the object parameter is in the collection; and
GetIterator – returns an Iterator on the keys.

aAttrSortSet
This is the abstract class which takes key/value pairs, but the key is considered only an attribute which defines sort order, not a unique key. It has the same properties and operators as aAttrKeySet except that GetValue is replaced by GetValues which return a collection of all objects in the collection with the key in the parameter.

iIterator
This is the interface which defines the signature for all collection iterators. It has two read-only properties – HasNext and HasPrev. Its operators are:

First – positions at the first object in the collection and returns it;
Next – advances to the next object in the collection and returns it;
Last – positions at the last object in the collection and returns it;
Prev – advances in reverse order to the previous object in the collection and returns it; and
Remove – removes the object at the current position from the collection.

Note that we have considered whether the key/value collection classes should have both GetAttrIterator, i.e., on the keys, and GetIDIterator, i.e., on the ID of the objects, as this would be easy to do with a temp-table implementation. Currently, we are leaning toward just GetIterator on the keys as this seems in keeping with the needs of relationships.

UML Diagrams

Following are the UML class diagrams for the proposed implementation.

Note that RelationAttrSetOps will not be included in the initial implementation, but its definition is provided here to indicate where these operations will be implemented when and if they are needed.

iSet Family (click for larger version)

See below for higher resolution PDF.

iAttrSet Family (click for larger version)

See below for higher resolution PDF.

Iterators

One of the factors which motivated this revision of Collection classes was a discussion on PEG which convinced me that I should allow for multiple iterators. In the Java Collection classes, Iterator is a separate interface and the GetIterator() method implements that interface to provide an Iterator for any given collection class. Since Iterator is separate and not limited to one instance, one has the structure to provide multiple simultaneous iterators for the same collection. In starting to work on this revision, we assumed that we would take this approach and you can see this reflected in the previously published material on this project.

However, in approaching actual coding of the project, we realized that there is an inherent problem because of the "intimacy" of iterators with respect to their collections. How can an iterator navigate the implementation structure of the collection without either having parts of that implementation be public, which would violate encapsulation, or having the collection have navigation methods, which would seem to violate normal form since both the collection and the iterator would be exposing very similar operations.

One solution proposed would be to have the navigation methods on the collection not be public to any except the iterator. That would still violate notions of OO purity, though less blatantly, but this is an academic question in the end since ABL has no level of protection which would provide this kind of limited access.

Another proposed solution was to have the navigation methods on the collection reference an "ID", the ID identifying the "current" record for a particular iterator. I.e., GetNext(ID) is GetNext relative to ID. If this ID was the actual object identifier, i.e., int(Obj), then the Iterator could determine the ID for the current element on its own. Otherwise, it would have to obtain this ID from the collection. Unfortunately, with the proposed array implementation for the most common type of collection, having the object ID would require an expensive search of the ID to identify the current record. Only an ID of the current element would be efficient there.

It was then observed that, while multiple iterators might sometimes be required of some types of collections, they seem highly unlikely in a relationship collection. They are unlikely because a relation is a connection between two object types in a parent child type of connection and the normal expectation is that if the parent does some operation on the children, it will do it to each of them, i.e., that one will navigate from the beginning to the end, an most often within the scope of a single method.

This raised the possibility that we would move the iterator back into the collection instead of making it something separate. Having thought of this, it was immediately appealing since it would guarantee that the navigation would be efficient relative to the specific implementation, but that all details of the implementation would be encapsulated within the class. This lead me to think that the Java structure is actually a less than ideal decomposition of the problem space since the need for navigation is inherent in the collection and separating them requires an overly intimate knowledge to be shared.

Having made this observation, I has now occurred to me that it would be possible to support multiple iterators within the collection itself by the use of named or numbered iterators. For example, GetIteratorID() could return a sequentially assigned integer which identifies a particular iterator. A small array in the collection, indexed by this integer, could then contain a value for the current record on that iterator. For an array implementation, it might be the current element number. For a work-table implementation it might be the Object ID to use in a find. For a temp-table implementation it might be the same object ID or a recid. Alternatively for a temp-table, one might have a separate query per iterator, but that seems like it would add weight. While the object ID and recid would require a FIND, this would be a C level operation and only one ABL instruction.

Revision of other parts of this specification is pending confirmation of our adoption of this approach.