Implementation Technologies

In the 2006 implementation, all collection classes were implemented using temp-tables, based on the conclusion that temp-tables are a natural ABL language feature providing a superset of the capabilities needed for collection management. While this approach was effective and simple, concern has since been raised about performance when the number of temp-tables in a session becomes large, and about the excessive memory footprint and possible instantiation penalty when a temp-table is used for a small, simple collection.

Unfortunately, ABL does not provide language constructs at a low enough level to make collections as simple as they are in 3GLs. That would require PSC to implement collections as a native language feature, which does not currently seem to be on the roadmap. Therefore, in this implementation we are exploring two other language constructs to provide support for collections – arrays and work-tables.

Arrays seem like a natural mechanism to support collections, except that in ABL they cannot be dynamically resized once they are instantiated. While there are special cases in which one knows the maximum size of a collection in advance, typically one does not. Therefore, we are experimenting with an approach in which an initial array is sized either to a default or to a size specified by the programmer. If that array becomes full, then a second array is instantiated at a size equal to that of the original multiplied by a default or programmer-supplied growth factor. This continues until a maximum of ten arrays are in use. Logically they are treated as one long array, with the array and offset for any given index computed as needed. While not completely open-ended like a temp-table, this approach can provide a relatively small initial array with a minimal memory footprint and yet can expand to handle an extremely large number of objects, probably more than can reasonably fit in memory.
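As a rough sketch, the index arithmetic this implies might look like the following. The names and the assumption of a doubling growth factor are illustrative only, not the actual implementation:

/* Sketch only: map a 1-based logical index onto the ten-array scheme,
   assuming an initial size (miInitialSize, e.g. 100) and a doubling
   growth factor.  All names here are illustrative. */
METHOD PRIVATE VOID locate
   (INPUT  pIndex    AS INTEGER,    /* 1-based logical index          */
    OUTPUT pArrayNum AS INTEGER,    /* which of the ten arrays        */
    OUTPUT pOffset   AS INTEGER):   /* 1-based offset in that array   */
   DEFINE VARIABLE iSize  AS INTEGER NO-UNDO.  /* size of current array          */
   DEFINE VARIABLE iFirst AS INTEGER NO-UNDO.  /* logical index of its first slot */
   ASSIGN iSize  = miInitialSize
          iFirst = 1.
   DO pArrayNum = 1 TO 10:
      IF pIndex < iFirst + iSize THEN DO:
         pOffset = pIndex - iFirst + 1.
         RETURN.
      END.
      ASSIGN iFirst = iFirst + iSize
             iSize  = iSize * 2.    /* doubling growth factor */
   END.
   pOffset = ?.   /* index beyond the capacity of all ten arrays */
END METHOD.

With an initial size of 100, an index of 100 resolves to array 1, offset 100, while an index of 101 resolves to array 2, offset 1.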

Work-tables are a deprecated ABL language feature since, for most purposes, their functionality is better covered by temp-tables. However, it seems likely that they have less memory overhead for small collections, and for any collection which will only be accessed serially, their lack of an index is not a flaw. Currently, the documentation does not indicate that Progress.Lang.Object is a valid datatype for a work-table field. In practice, the compiler accepts such a definition, but a program will freeze if one tries to make an assignment to the field. Inquiries are in process about the possibility of getting full work-table support for Progress.Lang.Object, but in the meantime a hybrid approach will be used, i.e., a work-table which provides the sort order for identity access, with a pointer into an array which contains the actual objects.
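A minimal sketch of that hybrid might look like the following; all of the names here are illustrative, not taken from the actual implementation:

/* Sketch of the hybrid: the work-table holds only the sort key plus an
   integer "pointer" (array subscript); the objects live in the array.
   Names are illustrative only. */
DEFINE WORK-TABLE wtSort
   FIELD SortKey AS CHARACTER   /* identity / ordering value */
   FIELD Slot    AS INTEGER.    /* subscript into moItem     */

DEFINE VARIABLE moItem AS Progress.Lang.Object EXTENT 100 NO-UNDO.
DEFINE VARIABLE miLast AS INTEGER NO-UNDO INITIAL 0.

/* Adding an element: store the object in the array, record its slot. */
PROCEDURE addItem:
   DEFINE INPUT PARAMETER pKey AS CHARACTER NO-UNDO.
   DEFINE INPUT PARAMETER pObj AS Progress.Lang.Object NO-UNDO.
   miLast = miLast + 1.
   moItem[miLast] = pObj.
   CREATE wtSort.
   ASSIGN wtSort.SortKey = pKey
          wtSort.Slot    = miLast.
END PROCEDURE.

Since a work-table has no index, ordered retrieval would use FOR EACH wtSort BY SortKey and then dereference moItem[wtSort.Slot]; only the CHARACTER key and INTEGER pointer ever pass through the work-table, sidestepping the Progress.Lang.Object field limitation.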

Performance testing will be used to determine whether more than one implementation should be provided for each base type and which implementation should be preferred. Temp-tables are likely to continue to be used for key/value pair types, although these account for only a small percentage of collection instances.


Comments


resize array

"Arrays seem like a natural mechanism to support collections except that in ABL they cannot be dynamically resized once they are instantiated."

How about using something like this:

CLASS ArrayList:
   /* Backing array, the index of the last element, and the allocated size */
   DEFINE PRIVATE VARIABLE mainArray     AS CLASS Progress.Lang.Object EXTENT NO-UNDO.
   DEFINE PRIVATE VARIABLE mainArrayLast AS INTEGER NO-UNDO INITIAL 0.
   DEFINE PRIVATE VARIABLE mainArraySize AS INTEGER NO-UNDO INITIAL 0.
   
   METHOD PUBLIC VOID append(INPUT element AS CLASS Progress.Lang.Object):
      ensureAvailable(1).
      mainArrayLast = mainArrayLast + 1.
      mainArray[mainArrayLast] = element.
   END METHOD.
   
   METHOD PRIVATE VOID ensureCapacity(INPUT capacity AS INTEGER):
      DEFINE VARIABLE i        AS INTEGER NO-UNDO.
      DEFINE VARIABLE newSize  AS INTEGER NO-UNDO.
      DEFINE VARIABLE newArray AS CLASS Progress.Lang.Object EXTENT NO-UNDO.
      IF capacity > mainArraySize THEN DO:
         /* Grow to just over twice the requested capacity */
         newSize = 2 * capacity + 10.
         EXTENT(newArray) = newSize.
         /* Copy the existing elements into the new, larger array */
         DO i = 1 TO mainArrayLast:
            newArray[i] = mainArray[i].
         END.
         /* Reset to indeterminate extent so the assignment can re-fix it */
         EXTENT(mainArray) = ?.
         mainArray = newArray.
         mainArraySize = newSize.
      END.
   END METHOD.
   
   METHOD PUBLIC VOID ensureAvailable(INPUT howMuch AS INTEGER):
      ensureCapacity(mainArrayLast + howMuch).
   END METHOD.
END CLASS.

Each time the capacity is exceeded, the array is copied into a newly allocated array that is just over twice as big. The amortized complexity for appending is still O(1) since the growth is exponential.



One could benchmark this,

One could benchmark this, but it is really just a flavor of the idea I proposed of using N arrays, initializing each new one as needed at a larger size according to a growth factor. My proposal requires a bit more computation to find the entry, but avoids the copy. I came up with that idea because I was concerned about the cost of the copy.


The main advantage of using

The main advantage of using a single array is code simplicity. I doubt the performance difference would be too significant.



Like I say, I considered

Like I say, I considered this approach, but I was concerned about performance when the array size got large because of the time required to do the copy. But, this certainly could be benchmarked, so I will!


EXTENT 999999

I'm still stuck in 10.1C, so I cannot try Object arrays yet.
What is the memory footprint of an empty Object array defined with 999999 extents ?



Arrays

A non-initialized array is just a null pointer. Once an extent has been defined, there are 2 bytes per entry until the entry gets a value. When it gets a value, then the size is whatever the size of the value is. In this case, it is just an object reference, which I am going to assume is something like the size of a regular integer.

Note, however, that our approach is unlikely to ever define an array with extent 999999. For most collections, even having 100 elements is pretty large. If we allocate the first array at 100, then a whole lot of collections will fit in that first array and the others will remain uninitialized. If we reach the 101st entry, then we allocate the second array at 200. Then we are fine until the 301st entry, when we allocate the third array at 400 elements. This continues up to the 10th array, which would be 51200 elements, for a grand total of 102300 possible across all 10. Both the initial array size and the growth factor will be settable, and we will decide on some defaults.
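The capacity arithmetic above can be checked with a few lines of ABL; the variable names are illustrative:

/* Total capacity of the ten-array scheme with an initial size of 100
   and a doubling growth factor: array k holds 100 * 2 ^ (k - 1) slots,
   so the total is 100 * (2 ^ 10 - 1) = 102300. */
DEFINE VARIABLE iSize  AS INTEGER NO-UNDO INITIAL 100.
DEFINE VARIABLE iTotal AS INTEGER NO-UNDO INITIAL 0.
DEFINE VARIABLE i      AS INTEGER NO-UNDO.
DO i = 1 TO 10:
   ASSIGN iTotal = iTotal + iSize   /* 100, 200, 400, ... 51200 */
          iSize  = iSize * 2.
END.
DISPLAY iTotal.   /* 102300 */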

Thus, the idea is to have something where the memory footprint can be very small when the collection is small, but the potential expansion is quite large and "large" is controllable by the programmer. The transition between these should be reasonable steps.

While this is not theoretically open ended like a TT, there are practical limits to the number of objects in memory, regardless of the collection implementation.