Standard libraries of all modern programing languages contains classes like Scala's `HashSet` and `TreeSet`, that implements one of the fundamental mathematical abstraction - a set. These implementations requires explicit enumeration of each element of a set. But when we deal with ordered sets and maps, this approach becomes ineffective or even impossible in case of continuous domains. Here we have to operate with intervals rather than with separate elements.
Some time ago project Ordset has been started to provide a general purpose Scala library, which models ordered sets and maps. The main goal is to get clean and easy to use API similar to math notation keeping under the hood all complexities of lazy partial computations, caching, etc. Among other key features are the following.
- Support of different kinds of domains: discrete / continuous, bounded / unbounded.
- All sets and maps are implemented as immutable persistent data structures.
- Focus on strong typing in both public and internal API.
Now work is in progress, and any contribution or feedback will be highly appreciated.
What is a problem
As ordinary sets, totally ordered sets and maps are widely used mathematical concepts and can be found in various fields. Let's consider, for example, simple work schedule. Dates are elements of ordered domain, and each date can belong to one of the sets: workingDays, weekends or vacations.
Suppose, we store only weekends and vacations intervals. Then one can get working days with the following relation:
We get ordered map if each element of ordered domain is mapped to some value. Generally, in HR systems all attributes of persons or departments are period-dependent. For example, a person can change team through time, so we have ordered map:
Now suppose, we need to count work days of person in "team A":
It turns out, that fast and memory efficient computation of relations like (1) and (2) is not a trivial task. We can make several observations.
- Assume, we store intervals in a tree like structure. Each single operation (union, intersection, etc.) can lead to vast tree rebuilding, which is computational expensive and makes structure sharing inefficient.
- It may be possible, that person has been working in organization for, say, 10 years and joined team only last month. Thus, there is no point to compute relation (1) on full period just to throw away most part of the result at the next step.
Both these items push us to implement lazy operations. This will allow to compute only relevant parts of a set and only when they are really required. At the same time we should avoid long chains of lazy operations due to performance reasons, so we also need to cache intermediate results.
As a result, we will have several implementations for ordered sets: one (or more) with basic data structures like trees and others - with lazy operations like union, intersection, etc. Now it becomes critically important to unify logic for all domain types: bounded discrete, bounded continuous, unbounded discrete, unbounded continuous. Otherwise, we risk to get explosive growth of complexity.
Bounds and intervals
A good starting point is to consider a single bound of continuous ordered set. There are several kinds of them.
So we can write down something like that.
sealed trait Bound[+E] { | |
/** Returns bound element. */ | |
def element: E | |
/** Returns `true` if bound includes `element`. */ | |
def isInclusive: Boolean | |
/** Returns `true` if bound doesn't include `element`. */ | |
def isExclusive: Boolean = !isInclusive | |
} | |
object Bound { | |
/** Upper bound of a set. Corresponds to conditions x < E and x ≤ E */ | |
final case class Upper[+E]( | |
override val element: E, | |
override val isInclusive: Boolean | |
) extends Bound[E] | |
/** Lower bound of a set. Corresponds to conditions x > E and x ≥ E */ | |
final case class Lower[+E]( | |
override val element: E, | |
override val isInclusive: Boolean | |
) extends Bound[E] | |
} |
If we have order for elements, we can build one for bounds:
- Compare two bound elements. If they are equal, then go to step 2, otherwise return result of comparison.
- Assign the number to each bound depending on its kind. Let's call it "offset".
Pic. 5. Bounds offset
Compare offsets and return the result.
For discrete domains we can always convert any exclusive bound into inclusive one:
where N is arbitrary element, for which exists corresponding adjacent element in a given domain.
Thus, for discrete ordered sets exclusive bounds are optional but nevertheless useful. We can also construct order for them:
- If bound is exclusive, convert it into inclusive one.
- Compare two bound elements and return the result.
The fact that one can build order for bounds is extremely important. Everywhere further we can replace elements with bounds and get continuous ordered sets for free. There will be no special cases for each type of domains - just operate with bounds and their order specified by domain. For instance, we can define intervals using bounds as endpoint and then calculate intersection of two intervals comparing this bounds.
And one more note. Each bound, in effect, defines some ordered set. For example, we can treat b1 = x > 0 both as bound at zero element and also as all elements, that satisfy given condition. This set can be inverted, and the resulting one will be bounded by some other bound b2 = x ≤ 0. So we can introduce flip operation, which will be helpful later: b1 = flip(b2) and b2 = flip(b1).
Now let's take a look at intervals. To work with unbounded domains we will adopt following class hierarchy for intervals.
Bounded domains require only two subtypes: `Between` and `Empty`. Since unbounded sets are more general in this sense, we can limit ourselves to considering only them below.
Segment sequence
Cause we want to get immutable persistent structures, trees seem like a good choice to store our sets and maps. But what exactly should we keep in tree nodes? The first thing that comes to mind are intervals, just like on Pic. 3. However, let's remember again set inversion. With this approach each inversion would require to build completely new tree with cost about O(n). Further we will see, that this operation might be as simple as flipping of one bit.
Also let's consider map operator.
On the image above we can see intervals itself and gaps - elements of domain, that are not included in set. How should map operator act on such gaps? Should it just ignore them? But it may be useful to map intervals into gaps and vise versa. To provide this we can choose one special value to represent a gap, so they will become indistinguishable from ordinary intervals.
Now we can think of ordered sets just as mapping to boolean value:
where e - element of ordered domain; "false" value corresponds to elements, which are not included in set.
This fact allows us to introduce common supertype for sets and maps and move into it almost all logic of search, transformation, etc. But this is only the first win. We also got another important property - all intervals now follows each other continuously, hence there is no need to store entire interval in a tree node. It's enough either upper or lower bound; by default we will prefer upper bounds.
Thus, we can store upper bounds in some data structure, that provides effective search (tree, array, etc.). And to retrieve lower bound of interval we just need to find previous bound and flip it.
We'h got some new abstraction, that can represent both ordered sets and maps for all kinds of domains under consideration. Let's call it segment sequence. This term is intended to emphasize new invariants, that must be provided:
- segments follow each other without gaps and intersections;
- adjacent segments have different values.
The last requirement prevents from accumulation of excess bounds. If an operation produces adjacent segments with the same value, bound between them must be dropped so that both segments were merged.
Another distinction of segment from ordinary interval is that it is inextricably related with its sequence. So having a single segment we are able to get previous and next segments, move to the first and last segments of sequence or just get reference to the original sequence.
package core { | |
type OrderedSeq[E] = SegmentSeq[E, Boolean] | |
type OrderedMap[E, V] = SegmentSeq[E, V] | |
} | |
trait SegmentSeq[E, V] { | |
/** Returns first segment of sequence. */ | |
def firstSegment: Segment[E, V] | |
/** Returns last segment of sequence. */ | |
def lastSegment: Segment[E, V] | |
/** Returns segment that contains specified `bound`. */ | |
def getSegmentForBound(bound: Bound[E]): Segment[E, V] | |
/** Applies `mapFunc` to each segment of original sequence and returns new one. */ | |
def flatMap[U](mapFunc: V => SegmentSeq[E, U]): SegmentSeq[E, U] | |
/** Applies `mapFunc` to each segment of original sequence and returns new one. */ | |
def map[U](mapFunc: V => U): SegmentSeq[E, U] | |
/** Applies `zipFunc` to pairs of segments of original sequence and `other`. */ | |
def zip[U, W](other: SegmentSeq[E, U], zipFunc: (V, U) => W): : SegmentSeq[E, W] | |
// ... and many more methods | |
} | |
trait Segment[E, V] { | |
/** Sequence to which segment belongs. */ | |
def sequence: SegmentSeq[E, V] | |
/** Value associated with segment. */ | |
def value: V | |
/** Returns first segment of sequence. */ | |
def moveToFirst: Segment[E, V] | |
/** Returns last segment of sequence. */ | |
def moveToLast: Segment[E, V] | |
// ... and many more methods | |
} |
For now briefly consider a possible implementation of segment sequence using a binary tree. In the next posts we are going to cover this topic in more details. Sequence contains two fields: `root` - the root node of binary tree and `lastValue` - value of last segment, for which we don't store bound in tree. Each node corresponds to some segment. It contains upper bound of segment and its value. Also it may contain up to 2 references to children nodes.
Finally, let's return to our inversion problem. Now we can easily solve it by adding just one flag into constructor. `complementary` parameter allows to create direct and inverted sequence based on the same tree. A simplified code is listed below.
class TreeOrderedSet[E]( | |
val root: Tree[Bound[E], Boolean], | |
val lastValue: Boolean, | |
val complementary: Boolean, // inverts values of segments if `true` | |
) extends SegmentSeq[E, Boolean] { | |
override def getSegmentForBound(bound: Bound[E]): TreeSegment[E, Boolean] = | |
// `root.findNextNode` returns closest node with key not less than `bound`. | |
// Implementation is simplified. We also should consider case, when nothing is found. | |
// This means that last segment (without upper bound)) must be created. | |
new TreeOrderedSet.TreeSegment(root.findNextNode(bound), this) | |
protected def getNodeValue(node: Tree[E, Boolean]): Boolean = | |
complementary ^ node.getValue | |
} | |
object TreeOrderedSet { | |
case class TreeSegment[E]( | |
val node: Tree[Bound[E], Boolean], | |
override val sequence: TreeOrderedSet[E] | |
) extends Segment[E, Boolean] { | |
override def vaue: Boolean = sequence.getNodeValue(node) | |
} | |
} |
Conclusion
We've introduced central concept of the project - segment sequence, but there is still a lot of work ahead. In the next post we will design elementary transformation operations for sequences and will pick a curious data structure to provide their effective implementation. And for while one can take a look at some examples in the project.
Комментариев нет:
Отправить комментарий