The challenge now is how to translate the set of extents into either one of these data structures so that a search on these data structures will give us the answer that we’re looking for: how many extents contain this number?

The problem doesn’t map itself nicely to an existing data structure because:

a. there are a lot of numbers (billions) and a lot of extents (millions) – this is not a show stopper, but it surely makes our life more difficult

a. given a number we need to find the total number of extents containing it, not just one value, so we need some sort of counting

b. the number might not be in the data structure (i.e. is a number contained by the extent but not part of the ‘sides’) – data structures and algorithms usually answer the question: is the element in the list (yes/no) and if it is, give me the element

d. binary trees and hash tables have one key, not 2 keys (the ‘sides’ of the extent)

So let’s see what we can learn from each of these constraints.

Constraint a.

– Since the number of numbers far exceeds the number of extents, perhaps we should try to optimize the search rather than the build up of the data structure to represent the extents. In particular, during the build up of the extents data structure, we should perhaps add some metadata that could help us speed up the search.

Constraint b.

– As we’ve eliminated the linear search, we won’t be able to search each extent for each number, which would be a very bad idea. So we need some meta-data that would count somehow the numbers of extents – maybe this meta-data is built when we construct the extents data structure (as the point above would indicate)

Constraint c.

– If finding an element is not good enough, then we still need be able to walk the structure. And maps/sets and hash tables (in C++) have the lower_bound, upper_bound and equal_range which are ultimate searches. They’re used to figure out where a position of an element would be if you wanted to insert it.

Constraint d.

– We need to flatten the structure, so each extent is represented by one key instead of the two sides

Now looking at the properties of extents, we can determine that any set of overlapping extents can be broken down into another set of non-overlapping extents. For instance, let’s assume we have the following scenario:

[A, B), [C, D) where A < C < B < D (sequence 1)

In other words:

….[A….[C…B)….D)….

This can be broken down into the following set of disjointed extents:

[A, C), [C, B), [B, D) (sequence 2)

Each non-overlapping extent can hold an ‘overlapping count’ which keeps the number of overlapping extents that include the non-overlapping extent. In other words, how many extents from sequence 1 contain each extent in sequence 2.

In our example, each extent in sequence 2 will have the following ‘overlapping count’s:

[A, C) – 1 (contained in [A, B) only)

[C, B) – 2 (contained in both [A, B) and [C, B))

[B, D) – 1 (contained in [C, D) only)

This ‘overlapping count’ happens to be exactly what we want in order to solve the problem. As such, if we have such a data structure, all we have to do is look-up the extent that contains the number and return the ‘overlapping count’. Because the extents are non-overlapping, there will be only one extent containing the number.

Tags: code, Programming

## Leave a Reply