A solution – part I

The key to solve the problem posted yesterday is to determine how to structure the data such that:

a. the memory footprint

b. speed of search

are acceptable to the user.


From a memory footprint, ideally we would like to store all our data in memory, otherwise the performance will be poor. However, we only need to store the extents in memory in a data structure and then read each number from the file and search the extents data structure. The number of extents is limited to 50 millions. So, just to store extents in memory we’ll need:


2 X 4bytes X 50mil = 400 mil bytes ~ 381 MB


But this is just for the actual data, we’ll also need some meta-data. If we want to keep the data in a linked list or a binary tree, then we need 8 more bytes (assuming a 32bit machine). That doubles the amount to 762 MB. While this is huge amounts of memory to be used by a single process, it is still workable, as most machines have 1-4 GB of RAM these days anyway. Let’s work with the assumption that we can store all the extents in memory.


For the speed of search, as we have a big amount of data to process, it all boils down to the complexity of the algorithm used for the search. Mainly this complexity can be linear, logarithmic or constant (amortized). From the start we eliminate the linear complexity as it is likely to perform extremely poor on large sets of data. That was quickly confirmed during some initial tests where I had to stop the application after a half an hour as I was getting bored.

Eliminating the linear algorithms means that we automatically eliminate the vector or linked lists, as searching through these data structures has linear complexity.

And that leaves us with logarithmic or constant (amortized) complexity algorithms and the data structures that come with that: balanced binary trees and hash tables respectively. We would prefer hash tables if possible as they perform better than balanced binary trees, but binary trees still offer acceptable performance.


The challenge now is how to translate the set of extents into either one of these data structures so that a search on these data structures will give us the answer that we’re looking for: how many extents contain this number?


Tags: ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: