The key to solve the problem posted yesterday is to determine how to structure the data such that:

a. the memory footprint

b. speed of search

are acceptable to the user.

From a memory footprint, ideally we would like to store all our data in memory, otherwise the performance will be poor. However, we only need to store the extents in memory in a data structure and then read each number from the file and search the extents data structure. The number of extents is limited to 50 millions. So, just to store extents in memory we’ll need:

2 X 4bytes X 50mil = 400 mil bytes ~ 381 MB

But this is just for the actual data, we’ll also need some meta-data. If we want to keep the data in a linked list or a binary tree, then we need 8 more bytes (assuming a 32bit machine). That doubles the amount to 762 MB. While this is huge amounts of memory to be used by a single process, it is still workable, as most machines have 1-4 GB of RAM these days anyway. Let’s work with the assumption that we can store all the extents in memory.

For the speed of search, as we have a big amount of data to process, it all boils down to the complexity of the algorithm used for the search. Mainly this complexity can be linear, logarithmic or constant (amortized). From the start we eliminate the linear complexity as it is likely to perform extremely poor on large sets of data. That was quickly confirmed during some initial tests where I had to stop the application after a half an hour as I was getting bored.

Eliminating the linear algorithms means that we automatically eliminate the vector or linked lists, as searching through these data structures has linear complexity.

And that leaves us with logarithmic or constant (amortized) complexity algorithms and the data structures that come with that: balanced binary trees and hash tables respectively. We would prefer hash tables if possible as they perform better than balanced binary trees, but binary trees still offer acceptable performance.

The challenge now is how to translate the set of extents into either one of these data structures so that a search on these data structures will give us the answer that we’re looking for: how many extents contain this number?

### Like this:

Like Loading...

*Related*

Tags: code, Programming

This entry was posted on July 21, 2008 at 12:35 pm and is filed under Programming, Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.

## Leave a Reply