Archive for July, 2008

C++ Smart pointers

July 30, 2008

What are pointers? A way of indirectly referring to a memory location, together with a hint about the type, which the compiler uses to restrict the operations allowed on that memory location. They can point to compiler-controlled memory (e.g. the stack) or programmer-controlled memory (dynamically allocated on the heap/free store).
The main benefit of dynamically allocated memory is that it lets programmers control the scope (lifetime) of a given memory location, be it an object, a POD type or raw memory. In contrast, the scope of automatic, stack-based objects is controlled by the compiler and is very clear (basically the innermost {} block).
With the benefit comes the main drawback: it is the programmer’s responsibility to allocate and de-allocate. And since that’s not always trivial, in C++ (which doesn’t have garbage collection) smart people have invented smart pointers.

A smart pointer is composed of a dynamically allocated part (the pointed-to memory) and a part that lives on the stack (the smart pointer object itself), which provides the ‘smartness’. When the stack part goes out of scope (controlled by the compiler), it drags the dynamic part out of scope with it. Its purpose is to avoid leaking the memory pointed to by the pointer in the ubiquitous scenario where the programmer doesn’t cover all the corner cases (and the corner cases increased dramatically with the introduction of exceptions in C++).

A C++ smart pointer looks something like:

template <typename T>
class smartPtr
{
public:
   explicit smartPtr(T* ptr)
     : ptr_(ptr)              // store ptr and take ownership of the pointed-to memory/object
   {
   }
   ~smartPtr()
   {
     delete ptr_;             // deallocate ptr (calls the destructor of the object)
   }
   T& operator*() const  { return *ptr_; }
   T* operator->() const { return ptr_; }
private:
   T* ptr_;
};

Taking ownership of the allocated memory is how the smart pointer operates. However, there is a problem: what if two smart pointers try to take ownership of the same allocated memory? That’s bad and it should not be allowed. There are two main cases where this scenario can happen:

1. The programmer explicitly creates two smart pointers and initializes them with the same pointer
This is obviously a programming error and there is not much we can do about it except educate programmers and fire the ones who fail to learn.

2. The compiler creates a copy of the smartPtr (via the copy constructor or operator=)
The compiler creates such copies when the programmer explicitly invokes these two operations, or in cases where it needs to create implicit copies (returning by value, passing parameters by value).
In this case, the smart pointer implementer CAN do something about it. There are three main things they can do:

a. Disallow copy constructor/operator= for the smart pointers
This is the safest option. However, it’s also the most limiting and users will soon shun the implementation because of that.
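As a sketch, the usual C++03 way to express this is to declare the copy constructor and operator= private and leave them undefined, so any attempt to copy fails to compile (boost::scoped_ptr takes this approach; the class below is just an illustration):

class nonCopyablePtr
{
public:
   explicit nonCopyablePtr(int* ptr) : ptr_(ptr) {}
   ~nonCopyablePtr() { delete ptr_; }
   int& operator*() const { return *ptr_; }
private:
   nonCopyablePtr(const nonCopyablePtr&);             // declared, never defined
   nonCopyablePtr& operator=(const nonCopyablePtr&);  // declared, never defined
   int* ptr_;
};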

b. Transfer ownership to the new smart pointer from the old
As such, even if you have multiple copies, only one actually has the ownership. This works nicely when returning values from functions, as you want to transfer the ownership in that case anyway. It doesn’t work well when passing parameters to functions, as the programmer needs to remember that after passing the smart pointer by value, she can’t do anything else with it. It’s also impossible to use in STL containers, as STL is allowed to make copies of the objects it holds in containers, and such a smart pointer will lose the ownership of the pointed-to memory in those operations.

An example of this approach is std::auto_ptr, part of the standard library.
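As a quick sketch of the ownership transfer (Widget here is just a made-up class):

#include <memory>

class Widget {};

int main()
{
   std::auto_ptr<Widget> p1(new Widget);
   std::auto_ptr<Widget> p2 = p1;   // ownership transferred: p1 now holds NULL
   // dereferencing p1 here would be an error; the Widget is deleted when p2 goes out of scope
   return 0;
}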

c. Maintain a reference count
This strategy is much better: the smart pointer tracks how many pointers point to the allocated memory/object and only deallocates it when the last smart pointer goes out of scope. The only problem with this approach is when you end up with circular dependencies between reference-counted objects: when such a cycle exists, none of the memory allocated within it is ever de-allocated. And these circular dependencies are not uncommon.

An example of this approach is boost::shared_ptr, part of Boost.
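A minimal sketch of the reference counting in action (again with a made-up Widget class):

#include <boost/shared_ptr.hpp>

class Widget {};

int main()
{
   boost::shared_ptr<Widget> p1(new Widget);   // reference count is 1
   {
      boost::shared_ptr<Widget> p2 = p1;       // reference count is 2
   }                                           // p2 destroyed, count drops back to 1
   return 0;                                   // p1 destroyed, count reaches 0, Widget deleted
}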

Desperation, thy name … Vista

July 29, 2008

I mean Mojave. I mean Vista. I really mean Vista.
As someone already mentioned, the experiment uses Adobe Flash and not Silverlight. Or could you be convinced it’s really Silverlight too?

A solution – part III (final)

July 22, 2008

So adding everything up…

We’re going to read the extents one by one into memory. As we read an extent, we’ll break it up into a set of non-overlapping extents, each holding a counter of how many overlapping extents contain it.
Having non-overlapping extents means that we are, in fact, ordering the extents. In the previous example we have A < C < B < D. This ordering allows us to use data structures that order their elements, a binary tree for instance. Each ‘side’ of an extent becomes a node in the binary tree, and each node can store the number of extents that include the range either to its right or to its left. To be able to use lower_bound, we’ll store the number of extents that include the range to the left of the node value.

As such, in our example, we’ll have a binary tree like this:

   C
 /   \
A     B
        \
         D

And the values they store are:
node_value(A) = 0 (there are no extents to the left of A, in other words in (-infinity, A))
node_value(C) = 1 (there is only one extent overlapping the [A, C) range)
node_value(B) = 2 (there are 2 extents overlapping the [C, B) range)
node_value(D) = 1 (…)

So, for our extents, we will build a data structure like this. Now let’s assume we have a number x that we want to check against our extents, and that

A < C < x < B < D

The result in this case should be 2. Remember that our overlapping extents are [A, B) and [C, D) where A < C < B < D.
That means that x belongs in both [A, B) and [C, D) extent.
Let’s see how to get this result from our data structure by calling lower_bound(x) on it. From MSDN: lower_bound “Returns an iterator to the first element in a map with a key value that is equal to or greater than that of a specified key.”
So, calling lower_bound(x) will return an iterator pointing to B. If we get the value associated with node B, that is node_value(B), we get our result: 2.

I leave it as an exercise for you to write the code in your favourite language.
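That said, here is a minimal sketch of just the lookup step in C++, using std::map as the balanced binary tree (ExtentTree and overlapCount are names I made up; building the tree, and deciding what should happen when x falls exactly on a node value, remain part of the exercise):

#include <map>

// Keys are the extent 'sides'; each key stores the number of original
// extents overlapping the non-overlapping range immediately to its left.
typedef std::map<unsigned int, unsigned int> ExtentTree;

unsigned int overlapCount(const ExtentTree& tree, unsigned int x)
{
   // First key >= x; its value counts the extents overlapping the
   // range that contains x.
   ExtentTree::const_iterator it = tree.lower_bound(x);
   if (it == tree.end())
      return 0;        // x lies to the right of every extent
   return it->second;
}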

The lower_bound algorithm has logarithmic complexity, which is within our constraints. It might be possible to improve on this by using hash tables, which have amortized constant complexity. However, there’s one problem: hash tables are not ordered the way binary trees are, so it is not possible to get from a node B to its immediate neighbours C and D, for example. The solution presented here relies on that ordering, because whenever a new extent is inserted into the binary tree, it needs to update the ‘overlapping counters’ of all the nodes within the extent.

If you find a better solution, please let me know.

A solution – part II

July 21, 2008

The challenge now is how to translate the set of extents into either one of these data structures so that a search on these data structures will give us the answer that we’re looking for: how many extents contain this number?

The problem doesn’t map itself nicely to an existing data structure because:

a. there are a lot of numbers (billions) and a lot of extents (millions) – this is not a show stopper, but it surely makes our life more difficult

b. given a number, we need to find the total number of extents containing it, not just one value, so we need some sort of counting

c. the number might not be in the data structure (i.e. it is a number contained by an extent but not one of the ‘sides’) – data structures and algorithms usually answer the question: is the element in the list (yes/no) and, if it is, give me the element

d. binary trees and hash tables have one key, not 2 keys (the ‘sides’ of the extent)

So let’s see what we can learn from each of these constraints.

Constraint a.

– Since the number of numbers far exceeds the number of extents, perhaps we should try to optimize the search rather than the build up of the data structure to represent the extents. In particular, during the build up of the extents data structure, we should perhaps add some metadata that could help us speed up the search.

Constraint b.

– Having eliminated linear search, we can’t simply check every extent for every number (which would be a very bad idea anyway). So we need some meta-data that counts the number of extents somehow – perhaps this meta-data is built while we construct the extents data structure (as the point above suggests)

Constraint c.

– If finding an element is not good enough, then we still need to be able to walk the structure. Maps and sets in C++ provide lower_bound, upper_bound and equal_range, which go beyond a simple membership test: they tell you where an element would be positioned if you wanted to insert it.

Constraint d.

– We need to flatten the structure, so each extent is represented by one key instead of the two sides

Now looking at the properties of extents, we can determine that any set of overlapping extents can be broken down into another set of non-overlapping extents. For instance, let’s assume we have the following scenario:

[A, B), [C, D) where A < C < B < D (sequence 1)

In other words:

….[A….[C…B)….D)….

This can be broken down into the following set of disjoint extents:

[A, C), [C, B), [B, D) (sequence 2)

Each non-overlapping extent can hold an ‘overlapping count’ which keeps the number of overlapping extents that include the non-overlapping extent. In other words, how many extents from sequence 1 contain each extent in sequence 2.

In our example, each extent in sequence 2 will have the following ‘overlapping count’s:

[A, C) – 1 (contained in [A, B) only)

[C, B) – 2 (contained in both [A, B) and [C, D))

[B, D) – 1 (contained in [C, D) only)

This ‘overlapping count’ happens to be exactly what we want in order to solve the problem. As such, if we have such a data structure, all we have to do is look-up the extent that contains the number and return the ‘overlapping count’. Because the extents are non-overlapping, there will be only one extent containing the number.
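For concreteness, here is one possible way to compute these ‘overlapping counts’ in C++ (a sketch that builds the whole structure in one sweep over the sorted boundaries, assuming half-open extents; buildCounts and ExtentTree are made-up names):

#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Each boundary key stores the 'overlapping count' of the
// non-overlapping range immediately to its left.
typedef std::map<unsigned int, unsigned int> ExtentTree;

ExtentTree buildCounts(const std::vector<std::pair<unsigned int, unsigned int> >& extents)
{
   // Record +1 where an extent [a, b) starts and -1 where it ends;
   // a running sum over the sorted boundaries is the overlap count.
   std::map<unsigned int, int> delta;
   for (std::size_t i = 0; i < extents.size(); ++i)
   {
      ++delta[extents[i].first];
      --delta[extents[i].second];
   }

   ExtentTree counts;
   int running = 0;
   for (std::map<unsigned int, int>::const_iterator it = delta.begin();
        it != delta.end(); ++it)
   {
      counts[it->first] = running;   // count for the range to the LEFT of this boundary
      running += it->second;
   }
   return counts;
}

For the two extents above this produces exactly the node values from the tree example: 0 for A, 1 for C, 2 for B and 1 for D.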

A solution – part I

July 21, 2008

The key to solving the problem posted yesterday is to determine how to structure the data so that:

a. the memory footprint

b. speed of search

are acceptable to the user.


From a memory footprint point of view, ideally we would like to store all our data in memory, otherwise performance will be poor. However, we only need to keep the extents in memory in a data structure; we can then read each number from the file and search that structure. The number of extents is limited to 50 million. So, just to store the extents in memory we’ll need:


2 × 4 bytes × 50 million = 400 million bytes ≈ 381 MB


But this is just for the actual data; we’ll also need some meta-data. If we want to keep the data in a linked list or a binary tree, then we need 8 more bytes per extent (two pointers, assuming a 32-bit machine). That doubles the amount to 762 MB. While this is a huge amount of memory for a single process, it is still workable, as most machines have 1-4 GB of RAM these days anyway. Let’s work with the assumption that we can store all the extents in memory.


As for the speed of search, since we have a large amount of data to process, it all boils down to the complexity of the search algorithm. This complexity can essentially be linear, logarithmic or constant (amortized). From the start we eliminate linear complexity, as it is likely to perform extremely poorly on large data sets. That was quickly confirmed during some initial tests, where I had to stop the application after half an hour because I was getting bored.

Eliminating linear algorithms means we automatically eliminate vectors and linked lists, as searching through these data structures has linear complexity.

And that leaves us with logarithmic or constant (amortized) complexity algorithms and the data structures that come with them: balanced binary trees and hash tables, respectively. We would prefer hash tables if possible, as they perform better than balanced binary trees, but binary trees still offer acceptable performance.


The challenge now is how to translate the set of extents into either one of these data structures so that a search on these data structures will give us the answer that we’re looking for: how many extents contain this number?

Problem

July 20, 2008

A while ago, in an interview, I was given this problem:

Write a program that runs on a normal desktop PC and reads closed
extents from “extents.txt” and points from “numbers.txt”, and writes, to
standard output, the number of extents that contain each point, in the
order the points are given, one result per line. A closed extent [A,B]
contains a point (P) iff A <= P <= B.

“extents.txt” contains extents, one per line, with each line in the
format A <whitespace> B. The file contains no more than 50 million
extents. “numbers.txt” contains a list of points, one per line. The file
may contain billions of points. Both files may contain duplicates, and
may or may not end with a <newline>. You may assume that the input is
valid, and that all numbers can be represented by an unsigned integer.

We are looking for EFFICIENCY, SIMPLICITY and COMPACTNESS.

Example 1:

extents.txt
0 10
100 131
132 133
11 32

numbers.txt
1
2
4
12
53
23
53
12
32

output
1
1
1
1
0
1
0
1
1

Example 2:

extents.txt
0 10
0 10

numbers.txt
3
3

output
2
2

Agility

July 17, 2008

One thing I like about agile software development is the daily scrum, the daily meeting.
It works well when:
a. The meeting is very short … 15-30 minutes, and you discuss what you did the previous day, what the plan is for today, and any roadblocks.
b. No project managers are around, or if they are, they’re not allowed to talk … maybe just keep the minutes 😉
c. The longer debates are taken offline (however, for some strange reason, too many discussions taken offline are buried until the next meeting)
d. You have a moderator that enforces these rules and the agenda

It doesn’t work well when:
(when the conditions above are not satisfied, plus)
a. Discussion goes off-topic in any direction, or the meeting is hijacked by people with different agendas – any other meetings should be separate and have a clear agenda
b. People are not punctual for the meeting, so you end up waiting in the meeting room for people to arrive (most scrums however take place around desks)
c. You have excessively verbose people

Here are few reasons why you should have a daily scrum:
a. Gives everyone (and especially the team leader) a very good picture of where the development is, any roadblocks, any inefficiencies, etc., so the team can react more quickly and minimize downtime. Many teams have members who are not so proactive, so daily scrums are especially useful for getting information out of these people.

b. Motivates people to keep up the good work. People’s productivity is never constant; there are ups and downs, and the same goes for morale and motivation. As such, some people occasionally become less proactive and procrastinate more. Daily scrums force this into the open, so procrastination becomes obvious to the rest of the team and, as a consequence, people procrastinate less on average. The scrums act as a form of stick, for which you sometimes need a bad cop in the group.
Daily scrums, especially if they happen in the morning, can also kick-start the day.

c. Makes the team more cohesive: people tend to communicate more and communication becomes easier among team members. It’s a form of team building. As such, the atmosphere should be friendly and open during these meetings; however, care should be taken not to contradict point b (you still need to motivate, and a bit of pressure is good) and not to digress into lengthy off-topic conversations.

It’s imperative, however, not to limit communication within the team to these meetings; in particular, issues shouldn’t be postponed until the next meeting but raised as they happen. Communication within the team should be encouraged, and people should be encouraged to make themselves accessible to other team members.

Short selling clamp down

July 16, 2008

After the FSA in the UK moved to clamp down on short selling during rights issues following the HBOS drama, the US is looking to follow suit by introducing measures to prevent some of the short selling that could lead to the collapse of more financial institutions.

Short selling is the practice of borrowing equity and selling it, in the hope that it will drop in value so that buying it back later will yield a profit for the short seller. It’s the opposite of going long, where a buyer buys the stock in the hope that it will appreciate over time. Most short selling is done on margin, meaning that the short seller only pays a margin of X% (e.g. 10%) without having to finance the whole purchase.

Short selling is not only a legal practice, it benefits the economy as a whole, as it makes prices more informative. The idea is that value traders who recognize the fundamental value of a stock will short it if they see it appreciate above that fundamental value, thus pushing the price back towards it. So short selling makes prices more informative by making prices converge from values greater than the fundamental value down to the fundamental value of the stock.
The long buy is supposed to do the opposite when the price is below the fundamental value. Then, value traders and arbitrageurs will buy the stock, thus pushing the price towards its fundamental value.
In the end, you always know how much a company is worth by looking at its stock price. At least in theory…

The problem is that the fundamental value is not commonly known; it is calculated by different analysts based on all the available public information. Surprisingly, given the same information, analysts often reach different conclusions.
The problem is aggravated in periods of market turmoil and economic change, such as the ones we’re experiencing at the moment. In these periods, valuations are extremely difficult, as many valuation models need to be recalibrated to the overall economic conditions. In other words, it’s very difficult to get the fundamental values right.
And even if you get the values right, value traders might not be able or willing to finance their positions against the market. Especially now: the credit crunch has made credit more expensive and harder to get. The sentiment has also changed dramatically to doom and gloom, and in such a climate people tend to be overly pessimistic, to overreact, and some of them stay out of the market. And while stocks are being oversold, it’s very costly and difficult to stand against the market. So you end up with a situation of partial market failure.

In such conditions, the liquidity of the market dries up and it’s much easier for speculators to manipulate the prices. The vast majority of hedge funds are shorting the stocks at the moment. As such, the prices are particularly depressed and the value traders sit and watch before acting, waiting for a clearer picture of what’s happening to the fundamentals.
In such market conditions, you need regulation to prevent short sellers from amplifying the panic and ultimately leading to the collapse of financial institutions which in turn will affect the economy as a whole.

As such, it’s a good idea to impose restrictions on short selling in the short run, but these restrictions will have to be removed once the market starts functioning properly again, and I think the FSA, the SEC and the Fed will do the right thing. But the mere fact that they have to resort to such measures confirms once again that we’re in a bear market that’s here to stay…

Unilateral and bilateral searching

July 16, 2008

Imagine you want to buy a new web camera. You go onto google, type ‘web camera’, you click the first link that takes you to website1 and you look at the price. Let’s assume you know exactly what you want, so you only check the price of the model that you want. You face a choice: buy now or keep searching. You choose to write down the price and return to the google result listing. You click on the second link that takes you to website2 and you repeat the process. You compare the lowest price of the two and again you have a choice: Buy the best price or keep searching.

In theory you could repeat the process ad infinitum, or until you exhaust the whole google/yahoo/msn etc. databases and you go through all the providers. Then you can continue the search by visiting the high street shops in your city and compare the prices in the same fashion. In the end you will find your best buy.

The main problem with the process above is that it is time consuming, and that means costly. You can estimate the cost of the time you spend on this by putting a value on it: if you earn X$ an hour, you can easily calculate how much the search has cost you. At any point, you can also calculate an average cost per new search by looking at how costly previous searches were. You might discover that the longer you search, the more expensive each search becomes, since search engine results contain more and more false positives the further you navigate away from the first page.

So, at any point, you can weigh the choice between buying now and continuing to search: the cost of the next search, the cost of your time, and the probability of finding a better price. You can stop when the cost of the next search is greater than its expected benefit, that is, the probability that the new search improves on the best price you have already discovered times the saving you would get if it does.
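In rough code terms, the stopping rule sketched above might look like this (all names are made up; this is just an illustration, not something from an economics textbook):

// Keep searching only while the expected saving from one more search
// exceeds what that search costs you in time.
bool worthAnotherSearch(double costOfNextSearch,       // the value of the time it takes
                        double chanceOfBetterPrice,    // probability the next result beats the best so far
                        double expectedSavingIfBetter) // average improvement if it does
{
   return chanceOfBetterPrice * expectedSavingIfBetter > costOfNextSearch;
}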

That’s a unilateral search. It is characterized by a dynamic side, you, searching for the web camera, and a static side, the web camera offers, which don’t change (you can always return to a previous result).

Bilateral searching is different from unilateral searching in that both sides are dynamic and searching.

Imagine you’re looking to buy a house. You can repeat the process described in the unilateral search, but as you keep searching, the other party keeps searching too. So in the end, it might not always be possible to return to your previous best buy. Your previous best buy might decide to sell the house to someone else who offers a better price, while you keep searching.

As such, the calculation of cost for a bilateral search is more complicated: it needs to take into account the risk of losing the current best buy while you keep searching for a better price. That usually means you would stop sooner in a bilateral search than in a unilateral one. Alternatively, if you don’t, you may take much longer to trade, as you lose some opportunities and have to restart your search.

This risk of losing the current offer also forces us to settle for good buys rather than best buys.

On the housing market this is evident: buyers are quicker to put in offers in a sellers’ market than in a buyers’ market. In a buyers’ market they take more time to search, as they know the risk of the ‘best buy’ going away is diminished.

The bilateral search applies to many searches in real life: house purchases, renting, buying unique goods (e.g. art, second hand cars, etc.), financial trading of securities, and many others. Interestingly, this applies in a more general context and it affects searching for a partner for example…

July 15, 2008

I have just finished reading Harris’ “Trading and Exchanges”, after more than 4 months of carrying it back and forth on my way to work. It’s a very accessible read, written in an academic style, so in the end I was studying it rather than reading it. That explains why it took me 4 months to get through it. It’s a great book about trading, exchanges and regulation, and a very good introduction for everyone working in the financial sector. Highly recommended.
Some knowledge of how markets work is required, but the book talks about markets in general without focusing specifically on equity or bond markets, while citing examples from different markets. Next, I’m reading about the life and career of Bill Gross, the Warren Buffett of fixed income and PIMCO’s chieftain.