Cycling through the Integer range – A Fermi problem
May 24th
After graduating in economics in the summer of 2005, I interviewed for business analyst positions at a couple of business consulting firms (e.g. McKinsey & Company).
Since real-life business dilemmas require estimating and making decisions under uncertainty (not all of the required information is available, nor is it accurate), a major part of the interview at this type of firm confronts you with problems of the “How many pay phones are there on the island of Manhattan?” type, also known as Fermi problems.
Although these problems seem quite puzzling at first, as long as you remain focused and methodical and apply a modest amount of common sense, it all gets pretty easy. The “trick” is to combine basic facts you already know with some fourth-grade algebra; doing this brings you to good-enough estimates.
Allow me to introduce a quick CS Fermi problem that someone threw around in the office. The problem might also be presented during an interview with fresh-graduate candidates. Here goes:
Running on your average home computer (a single 2GHz core), how long would it take this Java program to complete its operation?
long startTime = System.currentTimeMillis();
for (int i=Integer.MIN_VALUE; i<Integer.MAX_VALUE; i++) {
};
System.out.println(System.currentTimeMillis()-startTime);
How long? Two nanoseconds? Three seconds? Four hours? Five years? Six centuries? Seven millennia? What’s important here is the order of magnitude, not the exact answer. You might find this question trivial, but you would be surprised at how many people have no clue how to start answering it. Take thirty seconds and try to come up with your own estimate before reading through mine:
Let’s compute a ballpark figure:
Since an int is a 32-bit creature, the loop will cycle about 2^32 times (roughly 4.3 billion; remember that a billion is 10^9). The average home computer CPU of 2006 runs at about 2GHz, which means it can perform about two billion simple instructions per second (complex instructions consume several CPU cycles).
The loop does three obvious operations on each cycle: (1) i is incremented, (2) the values of i and the max Integer constant are compared, and (3) we jump back to the beginning of the loop.
All are fairly simple instructions (you don’t have to be an assembly programmer to know that), so it’s safe to assume that each of them executes within a single CPU cycle.
BTW: instructions 2 and 3 can be combined into a single instruction (jump if less than).
Had the loop been coded in assembly language, my guess is that it would take about 4 seconds to complete: (2 instructions) * (4*10^9 loop cycles) / (2*10^9 instructions/sec) = 4 seconds. Thus, we have just found the lower bound for our answer: the Java code couldn’t execute in under 4 seconds.
My guesstimate would be somewhere between 4 and 40 seconds.
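The arithmetic above can be replayed in a few lines of Java. This is only a sketch of the estimate, not a measurement: the class and method names are made up for illustration, and the "20 instructions per iteration" figure is just an assumed stand-in for a tenfold JVM overhead.

```java
public class LoopEstimate {
    // Estimated run time in seconds:
    // iterations * instructions per iteration / instructions per second.
    static double estimateSeconds(double iterations,
                                  double instructionsPerIteration,
                                  double instructionsPerSecond) {
        return iterations * instructionsPerIteration / instructionsPerSecond;
    }

    public static void main(String[] args) {
        double iterations = Math.pow(2, 32); // about 4.3 billion loop cycles
        double clockHz = 2e9;                // assumption: one simple instruction per cycle at 2GHz

        // Lower bound: increment plus a combined compare-and-jump.
        System.out.println("lower bound (s): " + estimateSeconds(iterations, 2, clockHz));
        // Assumed tenfold overhead, giving the upper end of the guess.
        System.out.println("upper guess (s): " + estimateSeconds(iterations, 20, clockHz));
    }
}
```

Changing the three inputs is a quick way to see how sensitive the order of magnitude is to each assumption.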
Other possible influencing factors:
(*) We know that Java is not as efficient as machine language and adds some overhead to our code. Depending on the JVM implementation in use, the method might be compiled to native machine code instead of executing in interpreted mode. This would, of course, improve performance.
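(*) I initially recalled that the JLS obliges the JVM to check for overflow while incrementing integers. It does not: per the JLS, the integer operators do not indicate overflow in any way and simply wrap around, so no extra operations are added per loop cycle on this account.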
(*) Since our int isn’t volatile (a local variable can’t be volatile anyway), its value would probably be cached in one of the CPU’s registers throughout the loop’s execution. Had it been declared volatile, the JVM would have been forced to read and write its value to the machine’s main memory on each operation involving the variable. Since memory CAS latency is also measured in nanoseconds, this should, theoretically, add a fixed cost to each loop cycle (~10-100 nanos), possibly increasing the estimate by an order of magnitude.
(*) Running on a multi-core chip should have no direct positive effect, as this is a single-threaded program.
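On the overflow question raised in the list above: as far as I know, Java int arithmetic wraps around silently rather than being checked, and checked arithmetic is opt-in through Math.addExact. A minimal sketch to verify this (the class and method names are made up for illustration):

```java
public class IntWraparound {
    // Plain int increment: wraps around silently on overflow.
    static int next(int x) {
        return x + 1;
    }

    // Opt-in checked increment: Math.addExact throws ArithmeticException on overflow.
    static boolean checkedIncrementOverflows(int x) {
        try {
            Math.addExact(x, 1);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // Counts iterations of a loop shaped like the one in the post, over a small range.
    static long countIterations(int from, int to) {
        long count = 0;
        for (int i = from; i < to; i++) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(next(Integer.MAX_VALUE) == Integer.MIN_VALUE);
        System.out.println(checkedIncrementOverflows(Integer.MAX_VALUE));
        System.out.println(countIterations(-5, 5));
    }
}
```

The wraparound is also why the loop condition matters: with `i < Integer.MAX_VALUE` the loop ends, whereas `i <= Integer.MAX_VALUE` would always be true for an int and the loop would never terminate.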
Here is the relevant disassembled Bytecode:
4: ldc #3; //int -2147483648
6: istore_3
7: iload_3
8: ldc #4; //int 2147483647
10: if_icmpge 19
13: iinc 3, 1
16: goto 7
Actual results:
(1) On my IBM T41 ThinkPad it took 80 seconds to complete.
(2) On my workstation at home, equipped with a 1.8GHz Intel Core 2 6300 CPU, it took only 9 seconds to complete.
I can’t explain such a discrepancy yet, so I’ll have to check further and update with new information. Try it yourselves!
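For anyone who wants to try it, here is a self-contained version of the snippet wrapped in a class. It is a sketch under a few assumptions: the class and method names are made up, System.nanoTime is swapped in for currentTimeMillis, and the warm-up call only gives the JIT a chance to compile the method, it guarantees nothing. A JIT compiler may also decide that the empty loop does no work and remove it, so numbers can differ wildly between JVMs and settings.

```java
public class LoopTimer {
    // Runs the empty loop from the post over [from, to) and returns elapsed milliseconds.
    static long timeLoopMillis(int from, int to) {
        long start = System.nanoTime();
        for (int i = from; i < to; i++) {
            // intentionally empty, as in the original snippet
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Short warm-up pass over a small range (an assumption, not a guarantee of JIT compilation).
        timeLoopMillis(0, 1_000_000);
        // The full range used in the post.
        System.out.println(timeLoopMillis(Integer.MIN_VALUE, Integer.MAX_VALUE) + " ms");
    }
}
```

Running it with the JVM's interpreter-only flag (-Xint on HotSpot) next to a default run is one way to see how much the execution mode alone changes the result.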
How does hardware evolution affect programming language design?
Mar 30th
I’ve recently watched the interesting webcast Programming Language Design and Analysis Motivated by Hardware Evolution by Professor Alan Mycroft (the webcast’s link is accessible only from within the IBM intranet). Below are a few key notes I took.
Not everything is kept linear
As chip designers continue to scale down chips and transistors, they begin hitting design walls. Some of these walls stem from the fact that as the transistors’ physical size is scaled down, other properties of the chip do not scale linearly with it. The simplest example is dimensions; consider length vs. surface area: reducing a square’s side to 50% of its original size reduces the square’s surface area by 75%, not a linear change. Different electrical characteristics may likewise change at rates different from the rate at which length changes.
Where is my 12GHz CPU?
Moore’s law, which predicts the doubling of the number of transistors on a chip every ~18 months, is still in effect; sadly, this doesn’t translate into clock speed. When transistors are miniaturized, the distances within the chip shrink as well, which should mean an increase in speed. However, due to heat dissipation problems (not all properties shrink at the same rate, and generated heat is one of them, remember?), chip designers are forced to reduce the voltage at which the chip components operate. Therefore, no clock speed gain.
This does enable us, however, to squeeze more cores into that optimal one cm^2 silicon pad. Hence the multi-core technological path that the industry has resorted to in the last couple of years.
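The square example is just the observation that area goes with the square of length. A tiny sketch makes the non-linearity easy to play with (the class and method names are made up for illustration):

```java
public class ScalingDemo {
    // Factor by which a square's area changes when its side is scaled by lengthFactor.
    static double areaFactor(double lengthFactor) {
        return lengthFactor * lengthFactor;
    }

    public static void main(String[] args) {
        double side = 0.5; // side shrunk to 50% of its original size
        double area = areaFactor(side);
        System.out.println("area left: " + area + ", area lost: " + (1 - area));
    }
}
```

Shrinking the side to 50% leaves a quarter of the area, which is the 75% reduction mentioned above.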
There’s always a trade-off
As the voltage at which the chip operates drops, chip designers are starting to face computational inaccuracy problems. How could we live in peace with this imprecision? the professor ponders; must we insist on absolute accuracy? Consider the task of rendering video: do we really care about the correctness of each pixel in each frame? Probably not. Just remember those old analog VCR and audio cassettes: they were highly inaccurate and were still able to deliver the goods. We might decide to compromise on accuracy, some of the time, in order to gain speed; just another type of trade-off. Programming language designers should assist chip designers by developing programming languages that are able to operate in a world of non-absolute certainty.
Also think about the built-in error correction mechanisms put into network protocol stacks.
Better in one world, worse in the other
A major problem with multi-core processing is that although inter-core communication enjoys high bandwidth (2.5GB/s), it is tainted by high latency (~75 clock cycles).
Another problem is that programs are written based on a shared memory model, in which all cores must coordinate when accessing the shared main memory; the cores’ caches must also be refreshed quite often. While this doesn’t seem to be a major problem for dual or quad cores, think about how heavily it limits performance on a not-so-futuristic-anymore 128-core chip.
Trying to refrain from shared main memory access might turn the tables on some of the disciplines we have grown accustomed to thinking of as obvious. For example, when you code a parametrized function you declare how parameters are passed: either by reference or by value. Declaring this at coding time (rather than deciding it at runtime) can be regarded as “early binding”. From a performance perspective, everybody knows that passing by reference is almost always faster than passing by value (assuming you don’t intend to change the passed value). This preference might not hold true on a multi-core system, which incurs an expensive overhead when it accesses the data that the reference points to in the shared main memory; no such price has to be paid if the parameter is passed by value.
One way in which future programming languages might deal with this is to allow late binding of the parameter-passing method. When running on a chip with only a few cores, a pass by reference will occur, whereas when running on a core-rich chip, a pass by value will be selected. This works only when the choice between pass by reference and pass by value makes no difference to the program logic (no changes to the parameter’s data are visible to the method caller, and the parameter’s data is not accessed concurrently), so that both can be used interchangeably.
Future languages will need to support this “late-binding” feature and others like it.
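To make the idea concrete, here is a rough emulation in today's Java. It is not how such a language feature would actually be implemented: Java always passes object references by value, so a defensive copy stands in for "pass by value" here, and the core threshold and all names are hypothetical.

```java
import java.util.Arrays;

public class ParameterPassingSketch {
    // Hypothetical threshold: above this many cores, hand the callee a private copy.
    static final int CORE_THRESHOLD = 16;

    // Decides at run time whether the callee gets the shared array ("by reference")
    // or a private copy of it ("by value").
    static int[] prepareArgument(int[] data, int cores) {
        return cores > CORE_THRESHOLD ? Arrays.copyOf(data, data.length) : data;
    }

    // A callee that only reads its argument, so both passing modes give the same result.
    static long sum(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(sum(prepareArgument(data, cores)));
    }
}
```

Because sum never writes to its argument and nothing else touches the array concurrently, the caller cannot tell which mode was chosen, which is exactly the condition under which the two are interchangeable.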
Summing up
It will be interesting to keep track of these trends of mutual influence between hardware and software.

