Posts Tagged ‘performance’

IBM’s PLDE seminar 2010 – Review

Wednesday, April 14th, 2010

I spent today at the IBM Programming Languages and Development Environments Seminar 2010, that took place at the beautiful Haifa Research lab mount Carmel campus. Things worth mentioning:

Gilad Bracha, father of Java Generics and auto-boxing, spent 60 minutes repenting Sun’s Java 1.0 early design mistakes, such as allowing primitives and static members into the language. IMHO the lecture itself was so-so. Gilad pointed out Java’s soft spots, but didn’t bother presenting the crowd what he views as the alternatives. What he did suggest was to check out his new baby programing language Newspeak (something for the purists I guess).

Perhaps some of Java’s charm at the early days was its simplicity and low learning curve, I’m not sure that a semantically perfect Java (could there by anything like this?) using nested classes instead of static members would have enjoyed the same mojo.

In one additional interesting lecture, Kathy Barabash, talked about how data structures with a sequential references object graph (say a LinkedList) do not allow traditional concurrent GC Tracing algorithms to scale on many-core (i.e., massive multi-core) platforms.

What good is your new 1,024 cores Intel processor if the desktop widget nuclear explosion simulation flickers because it can only scale on 400 of the available cores, right?

ConcurrentHashMap fat memory footprint

Tuesday, March 23rd, 2010

While running product sizing tests, we’ve found that an over enthusiastic usage of ConcurrentHashMap (CHM) had evaporated  a good ~170MB of much needed heap space (we ran with a 1.5GB heap).

As it turns out, a empty CHM weighs around 1700B. Yes, I’m talking about a map with no entries at all, just the plumbing!
We used a CHM to store user session attributes, having 100,000 user sessions generated 100K CHM instances worth 170MB of heap (100K times 1.7KB).
We took measurements using the super Eclipse MAT.

The obvious solution for saving these scares 170MB, was to switch from a CHM to a Hashtable. A Hashtable cost only around 150B per instance (8% of a CHM).
Other possible solutions could have been: moving to a list structure (seek time is not an issue as we rarely have more than 4-5 attributes per session), or resorting to a an array of Objects.

Change implications:

1. Performance - The product doesn’t have any user scenario that cause multiple threads to concurrently access the same session attributes map, so we don’t expect any performance loss, on the contrary, I’m expecting a hashtable to prove faster for single thread access, over a CHM.

2. Thread safety is a low risk aspect, as both CHM and HT provide the same basic guarantees for a single API operation (e.g., map.get(key)).

To conclude, a CHM is a good idea when you have a shared map structure suffering from a high R/W thread access contention. But dragging behind itself such a large memory footprint, CHM is not ideal to use in masses, or when concurrency performance is not the focus.

P.S
A CHM automatically allocates 16 segments, each with a 16-element array – one best practice is to measure the average map population during your product’s sizing tests, and initialize the CHM with the minimum initialCapcity and loadFactor, required to contain your usage.

Concurrent Modification Exception

Monday, February 8th, 2010

I ran into a ConcurrentModificationException (CME) during stress testing.
What does CME actually mean?
It means that you’ve modified (add, remove, update) your Collection while you’ve been iterating over it (usually in a multi-threaded fashion, but it can occur in a single thread that modifies while iterating).

A few more things to note about CME:
Best effort detection
- If you see a CME printout, first off, consider yourself lucky, CMEs are thrown only in best effort. In another universe, the concurrent modification would not have been detected, causing your collection to become corrupted, instead of fast-failing with a CME.

IDing the problem – Like deadlocks, CME’s are easy to pinpoint once you inspected the exception’s stack trace.

Avoiding CME:

  1. ListIterator
    To modify a collection by the same thread that is currently iterating on it, use a ListIterator that will allow you to perform both.
    Drawbacks – single thread solution only.
  2. Naive solution: Synchronizers
    Use locks to for mutually excluding traversal and modification operations.
    Advantages – easy to code.
    Drawbacks – very long lock periods while iterating.
  3. CopyOnWrite
    Take advantage of the Java.util.concurrent collections like: CopyOnWriteArrayList, CopyOnWriteArraySet. If you require a map then grab CopyOnWriteMap from Apache (this guys have been doing Sun’s dirty work for years now).
    Advantages – very good reading performance (no locks are used, instead visibility is obtained via map member volatility).
    Drawbacks – very bad write performance on large maps.
    Conclusion – use for seldom mutating collections.
  4. toArray()
    toArray will create a new array holding a copy of your Set (Map.keySet() for a Map).
    You can then iterate over the array, freely modifying the original collection (the array doesn’t change of course).
    Advantages – write operations are cheap.
    Disadvantages – copying the entire set could be expensive if it occurs too often, and/or the set is very large.
  5. Concurrent Collections
    If you want to go heavyweight, consider using: ConcurrentHashMap (or one of its package friends).
    Once you create an iterator over a ConcurrentHashMap (CHM), it does not freeze the collection for traversal, updates to the collection may or may not appear during the traversal (weakly consistent).

The approach I ended up taking
My use case was seldom modifying a ~ten items cache. A copyonwrite map was what I used.
In other cases I had, ConcurrentHashMap was the easiest solution (though make sure your code can live in peace with the CHM weak consistency property).

Best pic idea I could think of to visualize Threads :)

Why is Thread.sleep() inherently inaccurate

Sunday, August 23rd, 2009

Avi Ribchinsky, a friend and a college of mien, is transitioning from C++ to the Java world. He had been playing with Thread.sleep(), when he noticed that the sleep method might oversleep more than ordered, and moreover, it could also under sleep (see Fig 1). Coming from the C++ world, that surely caught him surprised ;)

Fig 1.

Thread.sleep() under sleeping

Thread.sleep() under sleeping

How is sleep implemented in Java anyway?

Avi came asking me if I knew anything about it, I was wondering myself how such a common and important method could be faking in the way shown above. Is it the OS? a Bug in the specific JRE version used? Maybe the API doesn’t guarantee milliseconds precision to begin with?
Thinking about all of these factors, we realized that we don’t really know how the JVM implements the sleep method functionality, my best guess would have been that the process registers itself in the OS for a wake up call, and the OS wakes the process via a software interrupt. OK, time to search the web.

The following article gives a very detailed answer, explaining that sleep is implemented by a thread giving up its OS scheduling quantum back to the scheduler, on the next execution quantum the thread gets, it has the chance to wake up and continue processing, or again continue sleeping.
Therefore, the accuracy resolution of sleep is directly dependent on the process scheduling resolution of the operating system in usage. Since windows XP process scheduling resolution is roughly 10ms, the sleep mechanism, in the Avi’s example, might had prefered to under sleep “a little” rather than oversleeping “a lot”, by waking himself in the current scheduling cycle quantum, rather than in the next, future, quantum.

The article also mentions that the inaccuracies are worsened when a process with a higher scheduling priority, than the sleeping process, is in a runnable state.

I assume that, running on a Hypervisor with course grained process scheduling would also produce greater inaccuracies.

sleeping

Conclusion

You can’t rely on the millisecond accuracy of the sleep method. Take a before and after time measurament to find the actual time spent sleeping, in order to avoid ever increasing inacurracies.
Sleep tight :)

Saving on memory usage in Java #1 – the Byte.valueOf method

Saturday, December 27th, 2008

Say you wanna keep in memory a list of martial arts experts and their respective shoe size. One way to implement it would be to populate a Map structure with the following sets of key and value:

Map map = ...
map.put("Jean-Claude Van Damme", new Byte(45));
map.put("Jet Li", new Byte(45));
map.put("Chuck Norris", new Byte(112));
...
map.put("person number million", new Byte(45));

What if your JVM runs on a Lego mechanical computer that has a very limited amount of memory, you would probably want to save on memory wherever possible.

A Lego computer

A state of the art 3Hz Lego computer

Autoboxing anybody?

Keeping in mind that an object instance weights much more than just the primitive it holds, as it hold additional “plumbing” data (monitor, etc). Even an Object class instance weighs 8 bytes while not holding to any application information. What about keeping only primitives as the map value?
Autoboxing, introduced in Java 5 onwards, allows to pass a byte primitive argument instead of a Byte object instance argument in the following manner:

map.put("Bruce Lee", 42);

Does this help us avoid the costly Byte Objects? Not really, the auto-boxing feature, as the name hints, just statically replaces the 42 literal with a new Byte object instance, this is done during compilation. So there’s no real saving opportunity here, and we’re back where we started.

AutoBoxing

AutoBoxing

How about a plain old cache?

Examining the code above, you notice that you are creating one million unique Byte objects to hold the fighters’ shoe size, even though there are only 256 different shoe size values. Is this a venue for saving?
Considering the fact that Byte objects are immutable, why not have just a single Byte object for each distinct byte value (we’ll need only 256 instance to cover all values). This way we’ll pass the same Byte instance to all people with a 45 shoe size, Jean-Claude and Jet-Li map in our case. This will reduce the number of Byte instance from a million to only 256. Sounds super!

Memorizing too many objects is hard

Too much objects in memory

How do you implement this? You’ll might rush into initializing an array of 256 Byte objects during application start-up, giving birth to something of this sort:

// init instances array
int RANGE_OF_VALUES = 2^8; // we don't care about negatives
Static Byte constShoeSizes = new Byte[
RANGE_OF_VALUES];
for (byte b=0; b<
RANGE_OF_VALUES; b++) {
constShoeSizes[b] = new Byte(B
yte.MIN_VALUE + b);
}
map.put("Jean-Claude Van Damme", constShoeSizes[45]);
map.put("Jet Li", constShoeSizes[45]);
map.put("Chuck Norris", constShoeSizes[112]);

Enter the valueOf() method

WHOA! Hold you horses! Doesn’t this use case seems to be just too common and trivial?! haven’t the Java language designers and implemented came accross the same problem? Surely, some of the JRE classes themselves must have Byte instances data members. In an effort to reduce the JRE memory footprint, won’t the JRE programmers cache instances using something very much like the static Byte array we implemented ourselves?
The short answer of course is YES! Java 5 presents a new overloaded Byte.valueOf(byte b) method. This method returns a reference to a Byte instance taken from a shared cache. This trivial cache strategy save memory and CPU, as there’s no need to construct new objects and later on garbage collect them.
Here’s the relevant Byte.valueOf method source code taken from Byte.java source:

private static class ByteCache {
private ByteCache(){}

static final Byte cache[] = new Byte[-(-128) + 127 + 1];

static {
for(int i = 0; i < cache.length; i++)
cache[i] = new Byte((byte)(i - 128));
}
}
...
public static Byte valueOf(byte b) {
final int offset = 128;
return ByteCache.cache[(int)b + offset];
}

Using the valueOf method, here’s how the final version of our code will look like:

map.put("Jean-Claude Van Damme", Byte.valueOf(45));
map.put("Jet Li", Byte.valueOf(45));
map.put("Chuck Norris", Byte.valueOf(112));

Wrapping up quickly:

  1. From Java 5 onwards, use the valueOf method for Number extenders like: Byte, Short, and Integer. Notice that as the Integer object has 2^32 different values, only the (-128) to 127 values range is cached. Meaning that expression (Integer.valueOf(129)==Integer.valueOf(129)) will always be false, since it returns a new Integer object on every call.
    Other object types (Double,Float, etc…) valueOf method does not implement a cache at all. If your value range is limited in nature, you might choose to create a caching scheme of your own.
  2. Always be on the lookout and Inspect repetetive Instance creation closely, see if you can avoid it by referencing an shared immutable object, or by borrowing an instance from an object pool.
  3. Strings can have an even larger space and time performance gains than numbers objects, though at the same time they are inherently harder to reuse. You might want to take time to learn about Strings instances reuse strategies; start with the String.intern() method.

Cycling through the Integer range – A Fermi problem

Saturday, May 24th, 2008

After graduating in economics during the summer of 2005, I went interviewing for a business analyst position in a couple of business consulting firms (e.g. Mckinsey & Company).

06232007living.jpgSince, real life, business dilemmas requires estimating and decision making under uncertainty (not all of the required information is available nor it is accurate), a major part of the interview for these type of firms is confronting you with the How many pay phones are there on the island of Manhattan?” type of problems, also known as Fermi’s problems.
Although that, at first, these problems seem quite puzzling, given that you remain focused, methodical and leverage a modest amount of common sense, it all gets pretty easy. The “trick” is to combine basic facts which you already know, with some four grader algebra, doing this brings you to good enough estimates.

midtown-manhattan-city-street.jpg

Allow me to introduce to you a quick CS Fermi problem that someone through around while in the office. The problem might also be presented during an interview with a fresh graduate student candidates. Here goes:

Running on your average home computer (A single 2Ghz core), how long would it take for this Java program to complete it’s operation?

long startTime = System.currentTimeMillis();
for (int i=Integer.MIN_VALUE; i<Integer.MAX_VALUE; i++) {
};
System.out.println(System.currentTimeMillis()-startTime);

How long? Two nanoseconds? Three seconds? Four hours? Five years? Six centuries? Seven millenniums? What’s important here is the order of magnitude and not the exact answer, you might find this question to be trivial, but you will be surprised of how many people can’t get a clue on how to start answering it. Take thirty seconds and try to come up with your own estimation, before reading through my estimation:

Let’s compute a ball park figure:
Since an Integer is a 32Bit creature, the loop will cycle 2^32 times (about 4.3 billion times. Remember that a billion is 10^9). The 2006′s average home computer CPU runs at about 2GHz, this means that the CPU can perform two billion simple instructions per second (Complex instructions consume several CPU cycles).

The loop does three obvious operations on each cycle: (1) I is incremented. (2) the values of i and the max Integer constant are compared between (3) we jump back to the beginning of the loop.
All are fairly simple instructions (don’t have to be an assembly programmer to know that), so it’s safe to assume that these instructions are executed with in a single CPU cycle.
BTW: Instructions 2 and 3 can be combined in to a single instruction (jump is less then).
If the loop would have been coded in assembly language, my guess is that it would take 4 seconds to complete: (2 instructions) * (4*10^9) loop cycles / (2*10^9) instructions/sec = 4 seconds. Thus, we have just found the lower limit value for our answer: the Java code couldn’t execute in under 4 seconds.

My guesstimation would probably be between 4-40 seconds.

Other possible influencing factors:
(*) Now we know that Java is not effective as machine language and adds some overhead to our code. Depending on the implementation of the JVM in use, the method might be complied to machine native code, instead of executing in interpreted mode. This would improve the performance of course.

(*) As i recall, by JLS specification, the JVM is obliged to check for overflow while incrementing integers; If so, this will add a fix number of operations per loop cycle.

(*) Since our Integer isn’t volatile (a local variable can’t be volatile anyway), its value would be probably cached in one of the CPU’s registers throughout all of the loop execution. Have it been declared a volatile, the JVM would had been forced to read and write the Integer value to the machine’s main memory on each operation that involves the Integer variable. since memory CAS latency is measured in nanos as well, this should, theoretically, add a fixed cost for each loop cycle (~10-100 nanos), possibly increasing the estimation’s order of magnitude by a factor of one.

(*) Running on a multi-core chip should have no direct positive effect, as this is a single threaded program.

Here is the relevant disassembled Bytecode:
4: ldc #3; //int -2147483648
6: istore_3
7: iload_3
8: ldc #4; //int 2147483647
10: if_icmpge 19
13: iinc 3, 1
16: goto 7

Actual results:
(1) On my IBM T41 ThinkPad it took 80 seconds to complete.
(2) On my workstation at home, equipped with an Intel core2 6300 1.8GHz CPU it tool only 9 seconds to complete.

Since I can’t explain such discrepancies. I’ll have to check further and update with new information. Try it yourselves!