Java Performance with JaPe

One of the hot topics at a client I’m working for is “performance”: everything has to be super fast and response times must be super low, while the applications are supposed to create unicorns that shit rainbows at the click of a button. When an application is slow, we usually look at the following aspects:

  • Slow web services
  • Memory and garbage collection
  • Anything database related
  • Filesystem interactions

And usually we dig deep using tools like Dynatrace to analyze the performance (or the lack thereof).

What we usually forget is that behind all these web services, database calls, etc. there’s just a lot of Java code. We all know how to achieve high performance on an architectural level, but what if we go a level lower? What’s the fastest way to write a for-loop? Which Java Collection is the fastest? Nobody seems to know.

That’s why I wrote JaPe, a simple Java performance benchmark. It’s a small program I wrote to test some commonly used implementations of java.util.Collection and java.util.Map, along with some other common things like String concatenation and the differences between primitives and their object counterparts. An hour of coding, I thought; I couldn’t have been more wrong. (PS: if you’re just interested in results and answers, scroll down to about a quarter of this post.)

Pitfalls

While writing this little benchmark I encountered several pitfalls, each of which would have skewed the results one way or the other and rendered the benchmark useless.

Garbage Collection

One of the great aspects of Java is that you don’t have to clean up after yourself; the Garbage Collector will make sure that all unreferenced objects are removed from memory sooner or later. The downside is that you, as a developer, have nearly no control over this garbage collector: it will run when it has to, and nothing will prevent it from running while JaPe is performing its tests. I decided that this is not a bad thing, since garbage collection is part of the Java platform and trying to exclude it from the benchmarks would be cheating. After each test (and sometimes even more often) I call the GC myself to make sure that each test starts with a clean heap.
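A minimal sketch of what such a forced-GC helper could look like (my own assumption, similar in spirit to the GCManager.forceGC() call you’ll see further below; the WeakReference canary trick is mine, not necessarily JaPe’s):

import java.lang.ref.WeakReference;

public final class GCHelper {

	private GCHelper() {}

	public static void forceGC() {
		// Weakly referenced canary: once it has been collected, a GC really ran.
		WeakReference<Object> canary = new WeakReference<Object>(new Object());
		// System.gc() is only a hint, so keep asking until the canary is gone.
		while (canary.get() != null) {
			System.gc();
			System.runFinalization();
		}
	}
}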

All tests are performed with the following memory arguments: -Xms768m -Xmx768m -Xmn736m -XX:PermSize=16m -XX:MaxPermSize=128m -XX:SurvivorRatio=64 -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -Xloggc:gc.log

For those who don’t understand them:

  • Xms: initial heap size
  • Xmx: maximum heap size
  • Xmn: new generation size
  • XX:PermSize: initial permanent generation size
  • XX:MaxPermSize: maximum permanent generation size
  • XX:SurvivorRatio: ratio between the new generation and the survivor spaces
  • PrintGC: print basic GC info
  • PrintGCDetails: more verbose GC info
  • PrintGCTimeStamps: add a timestamp to the GC info
  • UseConcMarkSweepGC: use the concurrent mark-sweep collector
  • Xloggc: dump all GC info to a file instead of printing it to sysout

The survivor spaces and old generation are kept extremely small since no data is persisted and promotion to tenured is nearly non-existent during the benchmark. The following screenshot shows the memory usage of the JaPe benchmark: during the complete lifecycle of the benchmark, almost no data is kept in memory. The 42 seconds of garbage collection are mostly due to explicit calls to System.gc(), which happen outside the tests and have no influence on the results.

Memory usage during the JaPe benchmark (1=sleep, 2=warm-up, 3=tests, 4=export Excel)

Bad Timers

In my first attempt I used System.nanoTime() to measure the time needed to execute each test-method, which meant calling it 100,000 times per test. I quickly discovered that this method is not very precise, and calling it that many times introduced at least some degree of error in my results. When running the following snippet before each test, I got averages between 1500 ns and 2500 ns! Repeated 100,000 times, that gives a random offset between 0 and 100 ms, which is huge for a test that only takes 40 ms to complete its 100,000 iterations.

// Measure the average overhead of a pair of System.nanoTime() calls
long nanos = 0;
long start;
for (int i = 0; i < 10000; ++i) {
	start = System.nanoTime();
	nanos += (System.nanoTime() - start);
}
double precision = nanos / 10000.0; // average overhead in nanoseconds

The following JDK bug report gives a bit more background on the nanosecond problem: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6440250

After discovering the limitations of the nanosecond timer, I decided to measure the total execution time of each test-method in milliseconds. This way I also measure the time needed to process the for-loop itself, but this offset is the same for every test and can thus be ignored. To improve the accuracy, I also repeated certain operations 40 times to make sure the total time would exceed at least 1 ms.

To further improve the accuracy I took the following steps (a sketch of the resulting timing harness follows the list):

  • Sleep the main thread so the JVM can finish its initial work
  • Call System.runFinalization() to make sure the JVM is ready (as suggested in bug-id 6440250)
  • Pre-run each test to make sure all classes are loaded
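Put together, the measurement loop looks roughly like this (a simplified sketch; the names LOOPS and measure are my own, not necessarily JaPe’s):

import java.lang.reflect.Method;

public final class TimingSketch {

	private static final int LOOPS = 100000;

	public static long measure(Object test, Method method) throws Exception {
		Thread.sleep(2000);          // let the JVM settle
		System.runFinalization();    // as suggested in bug-id 6440250

		method.invoke(test);         // pre-run: trigger class loading

		// Total wall-clock time in milliseconds; the loop overhead is
		// identical for every test and can therefore be ignored.
		long start = System.currentTimeMillis();
		for (int i = 0; i < LOOPS; i++) {
			method.invoke(test);
		}
		return System.currentTimeMillis() - start;
	}
}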

Runtime Optimizations

This was probably one of the more difficult issues, and solving it involved writing a custom ClassLoader. The JVM optimizes code at runtime, and somehow it managed to make the same Boolean test run 10-100x faster over the JVM’s lifetime!

It turns out you can bypass this behavior by unloading the test-classes. To make sure this happens, all classes in the be.pw999.jape package are loaded and cached in a separate classloader, while all other classes are loaded using the sun.misc.Launcher$AppClassLoader. Once all tests are finished, the classloader’s cache is cleared, all remaining objects are nullified, and a single call to System.gc() is all that’s needed to unload the classes; the JVM is then ready to re-run each test with a different initial size.
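This is roughly the idea (my own simplified sketch, not the actual JaPeClassLoader):

import java.net.URL;
import java.net.URLClassLoader;

// Loads the benchmark's own classes itself (so dropping the loader also
// drops the classes and their JIT state) and delegates everything else
// to the parent classloader.
public class ScopedClassLoader extends URLClassLoader {

	public ScopedClassLoader(ClassLoader parent, URL[] urls) {
		super(urls, parent);
	}

	@Override
	public Class<?> loadClass(String name) throws ClassNotFoundException {
		if (name.startsWith("be.pw999.jape")) {
			// Child-first for the benchmark package
			Class<?> loaded = findLoadedClass(name);
			return (loaded != null) ? loaded : findClass(name);
		}
		return super.loadClass(name); // parent-first for everything else
	}
}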

Class unloading between two tests clearly visible in JConsole

The Benchmark

Very simply put, this is what JaPe does, described in pseudo-code:

JaPe {
	for (TestClass clazz : testClasses) {
		Test test = JaPeClassLoader.getClass(clazz).newInstance();
		test.initialize();	//test-object
		for (TestMethod method : test) {
			long totalExecutionTime = 0;

			if (canRunIndefinitely(method)) {	//Method does not need re-init of the test-object
				long start = currentMillis();
				for(0..LOOPS) {
					invokeMethod(test, method);
				}
				totalExecutionTime += (currentMillis() - start);
			} else {
				int limit = getLimit(method);	//How often it can run before test-object must re-initialize
				int executions = LOOPS / limit;
				int leftOvers = LOOPS % limit; // LOOPS = (executions * limit) + leftOvers
				for (0..executions) {
					long start = currentMillis();
					for (0..limit) {
						invokeMethod(test, method);
					}
					totalExecutionTime += (currentMillis() - start);
					test.initialize();
					doGarbageCollectionFromTimeToTime();
				}
				long start = currentMillis();
				for(0..leftOvers) {
					invokeMethod(test, method);
				}
				totalExecutionTime += (currentMillis() - start);
			}

			if (dirtiesTestObject(method)) {
				test.initialize();
			}

		}
		test.destroy();
		doGarbageCollection();
	}
}

For each subject I wanted to test, a test-class was created. Such a test-class is an implementation of the ITest interface; a good example is AbstractCollectionTest. Each test-class has several test-methods, and each test-method is invoked 100,000 times. The total time needed to invoke a test-method 100,000 times is recorded, and that is the value shown in the charts.

During the initialization phase, objects can be initialized with a given size; in the results below, these initial sizes were 1024 and 4096. A Collection, for example, would be initialized to contain 1024 items, after which the tests would be executed against this Collection. During a second iteration, the Collection would be initialized to contain 4096 items.
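As an illustration, a test-class could look something like this (a hedged sketch; the ITest method names initialize/destroy are my guesses based on the pseudo-code above, not necessarily the real interface):

import java.util.ArrayList;
import java.util.List;

public class ArrayListIterationTest implements ITest {

	private List<String> read;

	@Override
	public void initialize() {
		// Fill the collection up to the configured initial size
		// (1024 in the first iteration, 4096 in the second).
		read = new ArrayList<String>();
		for (int i = 0; i < Settings.LIST_SIZE; i++) {
			read.add("item-" + i);
		}
	}

	// A test-method: iterate the list with an enhanced for-loop.
	public void testEnhancedForLoop() {
		String z;
		for (String s : read) {
			z = s;
		}
	}

	@Override
	public void destroy() {
		read = null;
	}
}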

All test-classes run inside a separate scope, with their own classloader, which is garbage collected after all tests have ended:

try {
	JaPeClassLoader classLoader = new JaPeClassLoader(log.getClass().getClassLoader(), new URL[]{new File("./").toURI().toURL()});
	{
	    Class<?> runnerClazz = classLoader.loadClass("be.pw999.jape.TestRunner");
	    Class<?> settingsClass = classLoader.loadClass("be.pw999.jape.constants.Settings");
	    settingsClass.getDeclaredField("LIST_SIZE").setInt(null, Settings.LIST_SIZE * Double.valueOf(Math.pow(4.0, k)).intValue());

	    Object runner = runnerClazz.newInstance();
	    Method m = runnerClazz.getMethod("run");
	    m.invoke(runner);
	    m = null;
	    runner = null;
	}
	classLoader.clear();
} catch (Exception e) {
	// ... lots of exceptions handled here
}
GCManager.forceGC();

The Results

Collections

Iterating

The following charts show the total execution time needed to iterate different Java collections containing either 1024 or 4096 items. The charts contain implementations of java.util.Collection and java.util.List, the latter having more methods and thus more results. For comparison, a String-array was also tested. The following code quickly explains each test:

//Convert to array, then for-loop by index with cached length
String[] array = read.toArray(new String[0]);
int l = array.length;
for (int i = 0; i < l; ++i) {
    z = array[i];
}

//For loop by index
for (int i = 0; i < read.size(); ++i) {
    z = read.get(i);
}

//For loop by index with cached length
int l = read.size();
for (int i = 0; i < l; ++i) {
    z = read.get(i);
}

//For loop enhanced
for (String s : read) {
    z = s;
}

//Iterate with iterator
Iterator<String> it = read.iterator();
while (it.hasNext()) {
    z = it.next();
}

And the results are:

Execution time for iterating different Java Collections with 1024 items

Execution time for iterating different Java Collections with 4096 items

The first thing to notice is how fast arrays are. The second is how slow the LinkedList implementation is when iterating by index: I would have to multiply the scale by 8 to make it fit on the charts!

In general, based on these results, the fastest way to loop a collection is:

  • If possible: use a simple array
  • Implementations of java.util.Set: use the enhanced for-loop
  • Implementations of java.util.Collection: convert to array and loop it by index (int i = 0; i < l; ++i)

Other functions

The following charts show some other often-used functions of java.util.Collection and java.util.List.

Common Java Collection functions on collections with 1024 items

Common Java Collection functions on collections with 4096 items

A bit surprising, but not really, is that Sets perform better when removing an object by reference, probably because they’ve already calculated the hash code of every object in the set.
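To make “removing by reference” concrete, here’s a quick illustration (a sketch, not the benchmark code itself):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveByReference {
	public static void main(String[] args) {
		List<String> list = new ArrayList<String>();
		Set<String> set = new HashSet<String>();
		for (int i = 0; i < 4096; i++) {
			String s = "item-" + i;
			list.add(s);
			set.add(s);
		}
		list.remove("item-2048"); // O(n): linear scan using equals()
		set.remove("item-2048");  // O(1) on average: straight to the hash bucket
	}
}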

Maps

Iterating

Everybody knows you should iterate a java.util.Map via its entry set, but how much slower is it when you don’t? The following tests were executed on Maps with 1024 and 4096 items.

//Loop map via key look-up using iterator
Iterator<String> it = read.keySet().iterator();
while (it.hasNext()) {
    z = read.get(it.next());
}

//Loop map via key look-up using enhanced for-loop
for (String s : read.keySet()) {
    z = read.get(s);
}

//Loop map via entrySet using iterator
Set<Map.Entry<String, Double>> set = read.entrySet();
Iterator<Map.Entry<String, Double>> it = set.iterator();
while (it.hasNext()) {
    z = it.next().getValue();
}

//Loop map via entrySet using enhanced for-loop
Set<Map.Entry<String, Double>> set = read.entrySet();
for (Map.Entry<String, Double> s : set) {
    z = s.getValue();
}

And the results are:

Iterating Java maps with 1024 values

Iterating Java maps with 4096 values

I think it’s safe to say that looping a Map via its key set is a bad idea. Even in the most ideal case, it needs 50% more time to complete. The worst-case scenario (TreeMap) was 13.6x slower and went completely off the scale…

In general, it doesn’t seem to matter whether you use an Iterator or an enhanced for-loop; the differences are small enough to call both methods equally fast.

Other functions

The following charts show some other often-used functions of java.util.Map.

Common Java Map functions on maps with 1024 items

Common Java Map functions on maps with 4096 items

Not much to say here, unless you really want to discuss a difference of 100 milliseconds over 100,000 iterations ;)

Strings

String concatenation is often considered slow and is preferably done using a StringBuilder or StringBuffer. To test this, I concatenated a whitespace character to an existing String, 100,000 times.
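The three variants boil down to roughly this (a sketch of the idea; the exact code is in the repo):

// + operator: allocates a new String and copies all characters each time
String s = "";
s = s + " ";

// StringBuilder with a properly initialized capacity: amortized O(1) appends
StringBuilder sb = new StringBuilder(4096);
sb.append(' ');

// StringBuffer: same as StringBuilder, but synchronized
StringBuffer sf = new StringBuffer(4096);
sf.append(' ');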

String concatenation and toString() on 1024 chars

String concatenation and toString() on 4096 chars

Concatenating Strings with the + operator is indeed a lot slower than using a StringBuffer or StringBuilder. Doing 100,000 String concatenations with the + operator is roughly 1,900 times slower than a properly initialized StringBuffer or StringBuilder.

The toString() method on StringBuilder and StringBuffer seems extremely slow, but keep in mind that the time shown in the charts is the time needed to call toString() 100,000 times.

Numerics

What is faster: wrapper objects (Integer, Double) or primitives?

Three different tests were executed for addition, subtraction, multiplication and division, both for integers and doubles (a sketch of the three variants follows the list):

  • with primitives: only uses primitives
  • with instanced objects: relies on Integer.valueOf(..) and Double.valueOf(..) to acquire a wrapper object for every operation
  • with fixed objects: uses wrapper objects declared and initialized at class level
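In code, the three variants boil down to something like this (my sketch, shown for int addition; names are illustrative):

int a = 2, b = 3;

// 1. primitives only
int sum1 = a + b;

// 2. instanced objects: a wrapper is acquired via valueOf(..) each time
Integer sum2 = Integer.valueOf(a) + Integer.valueOf(b); // unbox, add, re-box

// 3. fixed objects: wrappers created once (at class level in JaPe), reused here
Integer fixedA = Integer.valueOf(2);
Integer fixedB = Integer.valueOf(3);
Integer sum3 = fixedA + fixedB; // still unboxes and re-boxes per operation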

These tests are not influenced by the initial size (1024 or 4096), but I’m showing both results anyway, just to show that they are nearly identical.

Numerics tests (1024)

Numerics tests (4096)

The results are obvious: primitives are a lot faster; even compared to a cached wrapper object, primitives still win by a wide margin. Acquiring a wrapper via valueOf(..) for every operation is clearly a lot slower and should be avoided if possible.

Booleans

Which is faster: inverting a Boolean object or inverting a primitive boolean?

These tests are not influenced by the initial size (1024 or 4096), but I’m showing both results anyway, just to show that these are also nearly identical.
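The two inversion variants, roughly (a sketch):

boolean p = false;
p = !p;               // primitive: a simple flip, no allocation

Boolean q = Boolean.FALSE;
q = !q;               // wrapper: unbox, flip, re-box via Boolean.valueOf(..)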

Inverting a boolean (1024)

Inverting a boolean (4096)

Pretty neat, huh? Inverting a Boolean is twice as slow as inverting a boolean.

Looping

What’s the fastest way to write a for-loop? The loops run from 0 to the initial size (1024 or 4096), and this 100,000 times.
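Roughly, these are the variants being compared (a sketch; SIZE stands for the initial size):

final int SIZE = 1024;

for (int i = 0; i < SIZE; ++i)      { } // < with pre-increment
for (int i = 0; i < SIZE; i++)      { } // < with post-increment
for (int i = 0; i <= SIZE - 1; ++i) { } // <= variant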

Looping from 0 to 1024

Looping from 0 to 4096

We’re talking about small differences, but it seems < is a tad slower than <=, whereas ++i versus i++ doesn’t change a thing, so do whatever suits your needs :)

For the record, all tests were executed on a Dell Inspiron 6400 running Linux Mint, with the performance CPU governor loaded:

 INFO | 27/12 13:05:24 | TestRunner | ************************************************************************
 INFO | 27/12 13:05:24 | TestRunner | *  Initial object size:       4096                                     *
 INFO | 27/12 13:05:24 | TestRunner | *  Ms time precision:         1ms                                      *
 INFO | 27/12 13:05:24 | TestRunner | *  Number of loops:           100000 (Integer.MAX_VALUE/21474)         *
 INFO | 27/12 13:05:24 | TestRunner | * CPU Mhz:                    1600                                     *
 INFO | 27/12 13:05:24 | TestRunner | * CPU Model:                  Core(TM)2 CPU         T5200  @ 1.60GHz   *
 INFO | 27/12 13:05:24 | TestRunner | * CPU Vendor:                 Intel                                    *
 INFO | 27/12 13:05:24 | TestRunner | * Input args:                 -Xms768m -Xmx768m -Xmn736m -XX:PermSize=16m -XX:MaxPermSize=128m -XX:SurvivorRatio=64 -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -Xloggc:gc.log -Dfile.encoding=UTF-8 *
 INFO | 27/12 13:05:24 | TestRunner | * Java Arch DM:               32                                       *
 INFO | 27/12 13:05:24 | TestRunner | * Java Vendor:                Oracle Corporation                       *
 INFO | 27/12 13:05:24 | TestRunner | * Java Version:               1.7.0_09                                 *
 INFO | 27/12 13:05:24 | TestRunner | * Max CMS Old Gen (mxb):      32Mb                                     *
 INFO | 27/12 13:05:24 | TestRunner | * Max CMS Perm Gen (mxb):     128Mb                                    *
 INFO | 27/12 13:05:24 | TestRunner | * Max Code Cache (mxb):       32Mb                                     *
 INFO | 27/12 13:05:24 | TestRunner | * Max Mem (runtime):          756Mb                                    *
 INFO | 27/12 13:05:24 | TestRunner | * Max Par Eden Space (mxb):   713Mb                                    *
 INFO | 27/12 13:05:24 | TestRunner | * Max Par Survivor Space (mxb):11Mb                                    *
 INFO | 27/12 13:05:24 | TestRunner | * OS Arch:                    i386                                     *
 INFO | 27/12 13:05:24 | TestRunner | * OS Name:                    Linux                                    *
 INFO | 27/12 13:05:24 | TestRunner | * OS Version:                 3.5.0-17-generic                         *
 INFO | 27/12 13:05:24 | TestRunner | ************************************************************************

The complete output can be found on BitBucket.

Small disclaimer: I’m not a benchmark/performance ‘expert’ or ‘professional’, and I don’t claim to be one. I’m just a guy who likes developing, and JaPe was definitely a good challenge. JaPe is open source, licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 2.0 Belgium License.
