Monday, May 23, 2011

Memory optimizations for Static Data - ByteArrays et all

This article details our course taken to statically load and query for a large number of Java objects in memory. We started with using 5-6GB per million objects using a hashmap. It currently stands at 150 Mb per million objects. Thats a 40X improvement over the Java Hashmap. If you think loading large amounts of data into RAM and querring it fast is something you are interested in, read on...

Java HashMaps - A logical starting point was using a Hashmap. While using a hashmap, speeds were awesome (~3Million QPS per thread). Space though was painfully bloated. Storing about 1 Million objects in RAM required ~6GB of space. The actual size of the data was as follows -
  • 15 strings of about 20 characters each. Assuming 2 bytes per character - 15 X 20 X 2 ~600Bytes
  • 15 other ints, floats etc. 4 X 15 = 60 Bytes.
  • Total = 660 bytes
Data size for 1 Million objects - 660 bytes X 1 Million = 660 MB
Actual RAM used by java hashmap and underlying objects - 6GB

Performance details for the Java Hashmap are below
Objects loaded Space usedLoad factor(Actual space/data size)Queries per second - Single threadtime for 1000 queries(single thread)
1 Million5-6 GB9-103 Million.3ms
I'm considering time for 1000 lookups because a typical request to our application required ~1000 lookups. Speed was a non issue but the size of data was a big one. There is approximately a 9X bloat for storing various references, indexes and bloated Java Strings!!!

The optimization machinery needed to spring into action. Here is the problem statement -
  • Need to statically load several million Java objects into RAM. (Several = As many as possible upto ~45Million)
  • Each of the objects contained about 15 Strings and about 15 other native data types.
  • All lookup will be key based. public Bigbject get(String key)
  • Searching on this data should be extremely efficient. A couple of thousand queries in a couple of ms is an absolute must.
Several alternatives were tried and with various degrees of success. Let me order the solutions in an increasing orders of success.

Judy arrays with JNI bindings - This was meant to be a replacement for the Java Hashmap. This Trie based structure(in c) guaranteed efficient memory usage and extremely fast access. Thanks to Carl Lobo for the JNI bindings. Using this turned out to be quite a problem. At the time of writing this, we only had a Linux based judy library so the Windows devs needed to deploy remotely to test their stuff. The JNI bindings were not complete so you needed to add to that to explore more complex functionality. But still, we got quite a boost in memory usage. Judy would return us a number that would be the lookup for in a Java Arraylist. Space -time characteristics achieved by this are detailed below.
Objects loadedSpace usedLoad factor(Actual space/data size)Queries per second - Single threadtime for 1000 queries(single thread)
1 Million1.7 GB2.5-31.5-2 Million.6ms

While there was 3X gains of memory, the bloat was still substantial.

A simple Sorted Java Array - Hold on a second, did I just say Array! Is the solution to this convoluted problem a simple Array. Turns out it does improve space requirements and speeds are quite acceptable. This solution was to just do binary searches over an array of objects sorted by the key. We also happen to have eliminated some of the data that we were loading into the object. the data in the new object was about 500 bytes. Space time characteristics were as below.
Objects loadedSpace usedLoad factor(Actual space/data size)Queries per second - Single threadtime for 1000 queries(single thread)
1 Million1 GB20.5 Million2ms

Something though still made us look further, why the hell is the load factor greater than 1? Also some of our strings were repeated. Should the load factor not be actually less than 1? Also we had a realization that an Empty String in Java takes about 40 bytes. In addition, our Strings were English Strings so we just wanted to store one byte per character. These thoughts led us to the native ByteArray implementation.

Native Byte Arrays - What was this. All the objects as just one big chunk of bytes allocated statically. Something like
Also since all our Strings were English, we decided to only store 1 byte per character.
To get to the 50th element you would just do this calculation.
Elem(50) = 49 X ObjSize to 50 X ObjSize - 1

Due to the english constraint, size of a single object was reduced to ~300 bytes.

Helpers were written to ensure that transforming(to and from Java objects), comparison and binary search could be done easily. For the Strings, we made the following assumptions -
  • Strings would be English
  • String length will not exceed 30 characters. If it does exceed, they will be truncated.
Not only did this meant that the load factor will not exceed 1, but we were actually able to do better. :)
All Strings that existed were put into 1 static Bytearray. And each of the objects just had references to this big bytearray of Strings. This meant that each distinct String would be only stored once. All occurances would just be references to the Big String bytearray.
Space time characteristics of the ByteArray based structures are below.
Objects loadedSpace usedLoad factor(Actual space/data size)Queries per second - Single threadtime for 1000 queries(single thread)
1 Million150Mb0.5300 Thousand3 ms

Shortcomings and fixes - The current constraint of having to truncate > 30 character strings can be avoided if just have string terminations by a special character and all references to the string bytes refer to the actual byte index at which they are stored and not the array index. Also the same Strings bytearray can be adjusted to include unicode instead of ASCII.
We are yet to take up these last 2 tasks and we don't really need them in the foreseeable future. I'll check if we can open source this implementation for all to use and improve.

*Please note that all credit for conceptualizing, design and implementation of the ByteArray Structure goes to Harish Chiugurupati. I am merely someone who understands the problem that was solved and the solution.

2 comments:

Paras Malik said...

is there any difference in performance on Linux and windows machine ?

Prashant said...

Not really checked, but I don't think so. All operations are so raw and basic ByteArray structure. I dont see any reason why they will be different. btw. Some of the numbers listed in the post are from linux machines and some on Windows. They have been similar machines from a hardware config though.