How Much Memory Do I Need for My Data … Part 2

"How much memory do I need for my data?" — continued!

In an earlier post, we examined the space needed to store basic data, such as numbers. However, we deferred discussion of more complex data structures, like the example below:

class Person {
    String firstName;
    String lastName;
    int age;
}

At the time, we declared sizing this example to be surprisingly difficult, and now we will find out why.

A struct in Java

The setting in-memory-format=OBJECT allows us to dictate that data is stored deserialized in Hazelcast.

Hazelcast on the server-side is Java, so deserialized storage means Java format.
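This is a per-map choice. As a minimal sketch, assuming a map named "people" (the map name is illustrative):

import com.hazelcast.config.Config;
import com.hazelcast.config.InMemoryFormat;

Config config = new Config();
// Hold entries of the "people" map deserialized, as Java objects
config.getMapConfig("people").setInMemoryFormat(InMemoryFormat.OBJECT);

The default, BINARY, keeps entries serialized.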

The size for Person will be maybe 36 bytes, using this calculation:

(probably 12) + 2 * (4 or 8) + 4

We know from the previous post that the object header is likely to be 12 bytes. The references to the first and last name fields will each be either 4 bytes or 8 bytes. The age field, as an int, is 4 bytes.

We also know the reference fields, firstName and lastName, will likely need to be aligned on an 8-byte boundary, placing them at offsets 16 and 24 from the start of the object.

Are 4 bytes wasted?

If the object header is 12 bytes and the firstName field starts at offset 16, does this mean 4 bytes are wasted in the middle of the object?

Possibly not; the JVM should be clever enough to move the 4-byte age field from the end to the start, to use this space. So the definition becomes:

class Person {
    int age;
    String firstName;
    String lastName;
}

If this is the case, then the likely size of the Person object is (probably 12) + 4 + 2 * (4 or 8), 32 bytes.

However, if we change the age field from int to long, that field grows from 4 to 8 bytes. At 8 bytes, it can no longer fit in the 4-byte gap. So the storage is (probably 12) + 2 * (4 or 8) + 8, 40 bytes.

We would naturally assume age is measured in years, and so an int would be fine. But we might later find it needs to be measured in nanoseconds, and needs to become a long.

In other words, changing a 4-byte field to an 8-byte field might change the object size from 32 to 40. One field increases by 4 bytes but the object increases by 8.

This sort of oddity only makes sense if you know the intricate details of what Java is doing, something you have to know to do sizing absolutely accurately.

This also explains the "maybe 36" above: it could be 32 or 40.
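Rather than working all of this out by hand, we can ask the JVM. Here is a sketch using the OpenJDK JOL tool; it assumes the jol-core library is on the classpath:

import org.openjdk.jol.info.ClassLayout;

// Prints each field's offset, any alignment gaps,
// and the total instance size on the running JVM
System.out.println(ClassLayout.parseClass(Person.class).toPrintable());

The output settles whether Person is 32 or 40 bytes on your particular JVM.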

Shallow or deep size

In Java, compound objects are constructed by reference. The block of memory for the Person object does not hold the firstName object as a subsection. The firstName object is instead stored somewhere else in memory and the Person stores the memory address of the firstName.

What that means here is that sizing the Person object alone gives the shallow size, which does not include the first and last name objects. In other words, the answer is incomplete and therefore wrong.

We need to calculate the deep size: the size of the Person plus the transitive size of all the objects it refers to.
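JOL can walk the references for us too. A sketch of a deep-size measurement, again assuming jol-core and, purely for illustration, a Person constructor:

import org.openjdk.jol.info.GraphLayout;

Person person = new Person("John", "Doe", 5);
// totalSize() follows every reference: the Person, both Strings,
// and the arrays inside those Strings
System.out.println(GraphLayout.parseInstance(person).totalSize());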

Java sizing problem

We might calculate the deep size of a Person object as 130 bytes or so: roughly 30 for the Person itself, plus roughly 50 for each of the two typical-length String objects holding the first and last names.

However, it might not be right to assume 130 MB to store 1 million such Person objects.

The reason is, there are likely not 1 million unique first names or last names.

We might be clever and exploit the fact that a String in Java is immutable. Two different Person objects could safely refer to the same String object to record two people having the same first name.

In Java terms, those first names would be both "equals()" and "==".
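As a sketch, again assuming a Person constructor for illustration:

String shared = "John";
Person p1 = new Person(shared, "Smith", 30);
Person p2 = new Person(shared, "Jones", 40);

// One String object, two references to it:
// p1.firstName == p2.firstName      is true
// p1.firstName.equals(p2.firstName) is true
// A deep-size count should include "John" once, not twice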

The storage needed would certainly grow as additional Person objects are added, but it might not be as simple as multiplying the size of one object and extrapolating linearly.

Why deserialized?

Why would we hold our data as a Java object in Hazelcast?

We have already suggested this is likely to use more memory than serialized. And if we need to retrieve it, we still need to serialize it to send it across the network to where it is needed.

The reason is compute. To access this data for server-side compute, we need it as Java. If it is stored serialized, we need to deserialize it before use. If we run compute frequently, we need to deserialize it frequently.

This then is a configuration choice. Serialized is better if we do mainly retrieval. Deserialized is better if we do mainly compute.

It is important to note that this is not a permanent decision. You may base your sizing on serialized, but later releases of your code may mandate a change to deserialized.

Now for serialized

There are several algorithms to select from for serialization, too many to discuss here.

We shall take IdentifiedDataSerializable as an example, as it is valid for all languages, not just Java.

Suppose a Person record has a first name of "John" (49 bytes as a Java String) and a last name of "Doe" (48 bytes), for a total deep size of 130 bytes or so. How is this serialized?

The main facet is that the receiver has to know what it is receiving in order to be able to treat it as anything other than a meaningless stream of bytes.

To do this, we code a mechanism for serialization that might look like this:

public void writeData(ObjectDataOutput out) throws IOException {
    // Write the fields in a fixed, known order
    out.writeUTF(this.firstName);
    out.writeUTF(this.lastName);
    out.writeInt(this.age);
}
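For completeness, a sketch of the rest of the class: the read side must mirror the write side, and the factory and class ids tell the receiver what it is receiving. The ids of 3 and 4 are illustrative (they reappear in the byte dump below), and method names vary slightly across Hazelcast versions; older releases use getId() rather than getClassId().

public void readData(ObjectDataInput in) throws IOException {
    // Read the fields in exactly the order writeData wrote them
    this.firstName = in.readUTF();
    this.lastName = in.readUTF();
    this.age = in.readInt();
}

public int getFactoryId() {
    return 3; // illustrative: identifies the factory that creates this class
}

public int getClassId() {
    return 4; // illustrative: identifies this class within that factory
}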

When serializing, Hazelcast will write 17 bytes of header, then call the writeData method to serialize the rest.

When writeUTF("John") is called, it outputs the length of the following string as four bytes, followed by "J", "o", "h" and "n". So this adds 8 bytes.

When writeUTF("Doe") is called, this adds another 7 bytes.

Finally, writeInt(this.age) adds 4 bytes.

So the grand total is 17 + 8 + 7 + 4 = 36 bytes. Much better than the 130 or so in Java.

For those curious, the actual bytes are:

0, 0, 0, 0, -1, -1, -1, -2, 1, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 4, 74, 111, 104, 110, 0, 0, 0, 3, 68, 111, 101, 0, 0, 0, 5

Those familiar with UTF codes can find "J" as hex 4A and "o" as hex 6F, and figure out the rest. Remember integer fields are 4 bytes.
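Laid out against the fields, a likely reading of those 36 bytes is:

0, 0, 0, 0          partition hash (4 bytes)
-1, -1, -1, -2      serializer type id of -2 (4 bytes)
1                   "identified" flag (1 byte)
0, 0, 0, 3          factory id of 3 (4 bytes)
0, 0, 0, 4          class id of 4 (4 bytes) -- 17 bytes of header in total
0, 0, 0, 4          length of "John" (4 bytes)
74, 111, 104, 110   "J", "o", "h", "n" (4 bytes)
0, 0, 0, 3          length of "Doe" (4 bytes)
68, 111, 101        "D", "o", "e" (3 bytes)
0, 0, 0, 5          age of 5 (4 bytes)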

Don’t forget querying

The talk here of serialization and deserialization focuses on storage size and transmission speed. Do not forget you may also access the data where it is stored.

If you query your data, you generally need to deserialize the whole object; with Portable, you deserialize only the fields you need.
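Portable can do this because each field is written against a field name, letting a query read a single field without deserializing the rest. A minimal sketch of the write side, with illustrative field names:

public void writePortable(PortableWriter writer) throws IOException {
    // Each field is named, so a query can read "age" alone
    writer.writeUTF("firstName", this.firstName);
    writer.writeUTF("lastName", this.lastName);
    writer.writeInt("age", this.age);
}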

Which is better?

Smaller is obviously better when it comes to storage.

Deserialized is better if compute is more significant than retrieval.

Smaller may not matter for network transmission. Data is moved in blocks, so a 1KB record and a 2KB record may take the same time to transfer if the network buffer is larger than either.

Summary

Serialized will probably use less memory than deserialized. If your programmers are clever and your data permits it, this might not be true.

Serialized and deserialized sizes can be worlds apart; don't use one to predict the size of the other.

Predictions on average lengths of fields such as names may also turn out to be wrong.

All of which suggests you should aim for approximate sizing, and use measurement as a safety net. Accurate sizing is very difficult.