Investigating Variable Length Messages¶

Fixed-length messages are the simplest form of message type in SBE. However, there are many cases where variable-length data fields need to be sent. To address this, we will create a new schema that includes a message type with two variable-length fields. We will then investigate how the generated Java code handles these fields.

This message is defined in the following XML schema, schema-02.xml:

schema-02.xml

<sbe:messageSchema xmlns:sbe="http://fixprotocol.io/2016/sbe"
                   package="com.shaunlaurens.pa.schema2"
                   id="1001"
                   version="1"
                   semanticVersion="pa0.1"
                   description="Schema 2 for the PA samples, version 0.1">
    <types>
        <composite name="messageHeader" 
                   description="Message identifiers and length of message root">
            <type name="blockLength" primitiveType="uint16"/>
            <type name="templateId" primitiveType="uint16"/>
            <type name="schemaId" primitiveType="uint16"/>
            <type name="version" primitiveType="uint16"/>
        </composite>
        <composite name="varStringEncoding"> <!-- (1) -->
            <type name="length" primitiveType="uint32" maxValue="1073741824"/>
            <type name="varData" primitiveType="uint8" length="0" 
                  characterEncoding="UTF-8"/>
        </composite>
    </types>

    <sbe:message name="MessageType2" id="1" 
                 description="A message with two var length fields">
        <field name="field1" id="1" type="int64"/>
        <data name="field2" id="2" type="varStringEncoding"/> <!-- (2) -->
        <data name="field3" id="3" type="varStringEncoding"/> <!-- (3) -->
    </sbe:message>

</sbe:messageSchema>

1. We have added a new composite type, varStringEncoding, which includes a length field and a varData field. The length field is a uint32 that holds the length of the varData field in bytes. The varData field is a variable-length sequence of uint8 values that stores the string using UTF-8 character encoding.
2. We have added a new field, field2, to the MessageType2 message type. This field uses the varStringEncoding composite type we defined.
3. We have added a new field, field3, to the MessageType2 message type. This field also uses the varStringEncoding composite type we defined.

When run through the SBE tool, this schema file will result in the following Java code being generated:

Generated Java Code

└── src
    └── main
        └── java
            └── com
                └── shaunlaurens
                    └── pa
                        └── schema2
                            ├── MessageHeaderDecoder.java
                            ├── MessageHeaderEncoder.java
                            ├── MessageType2Decoder.java
                            ├── MessageType2Encoder.java
                            ├── MetaAttribute.java
                            ├── VarStringEncodingDecoder.java
                            ├── VarStringEncodingEncoder.java
                            └── package-info.java

We will focus on the generated Message Type 2 encoder and decoder code. The VarStringEncodingEncoder and VarStringEncodingDecoder classes are also generated, but are unused. The other classes are much the same as before.

Message Type 2 Encoder and Decoder¶

Static header information¶

Both the encoder and decoder classes for the message type include fixed attributes for the header data we defined in the schema. Of particular interest is BLOCK_LENGTH, which now covers only the fixed-length field, field1.

public static final int BLOCK_LENGTH = 8;
public static final int TEMPLATE_ID = 1;
public static final int SCHEMA_ID = 1001;
public static final int SCHEMA_VERSION = 1;
public static final String SEMANTIC_VERSION = "pa0.1";

We can see:

BLOCK_LENGTH is actually the total length of the fixed length message fields in bytes.
In our case, the message type has 8 bytes for field1. Other fields are not counted.
TEMPLATE_ID is the unique identifier for the message type, as we defined in the schema.
SCHEMA_ID is the unique identifier for the schema, again, with the value set to what was provided in the schema.
SCHEMA_VERSION is the version of the schema.
SEMANTIC_VERSION is a human-readable version of the schema that we defined in the schema.
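To make the relationship between BLOCK_LENGTH and the total encoded size concrete, the following sketch adds up the byte counts implied by the schema above: four uint16 header fields, the 8-byte block, and a uint32 length prefix in front of each string. The class and method names are hypothetical and are not part of the generated code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not generated by the SBE tool.
final class EncodedLengthSketch
{
    static final int HEADER_LENGTH = 8;     // blockLength, templateId, schemaId, version (uint16 each)
    static final int BLOCK_LENGTH = 8;      // field1 (int64) only
    static final int VAR_HEADER_LENGTH = 4; // uint32 length prefix from varStringEncoding

    // Total bytes occupied by one MessageType2 message, header included.
    static int encodedLength(final String field2, final String field3)
    {
        final int field2Bytes = field2.getBytes(StandardCharsets.UTF_8).length;
        final int field3Bytes = field3.getBytes(StandardCharsets.UTF_8).length;

        return HEADER_LENGTH + BLOCK_LENGTH +
            VAR_HEADER_LENGTH + field2Bytes +
            VAR_HEADER_LENGTH + field3Bytes;
    }
}
```

With two empty strings the result is 24 bytes: the header, the block, and two length prefixes that each record a length of zero.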

Variable-length field decoding¶

Message Type 2 adds complexity with two variable-length string fields. One might expect the buffer content (excluding any header information) to look like this, with each block representing a byte:

Message Type 2 - Byte layout

We will correct this diagram later. Within the generated Java code, we can see that the MessageType2Encoder and MessageType2Decoder classes use fixed byte offsets for reading and writing the fixed data. This is the same as message type 1. We will focus on the variable-length fields in the MessageType2Encoder and MessageType2Decoder classes.

MessageType2Decoder.java - reading the strings

public String field2()
{
    final int headerLength = 4;
    final int limit = parentMessage.limit();
    final int dataLength = (int)(buffer.getInt(limit, BYTE_ORDER) 
        & 0xFFFF_FFFFL);
    parentMessage.limit(limit + headerLength + dataLength);

    if (0 == dataLength)
    {
        return "";
    }

    final byte[] tmp = new byte[dataLength];
    buffer.getBytes(limit + headerLength, tmp, 0, dataLength);

    return new String(tmp, java.nio.charset.StandardCharsets.UTF_8);
}
...
public String field3()
{
    final int headerLength = 4;
    final int limit = parentMessage.limit();
    final int dataLength = (int)(buffer.getInt(limit, BYTE_ORDER) 
        & 0xFFFF_FFFFL);
    parentMessage.limit(limit + headerLength + dataLength);

    if (0 == dataLength)
    {
        return "";
    }

    final byte[] tmp = new byte[dataLength];
    buffer.getBytes(limit + headerLength, tmp, 0, dataLength);

    return new String(tmp, java.nio.charset.StandardCharsets.UTF_8);
}

Some things of interest in these two methods:

There are no more fixed offsets for reading the data. Instead, internal state called limit tracks the current position in the buffer.
Each method first reads the length of the data from the buffer, then reads the data itself.
The parentMessage holds the limit, and each read advances it past the length prefix and the data just consumed.
Apart from the state held in the limit, the reads of field2 and field3 are identical. The decoder therefore has no way to tell them apart unless they are read in the order they were written.
Given the identical nature of the reads, it seems safe to assume that out-of-order reads will result in invalid data.
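The limit mechanism described above can be imitated with a plain java.nio.ByteBuffer. The sketch below is a hypothetical stand-in for the generated decoder, not SBE code: LimitCursorReaderSketch, readAtLimit and lengthPrefixed are invented names, and little-endian byte order is assumed because that is the SBE default when the schema does not specify one.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the generated decoder; for illustration only.
final class LimitCursorReaderSketch
{
    private static final int HEADER_LENGTH = 4; // uint32 length prefix

    private final ByteBuffer buffer;
    private int limit; // position of the next unread variable-length entry

    LimitCursorReaderSketch(final ByteBuffer source, final int initialLimit)
    {
        // duplicate() shares the content but has its own byte order setting.
        this.buffer = source.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        this.limit = initialLimit;
    }

    // Both accessors do exactly the same work; only the shared limit differs.
    String field2() { return readAtLimit(); }
    String field3() { return readAtLimit(); }

    private String readAtLimit()
    {
        final int dataLength = buffer.getInt(limit);
        final byte[] tmp = new byte[dataLength];
        for (int i = 0; i < dataLength; i++)
        {
            tmp[i] = buffer.get(limit + HEADER_LENGTH + i);
        }
        limit += HEADER_LENGTH + dataLength; // advance past prefix and data
        return new String(tmp, StandardCharsets.UTF_8);
    }

    // Builds a buffer of length-prefixed UTF-8 strings in the given order.
    static ByteBuffer lengthPrefixed(final String... values)
    {
        final ByteBuffer out = ByteBuffer.allocate(256).order(ByteOrder.LITTLE_ENDIAN);
        for (final String value : values)
        {
            final byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            out.putInt(bytes.length).put(bytes);
        }
        return out;
    }
}
```

Because both accessors consume whatever entry the limit currently points at, calling field3() first on a buffer built with lengthPrefixed("two", "three") returns "two".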

Out of order reads after correct order writes¶

Let's try a quick experiment to see what happens when we read the fields out of order:

private static final UnsafeBuffer BUFFER =
    new UnsafeBuffer(ByteBuffer.allocate(256));
private static final MessageType2Decoder MESSAGE_TYPE_2_DECODER =
    new MessageType2Decoder();
private static final MessageType2Encoder MESSAGE_TYPE_2_ENCODER =
    new MessageType2Encoder();
private static final MessageHeaderDecoder MESSAGE_HEADER_DECODER =
    new MessageHeaderDecoder();
private static final MessageHeaderEncoder MESSAGE_HEADER_ENCODER =
    new MessageHeaderEncoder();

@Test
public void testEncodingDecodingWrongOrderRead()
{
    final int bufferOffset = 0;

    MESSAGE_TYPE_2_ENCODER.wrapAndApplyHeader(BUFFER, bufferOffset, 
            MESSAGE_HEADER_ENCODER)
        .field1(1234L)
        .field2("this is string field 2") // (1)
        .field3("this is string field 3"); // (2)

    MESSAGE_TYPE_2_DECODER.wrapAndApplyHeader(BUFFER, bufferOffset,
        MESSAGE_HEADER_DECODER);
    final long field1 = MESSAGE_TYPE_2_DECODER.field1();
    final String field3 = MESSAGE_TYPE_2_DECODER.field3(); // (3)
    final String field2 = MESSAGE_TYPE_2_DECODER.field2(); // (4)

    assertEquals(1234L, field1);
    assertEquals("this is string field 3", field2); // INVALID (5)
    assertEquals("this is string field 2", field3); // INVALID (6)
}

1. We write field 2 first, in schema order, with the string 'this is string field 2'.
2. We write field 3 second, in schema order, with the string 'this is string field 3'.
3. We read field 3 before field 2. The limit still points at the first variable-length entry, so this call returns field 2's data.
4. We read field 2 after field 3. The limit now points at the second entry, so this call returns field 3's data.
5. Field 2 is read as field 3's value, so the assert passes.
6. Field 3 is read as field 2's value, so the assert passes.

The above test passes despite the asserts marked 5 and 6 checking swapped values. Each accessor reads whatever entry the limit currently points at, so the field3() call in 3 consumes field 2's data and leaves the limit at field 3's data for the field2() call that follows.

Variable-length field encoding¶

MessageType2Encoder.java - writing the strings

public MessageType2Encoder field2(final String value)
{
    final byte[] bytes = (null == value || value.isEmpty()) ? 
        org.agrona.collections.ArrayUtil.EMPTY_BYTE_ARRAY : 
        value.getBytes(java.nio.charset.StandardCharsets.UTF_8);

    final int length = bytes.length;
    if (length > 1073741824)
    {
        throw new IllegalStateException("length > maxValue for type: " + length);
    }

    final int headerLength = 4;
    final int limit = parentMessage.limit();
    parentMessage.limit(limit + headerLength + length);
    buffer.putInt(limit, length, BYTE_ORDER);
    buffer.putBytes(limit + headerLength, bytes, 0, length);

    return this;
}
...
public MessageType2Encoder field3(final String value)
{
    final byte[] bytes = (null == value || value.isEmpty()) ? 
        org.agrona.collections.ArrayUtil.EMPTY_BYTE_ARRAY : 
        value.getBytes(java.nio.charset.StandardCharsets.UTF_8);

    final int length = bytes.length;
    if (length > 1073741824)
    {
        throw new IllegalStateException("length > maxValue for type: " + length);
    }

    final int headerLength = 4;
    final int limit = parentMessage.limit();
    parentMessage.limit(limit + headerLength + length);
    buffer.putInt(limit, length, BYTE_ORDER);
    buffer.putBytes(limit + headerLength, bytes, 0, length);

    return this;
}

In much the same way as the decoder, the encoder uses the limit held by the parentMessage to find the position at which to write, and advances it past the data just written. Writing the data length followed by the data itself is the same for both fields.
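The write side can be sketched in the same spirit with a plain java.nio.ByteBuffer. This is an illustration under assumptions rather than the generated encoder: LimitCursorWriterSketch and putString are hypothetical names, little-endian byte order is assumed, and the maxValue check from the generated code is left out for brevity.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the generated encoder; for illustration only.
final class LimitCursorWriterSketch
{
    private static final int HEADER_LENGTH = 4; // uint32 length prefix

    private final ByteBuffer buffer;
    private int limit; // position at which the next entry will be written

    LimitCursorWriterSketch(final ByteBuffer target, final int initialLimit)
    {
        // duplicate() shares the content, so writes are visible to the caller.
        this.buffer = target.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        this.limit = initialLimit;
    }

    // Writes the length prefix, then the bytes, then advances the limit.
    LimitCursorWriterSketch putString(final String value)
    {
        final byte[] bytes = (null == value) ?
            new byte[0] : value.getBytes(StandardCharsets.UTF_8);

        buffer.putInt(limit, bytes.length);
        for (int i = 0; i < bytes.length; i++)
        {
            buffer.put(limit + HEADER_LENGTH + i, bytes[i]);
        }
        limit += HEADER_LENGTH + bytes.length;

        return this;
    }

    int limit()
    {
        return limit;
    }
}
```

Nothing in putString records which field the value belongs to. The position of each entry in the buffer is decided purely by the order of the calls.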

Out of order writes, correct order reads¶

@Test
public void testEncodingDecodingWrongOrderWrite()
{
    final int bufferOffset = 0;

    MESSAGE_TYPE_2_ENCODER.wrapAndApplyHeader(BUFFER, bufferOffset,
            MESSAGE_HEADER_ENCODER)
        .field1(1234L)
        .field3("this is field three") // WRONG ORDER (1)
        .field2("this is field two"); // WRONG ORDER (2)

    MESSAGE_TYPE_2_DECODER.wrapAndApplyHeader(BUFFER, bufferOffset,
        MESSAGE_HEADER_DECODER);
    final long field1 = MESSAGE_TYPE_2_DECODER.field1();
    final String field2 = MESSAGE_TYPE_2_DECODER.field2(); // CORRECT ORDER (3)
    final String field3 = MESSAGE_TYPE_2_DECODER.field3(); // CORRECT ORDER (4)

    assertEquals(1234L, field1);
    assertEquals("this is field three", field2); // INVALID (5)
    assertEquals("this is field two", field3); // INVALID (6)
}

1. We write the fields in the wrong order. The string 'this is field three' is written through field 3 first, so it lands in the first variable-length position, where a reader expects field 2.
2. We continue to write the fields in the wrong order. The string 'this is field two' is written through field 2 after field 3, so it lands in the second position.
3. We read field 2 before field 3, and the decoder's limit is correctly set.
4. We read field 3 after field 2.
5. Field 2 is incorrectly read as field 3's value, so the assert passes.
6. Field 3 is incorrectly read as field 2's value, so the assert passes.

The above test passes despite the asserts marked 5 and 6 checking swapped values. This is because the first write places field 3's data at the position where the decoder expects field 2. The decoder behaves correctly; it is the buffer data that is invalid.

Out of order reads and writes¶

If you both write and read the fields in the wrong order, the test passes with the correct data in the correct fields, even though the buffer holds the fields in the wrong order.

@Test
public void testEncodingDecodingWrongOrderWriteAndRead()
{
    final int bufferOffset = 0;

    MESSAGE_TYPE_2_ENCODER.wrapAndApplyHeader(BUFFER, bufferOffset,
            MESSAGE_HEADER_ENCODER)
        .field1(1234L)
        .field3("this is field three") // WRONG ORDER
        .field2("this is field two"); // WRONG ORDER

    MESSAGE_TYPE_2_DECODER.wrapAndApplyHeader(BUFFER, bufferOffset,
        MESSAGE_HEADER_DECODER);
    final long field1 = MESSAGE_TYPE_2_DECODER.field1();
    final String field3 = MESSAGE_TYPE_2_DECODER.field3(); // WRONG ORDER
    final String field2 = MESSAGE_TYPE_2_DECODER.field2(); // WRONG ORDER

    assertEquals(1234L, field1);
    assertEquals("this is field two", field2); // CORRECT DATA
    assertEquals("this is field three", field3); // CORRECT DATA
}

The above test passes, though the buffer data is not in the expected order. The corrected buffer diagram for this message type therefore depends on the order of writes:

Message Type 2 - Byte layout
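As a textual companion to the diagram, the sketch below computes where each variable-length entry lands for a given write order. It assumes the message starts at buffer offset 0, so the first entry begins after the 8-byte header and the 8-byte block. VarDataLayoutSketch and layout are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not generated by the SBE tool.
final class VarDataLayoutSketch
{
    static final int FIRST_VAR_OFFSET = 8 + 8;  // message header + BLOCK_LENGTH
    static final int VAR_HEADER_LENGTH = 4;     // uint32 length prefix

    // Describes the byte positions of each entry, in the order it was written.
    static List<String> layout(final String... valuesInWriteOrder)
    {
        final List<String> regions = new ArrayList<>();
        int limit = FIRST_VAR_OFFSET;

        for (final String value : valuesInWriteOrder)
        {
            final int dataLength = value.getBytes(StandardCharsets.UTF_8).length;
            final int dataStart = limit + VAR_HEADER_LENGTH;
            final int dataEnd = dataStart + dataLength; // exclusive

            regions.add("length@" + limit + " data@" + dataStart + ".." + dataEnd +
                " '" + value + "'");
            limit = dataEnd;
        }

        return regions;
    }
}
```

For the write order used in the last two tests, layout("this is field three", "this is field two") places the 19-byte field 3 string at bytes 20 to 39 and the 17-byte field 2 string at bytes 43 to 60, which is the reverse of the schema order.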

Recommendations¶

Messages composed solely of fixed-length fields can be read and written in any order, but messages containing multiple variable-length fields cannot. To avoid depending on this difference, always read and write fields in the order defined in the schema, regardless of field type.

To enforce the order in which fields are written and read, you can use the precedence checks feature of the SBE tool. This helps prevent subtle bugs that can be challenging to identify. Enable precedence checks by setting the system property sbe.generate.precedence.checks=true when running the SBE tool.

All the above examples involving an invalid order of reads or writes will fail if precedence checks are enabled. An IllegalStateException is raised with the message “Illegal field access order” and includes information about the offending field access. For more details, refer to the SBE Wiki entry on Safe Flyweight Usage.
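To give a feel for what such a check does, here is a minimal sketch of an access-order guard for the two variable-length fields. It is not the code the SBE tool generates: the class, state and method names are all hypothetical, and only the exception type and the leading message text follow the description above.

```java
// Hypothetical access-order guard; the real generated checks differ.
final class OrderGuardSketch
{
    private enum State { WRAPPED, FIELD2_DONE, FIELD3_DONE }

    private State state = State.WRAPPED;

    // field2 may only be accessed first, directly after wrapping.
    void onField2()
    {
        if (state != State.WRAPPED)
        {
            throw new IllegalStateException(
                "Illegal field access order: field2 accessed in state " + state);
        }
        state = State.FIELD2_DONE;
    }

    // field3 may only be accessed once field2 has been accessed.
    void onField3()
    {
        if (state != State.FIELD2_DONE)
        {
            throw new IllegalStateException(
                "Illegal field access order: field3 accessed in state " + state);
        }
        state = State.FIELD3_DONE;
    }
}
```

An encoder or decoder wired to a guard like this would call onField2() at the top of field2() and onField3() at the top of field3(), so that reading or writing field 3 first fails immediately instead of silently swapping the data.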