Integers of unknown sizes.
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
ltdk e223eba392 Committing stuff 1 month ago
src Committing stuff 1 month ago
.editorconfig Encoding? 5 months ago
.gitignore Initial commit 5 months ago
CHANGELOG.md Initial commit 5 months ago
Cargo.toml Slowly decoding 5 months ago
LICENSE.md Initial commit 5 months ago
Makefile Initial commit 5 months ago
README.md Encoding? 5 months ago
rust-toolchain.toml Encoding? 5 months ago
rustfmt.toml Initial commit 5 months ago

README.md

ious

Integers of unknown sizes.

Summary

An IOUS (Integer Of Unknown Size) is a variable-length integer which is encoded as part of a data stream. This is contrasted with an IOKS (Integer Of Known Size), which has a known size and used to represent the actual integer stored within an IOUS. The plural forms of IOUS and IOKS are IOUSes and IOKSes, and the names are inspired from the Rodents Of Unusual Size (ROUSes) from The Princess Bride.

While the IOUS format is generic over all sorts of parameters, it's designed so that any one implementation of it will be very easy to implement and very fast to both encode and decode. Instead of using continuation markers like most variable-length formats, it stores all length information at the beginning of the stream, often entirely avoiding the need for branching at all.

Conceptually, an IOUS indicates its length with the number of disabled bits at the beginning of the stream. In the most common case, with an IOUS formed of 8-bit units capped at 9 units of length + data, any integer between 8 and 128 bits can be decoded using four instructions:

  1. Count the leading zeroes of the first byte. (the length)
  2. Shift the first byte left by the length.
  3. Shift the first byte right by the length.
  4. Reverse the order of the bytes. (on little-endian architectures)

Although this simple case is relatively straightforward, code that operates under a completely generic form of IOUSes can be much more complicated, although any individual instance of the format can be similarly optimised to a very fast and simple set of instructions.

The IOKS format exists solely as a means to describe the endianness of an IOUS after decoding or before encoding. With byte-sized units, it's equivalent to a big-endian representation, but the possibility for alternatively sized units makes the endianness trickier to explain, and simpler to just describe as a separate format.

Units

IOUSes and IOKSes are comprised of multiple units, which represent standard fixed-length integer types. The signedness of the units extends to the signedness of the integer data itself, meaning that signed units can be used to represent signed integers and unsigned units can be used to represent unsigned integers.

IOKSes are always stored in "big-endian" format with the most significant unit first, although the endianness within units themselves is left unspecified. Unit conversion is outside the scope of this spec.

Since IOUSes are intended to be represented on their own without any additional data, they must contain at least one unit of data even if that data is an unsigned zero. Technically, unsigned IOKSes can unambiguously represent an integer zero with zero units, although the spec will assume that IOKSes have at least one unit for simplicity.

Length bits

IOUSes are designed to be very similar to variable-length-quantities (VLQs), with the difference that all length information is stored at the beginning of the IOUS. Conversion from an IOUS to an IOKS thus simply requires replacing the length information with zeroes or sign-extended bits depending on the units used.

IOUSes also have an upper bound on the number of bits that can be used to represent the length, called the ceiling, which both bounds the total size of an IOUS in units and allows representing a maximum-length IOUS with one fewer bit than otherwise required. This is specific to IOUSes, and does not affect IOKSes, since they contain no length information.

The capacity of the IOUS is the number of bits of data that can be stored at its maximum length. The ceiling plus the capacity is the maximum length of the IOUS.

Layout

To describe the layout of an IOUS, we need two parameters: a unit type and a ceiling size. To make describing things easier, we will represent the unit type by its total length in bits (unit bits) and its signedness (signed or unsigned).

Although bits may internally be stored differently for units, we will describe the format as if the bits within individual units are stored in order of descending significance, even though this may not be the case. In practice, machine instructions will allow operating on units as if they actually are stored this way, e.g. via bit-shifting, which makes this assumption reasonable.

IOUSes are split into two parts: length bits and data bits, in that order. The total number of bits must add up to a multiple of the number of unit bits, since we will be storing the IOUS as a sequence of units.

The length bits will always start with some number of disabled bits (that number can be zero). If this number of bits is less than the ceiling, it must be followed by an enabled terminating bit to indicate the end of the length. An IOUS is considered invalid if it does not have valid length bits, which can only happen if it has zero units, or is exclusively units filled with disabled bits that add up to less than the ceiling; the latter can occur if the ceiling is larger than the unit bits times the number of units in the IOUS.

The portion of the length bits without the terminating zero is called the length data.

We round up the number of bits in the length bits to a multiple of the unit bits to determine a whole number of length units, which may contain some non-length bits. We call these bits extra data bits since they are not represented in the length data, and merely add to the total number of data bits available. If there are a nonzero number of extra data bits, then the unit containing those bits is called the extra data unit.

Immediately following the length units is some number of data units, equal to the number of bits in the length data. The extra data bits, followed by the data units, constitutes the data bits. If the IOUS does not actually contain enough units to account for the extra units, it is considered invalid.

Conversion

Converting from an IOUS to an IOKS requires excluding the length units and then overwriting the length bits in the extra data unit with either zeroes or a sign extension depending on the unit type. To convert back from an IOKS to an IOUS, however, you will need the space for the length units, so that you have room to write the length bits.

Loading an IOKS into a system register simply requires padding the IOKS out to the necessary length, then converting the order of the units to the native endianness.

Creating an IOKS requires determining the number of units required, then converting the native endianness back to big-endian units. Once you have an IOKS, you can then mask in the length bits to create an IOUS.

As mentioned in the summary, one benefit of IOUSes is the ability to write fast and simple code for specific cases, and as such, you may not want to use the fully generic version in everyday cases. However, the full pseudocode for the generic conversions is left below for reference.

IOUS to IOKS

  1. If the buffer is empty, terminate with error.
  2. Set data-units to zero.
  3. Set length-units to zero.
  4. Loop the following:
    1. Set temp-unit to the unit at index length-units.
    2. Set temp-bits to the number of leading disabled bits in temp-unit.
    3. Add temp-bits to data-units.
    4. Increment length-units.
    5. If data-units is greater than the ceiling:
      1. Set data-units to the ceiling.
      2. Set temp-bits to the ceiling modulo the unit bits.
      3. If temp-bits is zero, set extra-data-bits to zero. Otherwise, Set extra-data-bits to the unit bits minus temp-bits.
      4. Terminate the loop.
    6. If temp-bits is less than the unit bits:
      1. Set extra-data-bits to the unit bits minus temp-bits minus one. The minus one is the terminating bit.
      2. Terminate the loop.
    7. If length-units is not less than the length of the buffer, terminate with error.
  5. Set past-end to length-units plus data-units.
  6. If the length of the buffer is less than past-end, terminate with error.
  7. If extra-data-bits is nonzero:
    1. Decrement length-units.
    2. Set temp-bits to the unit bits.
    3. Subtract extra-data-bits from temp-bits.
    4. Set temp-unit to the unit at index length-units.
    5. Left-shift temp-unit by temp-bits.
    6. Right-shift temp-unit by temp-bits. This will be a logical shift for unsigned units and an arithmetic shift for signed units.
  8. The IOKS begins at index length-units and terminates at index past-end.
  9. The IOUS terminates immediately after the IOKS.

IOKS to IOUS

Preparation

When converting an IOKS to an IOUS, you'll want to ensure that the IOKS is as small as possible.

For unsigned integers, this means that you'll want to remove all leading zero units and mark down the number of leading disabled bits as the "trimmable" bits.

For positive signed integers, you can perform the same conversion as unsigned integers, with one caveat: you can trim one fewer bits than you would otherwise, as this has to be retained as the sign bit.

For negative signed integers, you'll want to remove leading negative-one units and mark down the number of leading enabled bits, also being able to trim one fewer bits.

Conversion

  1. Set trimmable-bits to the number of trimmable bits. This quantity is described in the previous section.
  2. Set length-bits to the number of units in the IOKS.
  3. Set length-units to length-bits divided by the unit bits, rounding down.
  4. Set extra-data-bits to negative length-bits Euclidean-modulo the unit bits.
  5. If extra-data-bits is greater than the trimmable bits:
    1. Set data-unit to the first unit in the IOKS.
    2. Right-shift temp-unit by the unit bits minus one. This will be a logical shift for unsigned units and an arithmetic shift for signed units.
    3. Insert data-unit at the beginning of the IOKS.
  6. Set length-unit to negative one.
  7. Left-shift length-unit by extra-data-bits minus one.
  8. Set data-unit to the first unit in the IOKS.
  9. If data-unit is negative:
    1. And the bits of data-unit with length-unit.
  10. Left-shift length-unit by one.
  11. Or the bits of data-unit with length-unit.
  12. Replace the first unit of the IOKS with data-unit.
  13. Insert length-units negative-one units at the beginning of the IOKS.

License

Released to the public domain to the greatest extent possible via the CC0 License.