stonks

Time series storage format.

License

Available via the Anti-Capitalist Software License for individuals, non-profit organisations, and worker-owned businesses.

Planning

The general idea is to replace RRD with something less horrendous that can be applied to different situations.

Protocol

Stonks is actually split into two protocols: the standard binary protocol, and the more readable string protocol, with the ability to losslessly convert between them. The protocol always starts with a 🕴📈↵ message, which is 1f574fe0 f1f4c80a for the binary version and STONKME\n for the string version.
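As a minimal sketch of how a server might tell the two protocols apart, the greeting bytes quoted above can be compared against the start of the connection. The names `Mode`, `detect_mode`, `BINARY_GREETING` and `STRING_GREETING` are hypothetical and not part of stonks itself.

```rust
/// Binary greeting, exactly as given in the text: 1f574fe0 f1f4c80a.
pub const BINARY_GREETING: [u8; 8] = [0x1f, 0x57, 0x4f, 0xe0, 0xf1, 0xf4, 0xc8, 0x0a];

/// String greeting, exactly as given in the text: STONKME followed by a newline.
pub const STRING_GREETING: &[u8] = b"STONKME\n";

/// Which of the two protocols a peer is speaking (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Binary,
    String,
}

/// Inspect the first bytes of a connection and report the protocol in use,
/// or `None` if neither greeting is present.
pub fn detect_mode(input: &[u8]) -> Option<Mode> {
    if input.starts_with(&BINARY_GREETING) {
        Some(Mode::Binary)
    } else if input.starts_with(STRING_GREETING) {
        Some(Mode::String)
    } else {
        None
    }
}
```

Because the two greetings differ in their first byte, a single prefix check is enough to choose a parser for the rest of the connection.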

Selecting a database

By default, all stonks data is relative to a stonks database. The first message in a stonks protocol selects the database. In binary, the database is announced via a length, followed by that many bytes of identifier. If an application only acts on a single database, this must be a single zero byte, indicating an empty identifier. Lengths between 0 and 127 can be sent as a single byte; if a longer length is needed, the highest bit of the first byte is set to indicate that the length takes two bytes, resulting in a maximum size of around 32k.
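The length prefix described above could be encoded roughly as follows. The text does not say how the two-byte form is laid out, so this sketch assumes the remaining 15 bits are stored big-endian (high bits in the first byte); that layout, and the names `encode_identifier`, `decode_identifier` and `MAX_IDENTIFIER_LEN`, are assumptions made for illustration.

```rust
/// Largest length that fits in the 15 bits left over after the flag bit.
pub const MAX_IDENTIFIER_LEN: usize = 0x7fff;

/// Encode an identifier as a length prefix followed by its bytes.
/// Returns `None` if the identifier is too long to represent.
pub fn encode_identifier(id: &[u8]) -> Option<Vec<u8>> {
    let len = id.len();
    if len > MAX_IDENTIFIER_LEN {
        return None;
    }
    let mut out = Vec::with_capacity(len + 2);
    if len <= 0x7f {
        // Short form: a single byte with the highest bit clear.
        out.push(len as u8);
    } else {
        // ASSUMPTION: long form is big-endian with the highest bit set as a flag.
        out.push(0x80 | (len >> 8) as u8);
        out.push((len & 0xff) as u8);
    }
    out.extend_from_slice(id);
    Some(out)
}

/// Decode one identifier from the front of `input`, returning the identifier
/// and the bytes that follow it, or `None` if the input is truncated.
pub fn decode_identifier(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let (&first, rest) = input.split_first()?;
    let (len, rest) = if first & 0x80 == 0 {
        (first as usize, rest)
    } else {
        let (&second, rest) = rest.split_first()?;
        ((((first & 0x7f) as usize) << 8) | second as usize, rest)
    };
    if rest.len() < len {
        return None;
    }
    Some(rest.split_at(len))
}
```

Under this assumption an empty identifier is the single byte `00`, and a 300-byte identifier is prefixed with `81 2c`.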

For the string version, the identifier is given directly as a string, followed by a new line. New lines can be escaped in the identifier by preceding them with a \, and backslashes can be escaped as \\. A single backslash not preceding a new line is invalid. To aid readability of the messages, a prefixed DB\s* will be ignored by the server, and the string NODB will be interpreted as an empty identifier.
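The escaping rules for string-mode identifiers can be sketched as a pair of functions. This is an illustration only: the function names are hypothetical, the input to `unescape_identifier` is assumed to be a single message with its terminating newline already removed, and the optional DB prefix and NODB shorthand are not handled here.

```rust
/// Escape an identifier for the string protocol by putting a backslash
/// in front of every newline and every backslash.
pub fn escape_identifier(id: &str) -> String {
    let mut out = String::with_capacity(id.len());
    for c in id.chars() {
        if c == '\\' || c == '\n' {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Reverse `escape_identifier`. Returns `None` for a backslash that is not
/// followed by a newline or another backslash, which the text calls invalid.
pub fn unescape_identifier(escaped: &str) -> Option<String> {
    let mut out = String::with_capacity(escaped.len());
    let mut chars = escaped.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some(next @ ('\\' | '\n')) => out.push(next),
                _ => return None,
            }
        } else {
            out.push(c);
        }
    }
    Some(out)
}
```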

After a database selection message is sent, if the database is valid, stonks will send 🆗↵ (1f197a for binary, OK\n for string) to indicate that the database is selected. If any message is deemed invalid before this point is reached, it will send 🆖↵ (1f196a for binary, NG\n for string) and close the connection. If the server wishes to defer database validation to after authentication, it can send 🆔↵ (1f194a for binary, ID\n for string) and then provide the validation message after authentication success.
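The three status replies above map naturally onto a small enum. The sketch below uses only the byte and string values quoted in the text; the type and method names (`Status`, `binary`, `string`, `from_binary`) are hypothetical.

```rust
/// The three status replies described in the text (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    /// Database selected, or intent to proceed.
    Ok,
    /// Message deemed invalid; the connection is closed afterwards.
    NoGood,
    /// Database validation is deferred until after authentication.
    Identify,
}

impl Status {
    /// Binary encoding, as quoted in the text.
    pub fn binary(self) -> [u8; 3] {
        match self {
            Status::Ok => [0x1f, 0x19, 0x7a],
            Status::NoGood => [0x1f, 0x19, 0x6a],
            Status::Identify => [0x1f, 0x19, 0x4a],
        }
    }

    /// String encoding, including the terminating newline.
    pub fn string(self) -> &'static str {
        match self {
            Status::Ok => "OK\n",
            Status::NoGood => "NG\n",
            Status::Identify => "ID\n",
        }
    }

    /// Parse a binary status reply, if it is one of the three known values.
    pub fn from_binary(bytes: [u8; 3]) -> Option<Status> {
        match bytes {
            [0x1f, 0x19, 0x7a] => Some(Status::Ok),
            [0x1f, 0x19, 0x6a] => Some(Status::NoGood),
            [0x1f, 0x19, 0x4a] => Some(Status::Identify),
            _ => None,
        }
    }
}
```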

Authenticating a database

Authentication is agnostic to the protocol, but will always include two message exchanges to allow usage of PAKE protocols. Each message is sent in the same way as the database name. A server that does not require authentication can simply expect an empty identifier, send an empty identifier, expect a second empty identifier, send a second empty identifier, and then continue as usual.

Again, to aid readability for the string version, you can start the first auth message with USER\s* and the second with PASS\s*, also allowing for NOUSER and NOPASS as empty identifiers. The server responses may also start with ID\s* and SESS\s*, allowing for NOID and NOSESS to indicate empty identifiers.

Instructions

After authentication success, the stonks server will accept a series of instructions. The basic flow for these instructions is:

  1. Client sends instruction code
  2. Server sends version code
  3. Client sends intent to proceed, or intent to terminate
  4. If the client wishes to proceed, they send the full request data
  5. If the client sent the full request data, server sends output of instruction

In binary mode, instruction codes are three bytes indicating the type of instruction, followed by a zero byte. All bytes of the instruction must be nonzero and the last four bits of the instruction must be hexadecimal a, or 1010. Additional zero bytes sent not immediately after a valid instruction code will reset any existing instruction data and may be sent to keep the connection alive. The server must respond to any sequence of zero bytes not preceded by an instruction code with an equal number of zero bytes to indicate that it is still present.
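A check for the binary instruction-code rules might look like the sketch below. It assumes the four bytes (three code bytes plus the zero terminator) have already been read from the connection; the name `parse_instruction_code` is hypothetical, and keep-alive handling is left out.

```rust
/// Validate a binary instruction frame: three nonzero bytes whose final
/// four bits are 0xa, followed by a single zero byte.
/// Returns the three code bytes on success.
pub fn parse_instruction_code(frame: [u8; 4]) -> Option<[u8; 3]> {
    let [a, b, c, terminator] = frame;
    let all_nonzero = a != 0 && b != 0 && c != 0;
    let ends_in_a = (c & 0x0f) == 0x0a;
    if all_nonzero && ends_in_a && terminator == 0 {
        Some([a, b, c])
    } else {
        None
    }
}
```

For example, the frame `1f 34 3a 00` passes the check, while `1f 34 3b 00` fails because its last four bits are not `a`.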

In string mode, instruction codes are strings of capital ASCII letters and/or numbers indicating the type of instruction, followed by a newline. Empty instruction codes may be sent to keep the connection alive without running any instructions, and the special instruction BOOP will also be interpreted as a keep-alive message. The server should respond BOP to any empty instructions to indicate that it is still present.
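In string mode, a received line can be classified along the same lines. This sketch assumes the terminating newline has already been stripped; `Line` and `classify_line` are hypothetical names.

```rust
/// What a line received in string mode represents (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Line<'a> {
    /// An empty line or BOOP; the server should answer BOP.
    KeepAlive,
    /// A well-formed instruction code.
    Instruction(&'a str),
}

/// Classify one line (without its newline). Returns `None` if the line
/// contains anything other than capital ASCII letters and digits.
pub fn classify_line(line: &str) -> Option<Line<'_>> {
    if line.is_empty() || line == "BOOP" {
        return Some(Line::KeepAlive);
    }
    let well_formed = line
        .bytes()
        .all(|b| b.is_ascii_uppercase() || b.is_ascii_digit());
    if well_formed {
        Some(Line::Instruction(line))
    } else {
        None
    }
}
```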

Once the server receives a full instruction code, it should respond with the instruction code followed by a version code. The version code takes the same format as a database name. An empty version code indicates that the server refuses to run a given instruction, either because it does not recognise it or because the authenticated user is not allowed to run it. The meaning of the version code is intentionally dependent on the instruction to allow for extensibility, although versions should not include any authentication errors if the user is not permitted to run a recognised instruction; this should be accomplished by separate instructions which explain user permissions.

In binary mode, the instruction code and version code are sent without any data between them. In string mode, the server may prefix the instruction code with ACK\s* and must follow it with a newline, and may prefix a nonempty version code with VER\s* or provide REFUSE for an empty version code, terminating both with a newline.

After the version is received, the client should respond 🆗↵ to indicate it wishes to proceed with the request or 🆖↵ if it does not. If it wishes to proceed, it then sends the instruction data, which should be given with the usual identifier format. The server then gives its response as an identifier. The client does not have to respond with intent if the server provides an empty version code, but it may still do so to be polite.

Terminating connections

The client and server may terminate connections instead of sending new instructions, although either party can also send a 🍃↵ (1f343a in binary, BYE\n for string) instruction to signal intent to close the connection before doing so. In general, a good server implementation will try to inform the user that it is shutting down.

Data model

A stonks database is separated into two parts: metrics and labels. Generally speaking, metrics are time series data stored as integers, and labels are strings that map to integers that may be used by metrics.

Labels

Stonks labels are used to give understandable names to otherwise nameless integer IDs. The labels for a stonks database are conceptually stored as a list of lists of strings; the list of lists is the label database, each list within the label database is a label index, and each string within that index is a label. The position of each index within the database is known as the index ID, and the position of a string within an index is known as the label ID. Index IDs are guaranteed to remain unchanged, although label IDs may change over time to allow for storage optimisations. In general, label IDs are guaranteed to remain unchanged between instructions for a given connection, and if a server wishes to change label IDs it must first close any existing connections. Each index will also have a small amount of associated metadata that will be explained later.

Generally, label storage is relatively opaque to the user, and most of the time, users will be able to send and receive strings directly and have them converted to and from label IDs or index IDs automatically. The main reason for keeping labels separate is to allow for reuse of various integer operations (e.g. find the most common value in a dataset) on string data.
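To illustrate why keeping labels as integers is useful, here is a minimal in-memory sketch of a single label index together with an integer-only "most common value" operation. Everything here is hypothetical: the `LabelIndex` type, the choice of `u128` for IDs, and in particular the assumption that label IDs start at 1 so that zero can stand for a missing label.

```rust
use std::collections::HashMap;

/// One label index: a list of strings addressed by label ID (hypothetical type).
#[derive(Default)]
pub struct LabelIndex {
    labels: Vec<String>,
    ids: HashMap<String, u128>,
}

impl LabelIndex {
    /// Return the ID of `label`, assigning the next free ID if it is new.
    /// ASSUMPTION: IDs start at 1, leaving 0 free to mean "missing".
    pub fn intern(&mut self, label: &str) -> u128 {
        if let Some(&id) = self.ids.get(label) {
            return id;
        }
        self.labels.push(label.to_owned());
        let id = self.labels.len() as u128;
        self.ids.insert(label.to_owned(), id);
        id
    }

    /// Look up the string for a label ID, if it exists.
    pub fn resolve(&self, id: u128) -> Option<&str> {
        if id == 0 || id > self.labels.len() as u128 {
            return None;
        }
        Some(self.labels[(id - 1) as usize].as_str())
    }
}

/// An integer-only operation: the most frequent value in a dataset.
/// Ties are broken arbitrarily.
pub fn most_common(values: &[u128]) -> Option<u128> {
    let mut counts: HashMap<u128, usize> = HashMap::new();
    for &value in values {
        *counts.entry(value).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, n)| n).map(|(value, _)| value)
}
```

The same `most_common` function works on plain integer metrics and on label IDs; only the final `resolve` step turns the result back into a string.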

The first two label indices have special meaning in stonks -- the first index is the label index and the second index is the metric index. These two indices will always exist for any stonks database and they cannot be deleted.

Index ID zero is reserved to represent an empty index.

Index ID one, the label index, is used to label indices beyond the two special indices. Each label will correspond to exactly one index, and operations to list all indices or search through them can be done on this index. The label index is also special because it does not contain the first two label IDs, as these correspond to the unnamed label and metric indices.

Index ID two, the metric index, is used to label metrics. Each label will correspond to exactly one metric, and these should be relatively transparent to the user. Operations to list all metrics or search through metrics can be done on this index.

Labels take the form of identifiers, allowing for up to 32k of data for each label -- although you shouldn't use that much. Each index, including the special ones, allows you to constrain the kind of data within them using an expression -- while you can request this expression, the exact value of it may change between versions to allow for extension. The exact constraint on the data should not change between versions, but the way the constraint is represented may.

The number of indices and labels within each index is virtually unlimited, as their IDs are represented using up to 128 bits.

Metrics

Metrics are the important part of stonks, and they represent time series data with up to nanosecond granularity. All timestamps are represented using TAI, ensuring that leap seconds can be properly represented without breaks in continuity.

Conceptually, metrics map ranges of time to metric data, which takes the form of one or more integers. Each metric has some metadata associated with it to allow for automatic aggregation of data using a number of different functions. Because metrics always represent integers, non-integer quantities like rational numbers, floating-point numbers, or fixed-point numbers must be represented using multiple integers, and strings must be stored as label IDs.

In general, metrics are always optimised to store data in non-overlapping ranges, meaning that aggregation must be specified to allow dealing with overlapping ranges. Aggregation does not necessarily require that each range stores exactly one data point, however; for example, data which includes labels may wish to keep a record of all unique labels that occur within a given range of time. The main requirement for data aggregation is that aggregated data is always of the same type as non-aggregated data, meaning that any additional values that are required for aggregation should be present in the original data.

The configuration for metrics has three parts:

  1. The fields of the metric and their types
  2. An aggregation function
  3. When to aggregate

The first part is relatively straightforward; each metric stores one or more integers and you must specify how many integers you wish to store upon metric creation. If you wish to change this number, you must create a new metric and migrate data to the new metric. Each integer should also be flagged with a given type, namely:

  1. Fixed (value divided by a given constant)
  2. Rational (pair of values representing numerator and denominator)
  3. Float (pair of values representing coefficient and power of a given constant)
  4. Label (represented by its ID or zero to indicate missing)

Each field should also be marked as allowing all integers (-/0/+), non-negative integers (0/+), or positive integers (+). Labels are restricted to only non-negative or positive integers (where zero may indicate a missing label), and the denominator of a rational number is always a positive integer. Fields may also be marked as for-aggregation-only by providing them with a value that will always be used when adding new data points, which may be based on the values of the fields.
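The sign constraints above can be modelled with two small enums. This is only a sketch: `Sign`, `FieldType` and `sign_permitted` are hypothetical names, `i128` is an assumed integer width, and the sketch covers only the rule that labels cannot allow negative values, not the rule about rational denominators.

```rust
/// Which integers a field accepts (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sign {
    /// All integers (-/0/+).
    Any,
    /// Non-negative integers (0/+).
    NonNegative,
    /// Positive integers (+).
    Positive,
}

impl Sign {
    /// Check whether `value` satisfies this constraint.
    pub fn allows(self, value: i128) -> bool {
        match self {
            Sign::Any => true,
            Sign::NonNegative => value >= 0,
            Sign::Positive => value > 0,
        }
    }
}

/// The field types listed in the text (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldType {
    Fixed,
    Rational,
    Float,
    Label,
}

/// Labels may only be non-negative or positive; every other
/// combination is accepted by this sketch.
pub fn sign_permitted(field_type: FieldType, sign: Sign) -> bool {
    !(field_type == FieldType::Label && sign == Sign::Any)
}
```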

Aggregation takes the form of two parts:

  1. How data should be split if it partially overlaps a range
  2. How multiple data points in the same range can be aggregated into a single point

Splitting of data will always occur before aggregation. The resulting data point for the two halves of a split (chronologically before and after split) and the aggregation of multiple data points will be represented using an expression, whose exact format will be detailed later.

When to aggregate will be represented as a series of range sizes which must be whole multiples of each other -- for example, you can't aggregate by both weeks and months, because a week may overlap with two months. When each range is aggregated may be determined by age (e.g. aggregate by day if older than a month) or by total size (e.g. aggregate oldest data by day once we exceed 2 GiB of data). The smallest range size will always aggregate and split data to match that range size -- e.g., if you provide data from 23:59 to 00:01 and your smallest range is a day, then the data will be split and aggregated into both overlapping days.
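The two rules above, nested range sizes and splitting at range boundaries, can be sketched as follows. Timestamps and sizes are plain `u64` tick counts in whatever unit the caller chooses, and ranges are treated as half-open; both are assumptions, as are the function names. Only the time range is split here; how the data values themselves are divided is left to the split expression.

```rust
/// Check that every range size is a whole multiple of the next smaller one.
/// An empty list or a zero size is rejected.
pub fn ranges_nest(sizes: &[u64]) -> bool {
    let mut sorted = sizes.to_vec();
    sorted.sort_unstable();
    let smallest_is_nonzero = sorted.first().map_or(false, |&s| s > 0);
    smallest_is_nonzero && sorted.windows(2).all(|w| w[1] % w[0] == 0)
}

/// Split the half-open range `[start, end)` at every multiple of `size`,
/// so that no resulting piece crosses a range boundary.
pub fn split_range(start: u64, end: u64, size: u64) -> Vec<(u64, u64)> {
    let mut parts = Vec::new();
    if size == 0 {
        return parts;
    }
    let mut cursor = start;
    while cursor < end {
        // The first boundary strictly after the cursor.
        let next_boundary = (cursor / size + 1) * size;
        let stop = next_boundary.min(end);
        parts.push((cursor, stop));
        cursor = stop;
    }
    parts
}
```

Working in seconds with a day of 86400, data from 23:59 to 00:01 spans `[86340, 86460)` and is split into `[86340, 86400)` and `[86400, 86460)`, one piece per day. Hours, days and weeks nest; a week and a fixed 30-day period do not.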