Request for Comments: 4464 M. West
Category: Informational Siemens/Roke Manor Research
May 2006
Signaling Compression (SigComp) Users’ Guide
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2006).
Abstract
This document provides an informational guide for users of the
Signaling Compression (SigComp) protocol. The aim of the document is
to assist users when making SigComp implementation decisions, for
example, the choice of compression algorithm and the level of
robustness against lost or misordered packets.
Table of Contents
1. Introduction ....................................................3
2. Overview of the User Guide ......................................3
3. UDVM Assembly Language ..........................................4
3.1. Lexical Level ..............................................4
3.2. Syntactic Level ............................................5
3.2.1. Expressions .........................................7
3.2.2. Instructions ........................................8
3.2.3. Directives ..........................................9
3.2.4. Labels .............................................10
3.3. Uploading the Bytecode to the UDVM ........................11
4. Compression Algorithms .........................................12
4.1. Well-known Compression Algorithms .........................12
4.1.1. LZ77 ...............................................12
4.1.2. LZSS ...............................................16
4.1.3. LZW ................................................18
4.1.4. DEFLATE ............................................21
4.1.5. LZJH ...............................................24
4.2. Adapted Algorithms ........................................28
4.2.1. Modified DEFLATE ...................................28
5. Additional SigComp Mechanisms ..................................31
5.1. Acknowledging a State Item ................................32
5.2. Static Dictionary .........................................33
5.3. CRC Checksum ..............................................34
5.4. Announcing Additional Resources ...........................35
5.5. Shared Compression ........................................37
6. Security Considerations ........................................38
7. Acknowledgements ...............................................38
8. Intellectual Property Right Considerations .....................38
9. Informative References .........................................38
Appendix A. UDVM Bytecode for the Compression Algorithms ..........40
A.1. Well-known Algorithms .....................................40
A.1.1. LZ77 ..............................................40
A.1.2. LZSS ..............................................40
A.1.3. LZW ...............................................40
A.1.4. DEFLATE ...........................................40
A.1.5. LZJH ..............................................41
A.2. Adapted Algorithms ........................................41
A.2.1. Modified DEFLATE ...................................41
1. Introduction
This document provides an informational guide for users of the
SigComp protocol, RFC 3320 [2]. The idea behind SigComp is to
standardize a Universal Decompressor Virtual Machine (UDVM) that can
be programmed to understand the output of many well-known compressors
including DEFLATE [8] and LZW [7]. The bytecode for the chosen
compression algorithm is uploaded to the UDVM as part of the
compressed data.
The basic SigComp RFC describes the actions that an endpoint must
take upon receiving a SigComp message. However, the entity
responsible for generating new SigComp messages (the SigComp
compressor) is left as an implementation decision; any compressor can
be used provided that it generates SigComp messages that can be
successfully decompressed by the receiving endpoint.
This document gives examples of a number of different compressors
that can be used by the SigComp protocol. It also gives examples of
how to use some of the mechanisms (such as acknowledgements)
described in RFC 3321 [3].
2. Overview of the User Guide
When implementing a SigComp compressor, the first step is to choose a
compression algorithm that can encode the application messages into a
(hopefully) smaller form. Since SigComp can upload bytecode for new
algorithms to the receiving endpoint, arbitrary compression
algorithms can be supported provided that suitable bytecode has been
written for the corresponding decompressor.
This document provides example bytecode for the following algorithms:
1. LZ77
2. LZSS
3. LZW
4. DEFLATE
5. LZJH
Any of the above algorithms may be useful depending on the desired
compression ratio, processing and memory requirements, code size,
implementation complexity, and Intellectual Property (IPR)
considerations.
As well as encoding the application messages using the chosen
algorithm, the SigComp compressor is responsible for ensuring that
messages can be correctly decompressed even if packets are lost or
misordered during transmission. The SigComp feedback mechanism can
be used to acknowledge successful decompression at the remote
endpoint.
The following robustness techniques and other mechanisms specific to
the SigComp environment are covered in this document:
1. Acknowledgements using the SigComp feedback mechanism
2. Static dictionary
3. Cyclic redundancy code (CRC) checksum
4. Announcing additional resources
5. Shared compression
Any or all of the above mechanisms can be implemented in conjunction
with the chosen compression algorithm. An example subroutine of UDVM
bytecode is provided for each of the mechanisms; these subroutines
can be added to the bytecode for one of the basic compression
algorithms. (Note: The subroutine or the basic algorithm may require
minor modification to ensure they work together correctly.)
3. UDVM Assembly Language
Writing UDVM programs directly in bytecode would be a daunting task,
so a simple assembly language is provided to facilitate the creation
of new decompression algorithms. The assembly language includes
mnemonic codes for each of the UDVM instructions, as well as simple
directives for evaluating integer expressions, padding the bytecode,
and so forth.
The syntax of the UDVM assembly language uses the customary two-level
description technique, partitioning the grammar into a lexical and a
syntactic level.
3.1. Lexical Level
On a lexical level, a string of assembly consists of zero or more
tokens optionally separated by whitespace. Each token can be a text
name, an instruction opcode, a delimiter, or an integer (specified as
decimal, binary, or hex).
The following ABNF description, RFC 4234 [1], specifies the syntax of
a token:
token = (name / opcode / delimiter / dec / bin / hex)
name = (lowercase / "_") 0*(lowercase / digit / "_")
opcode = uppercase *(uppercase / digit / "-")
delimiter = "." / "," / "!" / "$" / ":" / "(" / ")" /
operator
dec = 1*(digit)
bin = "0b" 1*("0" / "1")
hex = "0x" 1*(hex-digit)
hex-digit = digit / %x41-46 / %x61-66
digit = %x30-39
uppercase = %x41-5a
lowercase = %x61-7a
operator = "+" / "-" / "*" / "/" / "%" / "&" / "|" /
"^" / "~" / "<<" / ">>"
When parsing for tokens, the longest match is applied, i.e., a token
is the longest string that matches the <token> rule specified above.
The syntax of whitespace and comments is specified by the following
ABNF:
ws = *(%x09 / %x0a / %x0d / %x20 / comment)
comment = ";" *(%x00-09 / %x0b-0c / %x0e-ff)
(%x0a / %x0d)
Whitespace that matches <ws> is skipped between tokens, but serves to
terminate the longest match for a token.
Comments are specified by the symbol ";" and are terminated by the
end of the line, for example:
LOAD (temp, 1) ; This is a comment.
Any other input is a syntax error.
When parsing on the lexical level, the string of assembly should be
divided up into a list of successive tokens. The whitespace and
comments should also be deleted. The assembly should then be parsed
on the syntactic level as explained in Section 3.2.
3.2. Syntactic Level
Once the string of assembly has been divided into tokens as per
Section 3.1, the next step is to convert the assembly into a string
of UDVM bytecode.
On a syntactic level, a string of assembly consists of zero or more
instructions, directives, or labels, each of which is itself built up
from one or more lexical tokens.
The following ABNF description specifies the syntax of the assembly
language. Note that the lexical parsing step is assumed to have been
carried out; so in particular, the boundaries between tokens are
already known, and the comments and whitespace have been deleted:
assembly = *(instruction / directive / label)
instruction = opcode ["(" operand *("," operand) ")"]
operand = [["$"] expression]
; Operands can be left blank if they can
; be automatically inferred by the
; compiler, e.g., a literal operand
; that specifies the total number of
; operands for the instruction.
; When "$" is prepended to an operand,
; the corresponding integer is an
; address rather than the actual operand
; value. This symbol is mandatory for
; reference operands, optional for
; multitypes and addresses, and
; disallowed for literals.
label = ":" name
directive = padding / data / set / readonly /
unknown-directive
unknown-directive = name ["(" expression *("," expression) ")"]
; The parser can ignore unknown
; directives. The resulting bytecode
; may or may not generate the expected
; results.
padding = ("pad" / "align" / "at") "(" expression ")"
data = ("byte" / "word") "(" expression *(","
expression) ")"
readonly = "readonly" "(" "0" / "1" ")"
set = "set" "(" name "," expression ")"
expression = value / "(" expression operator expression ")"
value = dec / bin / hex / name / "." / "!"
; "." is the location of this
; instruction/directive, whereas "!" is
; the location of the closest
; DECOMPRESSION-FAILURE
The following sections define how to convert the instructions, labels
and directives into UDVM bytecode:
3.2.1. Expressions
The operand values needed by particular instructions or directives
can be given in the form of expressions. An expression can include
one or more values specified as decimal, binary, or hex (binary
values are preceded by "0b" and hex values are preceded by "0x").
The expression may also include one or more of the following
operators:
"+" Addition
"-" Subtraction
"*" Multiplication
"/" Integer division
"%" Modulo arithmetic (a%b := a modulo b)
"&" Binary AND
"|" Binary OR
"^" Binary XOR
"~" Binary XNOR
"<<" Binary LSHIFT
">>" Binary RSHIFT
The operands for each operator must always be surrounded by
parentheses so that the order in which the operators should be
evaluated is clear. For example:
((1 + (2 * 3)) & (0xabcd - 0b00101010)) gives the result 3.
Expressions can also include the special values "." and "!". When
the symbol "." is encountered, it is replaced by the location in the
bytecode of the current instruction/directive. When the symbol "!"
is encountered it is replaced by the location in the bytecode of the
closest DECOMPRESSION-FAILURE instruction (i.e., the closest zero
byte). This can be useful when writing UDVM instructions that call a
decompression failure, for example:
INPUT-BYTES (1, temp, !)
The above instruction causes a decompression failure to occur if it
tries to input data from beyond the end of the compressed message.
Note: When using "!" in the assembly language to generate bytecode,
care must be taken to ensure that the address of the zero used at
bytecode generation time will still contain zero when the bytecode is
run. The readonly directive (see Section 3.2.3) can be used to do
this.
It is also possible to assign integer values to text names: when a
text name is encountered in an expression, it is replaced by the
integer value assigned to it. Section 3.2.3 explains how to assign
integer values to text names.
3.2.2. Instructions
A UDVM instruction is specified by the instruction opcode followed by
zero or more operands. The instruction operands are enclosed in
parentheses and separated by commas, for example:
ADD ($3, 4)
When generating the bytecode, the parser should replace the
instruction opcode with the corresponding 1-byte value as per Figure
11 of SigComp [2].
Each operand consists of an expression that evaluates to an integer,
optionally preceded by the symbol "$". This symbol indicates that
the supplied integer value must be interpreted as the memory address
at which the operand value can be found, rather than the actual
operand value itself.
When converting each instruction operand to bytecode, the parser
first determines whether the instruction expects the operand to be a
literal, a reference, a multitype, or an address. If the operand is
a literal, then, as per Figure 8 of SigComp, the parser inserts
bytecode (usually the shortest) capable of encoding the supplied
operand value.
Since literal operands are used to indicate the total number of
operands for an instruction, it is possible to leave a literal
operand blank and allow its value to be inferred automatically by the
assembler. For example:
MULTILOAD (64, , 1, 2, 3, 4)
The missing operand should be given the value 4 because it is
followed by a total of 4 operands.
If the operand is a reference, then, as per Figure 9 of SigComp, the
parser inserts bytecode (usually the shortest) capable of encoding
the supplied memory address. Note that reference operands will
always be preceded by the symbol "$" in assembly because they always
encode memory addresses rather than actual operand values.
If the operand is a multitype, then the parser first checks whether
the symbol "$" is present. If so, then, as per Figure 10 of SigComp,
it inserts bytecode (usually the shortest) capable of encoding the
supplied integer as a memory address. If not, then, as per Figure 10
of SigComp, it inserts bytecode (usually the shortest) that encodes
the supplied integer as an operand value.
If the operand is an address, then the parser checks whether the
symbol "$" is present. If so, then the supplied integer is encoded
as a memory address, just as for the multitype instruction above. If
not, then the byte position of the opcode is subtracted from the
supplied integer modulo 16, and the result is encoded as an operand
value as per Figure 10 of SigComp.
The length of the resulting bytecode is dependent on the parser in
use. There can be several correct and usable representations of the
same instruction.
3.2.3. Directives
The assembly language provides a number of directives for evaluating
expressions, moving instructions to a particular memory address, etc.
The directives "pad", "align", and "at" can be used to add padding to
the bytecode.
The directive "pad (n)" appends n consecutive padding bytes to the
bytecode. The actual value of the padding bytes is unimportant, so
when the bytecode is uploaded to the UDVM, the padding bytes can be
set to the initial values contained in the UDVM memory (this helps to
reduce the size of a SigComp message).
The directive "align (n)" appends the minimum number of padding bytes
to the bytecode such that the total number of bytes of bytecode
generated so far is a multiple of n bytes. If the bytecode is
already aligned to a multiple of n bytes, then no padding bytes are
added.
The directive "at (n)" appends enough padding bytes to the bytecode
such that the total number of bytes of bytecode generated so far is
exactly n bytes. If more than n bytes have already been generated
before the "at" directive is encountered then the assembly code
contains an error.
The directives "byte" and "word" can be used to add specific data
strings to the bytecode.
The directive "byte (n[0],..., n[k-1])" appends k consecutive bytes
to the bytecode. The byte string is supplied as expressions that
evaluate to give integers n[0],..., n[k-1] from 0 to 255.
The directive "word (n[0],..., n[k-1])" appends k consecutive 2-byte
words to the bytecode. The word string is supplied as expressions
that evaluate to give integers n[0],..., n[k-1] from 0 to 65535.
The directive "set (name, n)" assigns an integer value n to a
specified text name. The integer value can be supplied in the form
of an expression.
The directive "readonly (n)" where n is 0 or 1 can be used to
indicate that an area of memory could be changed (0) or will not be
changed (1) during the execution of the UDVM. This directive could
be used, for example, in conjunction with "!" to ensure that the
address of the zero used will still contain zero when the bytecode is
executed. If no readonly directive is used, then any address
containing zero can be used by "!" (i.e., by default, there is
assumed to be a readonly (1) directive at Address 0) and it is up to
the author of the assembly code to ensure that the address in
question will still contain zero when the bytecode is executed. If
the readonly directive is used, then bytes between a readonly (0) and
readonly (1) pair are NOT to be used by "!". When a readonly
directive has been used, the bytes obey that directive from that
address to either another readonly directive or the end of UDVM
memory, whichever comes first.
3.2.4. Labels
A label is a special directive used to assign memory addresses to
text names.
Labels are specified by a single colon followed by the text name to
be defined. The (absolute) position of the byte immediately
following the label is evaluated and assigned to the text name. For
example:
:start
LOAD (temp, 1)
Since the label "start" occurs at the beginning of the bytecode, it
is assigned the integer value 0.
Note that writing the label ":name" has exactly the same behavior as
writing the directive "set (name, .)".
3.3. Uploading the Bytecode to the UDVM
Once the parser has converted a string of assembly into the
corresponding bytecode, it must be copied to the UDVM memory
beginning at Address 0 and then executed, beginning from the first
UDVM instruction in the bytecode.
SigComp provides the following message format for uploading bytecode
to the UDVM:
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| 1 1 1 1 1 | T | 0 |
+---+---+---+---+---+---+---+---+
| |
: returned feedback item : if T = 1
| |
+---+---+---+---+---+---+---+---+
| code_len |
+---+---+---+---+---+---+---+---+
| code_len | destination |
+---+---+---+---+---+---+---+---+
| |
: uploaded UDVM bytecode :
| |
+---+---+---+---+---+---+---+---+
| |
: remaining SigComp message :
| |
+---+---+---+---+---+---+---+---+
The destination field should be set to the memory address of the
first UDVM instruction. Note that if this address cannot be
represented by the destination field, then the bytecode cannot be
uploaded to the UDVM using the standard SigComp header. In
particular, the memory address of the first UDVM instruction must
always be a multiple of 64 bytes or the standard SigComp header
cannot be used. Of course, there may be other ways to upload the
bytecode to the UDVM, such as retrieving the bytecode directly via
