talkat speculative specification

Introduction

talkat ("talk across TLS") is a minimalist protocol for real-time textual bilateral communication, similar to talk(1). It updates talk(1) by using encrypted, authenticated connections with traffic-analysis countermeasures, and by being Unicode-aware.

Protocol

A talkat server listens on a TCP port, 5518 by default, and accepts connections using TLS 1.3 or later. The connecting client provides a client certificate, and uses Server Name Indication (SNI).

The application data in each direction consists entirely of a single stream of UTF-8 characters, interpreted as a handshake character followed by a timed character stream as described below.

If no client certificate is presented, the server MUST reject the connection.

Either party may close the TLS connection at any time.

TLS 1.2 (and earlier) connections MUST be rejected. TLS 1.2 is inappropriate, as it sends client certificates unencrypted.
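
As a minimal, non-normative sketch of the transport requirements above, a server-side TLS context could be configured with Python's standard ssl module as follows. The function name make_server_context is hypothetical, and loading the server's own certificate and the trusted peer certificates is left to the caller.

```python
import ssl

DEFAULT_PORT = 5518  # default talkat port, as stated in this specification


def make_server_context() -> ssl.SSLContext:
    """Build a server-side context that enforces the transport rules above."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # TLS 1.2 and earlier MUST be rejected.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    # A connection without a client certificate MUST be rejected.
    ctx.verify_mode = ssl.CERT_REQUIRED
    # Still to be done by the caller before accepting connections:
    #   ctx.load_cert_chain(certfile, keyfile)   # the server's own certificate
    #   ctx.load_verify_locations(cafile)        # certificates to trust
    # Assumption: with self-signed peer certificates, each known peer
    # certificate would have to be loaded as a trust anchor.
    return ctx
```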

Authentication

The server and client certificates are intended to identify the individual users involved.

In particular, a server process is intended to belong to a single user. Multiple users on a single host can use different ports. Alternatively, SNI may be used to disambiguate between multiple users, bearing in mind the privacy consequences of the fact that the SNI is sent unencrypted.

Each user is identified by a single public key, which they use as the subject public key of the tail certificates of the server and client certificate chains they provide. For out-of-band confirmation of identity, the SHA256/128 hash (i.e. the first 16 bytes of the SHA256 hash) of the binary DER encoding of the Subject Public Key Info (SPKI) field of such an X509 certificate (as in RFC 7469) should be used. A 128-bit truncated hash is sufficient, since only pre-image resistance rather than collision resistance is required.
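
The hash described above could be computed roughly as follows. This sketch assumes the caller has already extracted the DER-encoded SPKI bytes from the certificate with an X.509 library of their choice; the function name spki_fingerprint is hypothetical.

```python
import hashlib


def spki_fingerprint(spki_der: bytes, length: int = 16) -> str:
    """Return the hex encoding of the first `length` bytes of SHA256(spki_der).

    With the default length of 16 this is the SHA256/128 hash.
    """
    return hashlib.sha256(spki_der).digest()[:length].hex()
```

The result is 32 hexadecimal characters, which is what two users would compare out of band.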

The Common Name field of the tail client certificate may be used to indicate the identity of the user it claims to identify.

For communicating hashes and host information, we define the URI scheme "talkat:HASH[@HOST[:PORT]]", where HASH is the hexadecimal encoding (case-insensitive) of the SHA256/128 hash. For futureproofing purposes, other hash algorithms may be specified, so the full format is "talkat:[HASHALG:]HASH[@HOST[:PORT]]". HASHALG defaults to "sha256", and any truncation of the hash to its initial bytes is indicated by the length of HASH.
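
A parser for this URI format might look like the sketch below. The names TalkatURI and parse_talkat_uri are hypothetical, and several details are assumptions not settled by this specification: HASHALG is restricted to letters, digits and hyphens, HOST may not contain ':' (so IPv6 literals are not handled), and PORT falls back to the default port 5518 when a host is given without one.

```python
import re
from typing import NamedTuple, Optional

DEFAULT_PORT = 5518


class TalkatURI(NamedTuple):
    hashalg: str
    hash_hex: str          # lower-cased, since HASH is case-insensitive
    host: Optional[str]
    port: Optional[int]


# talkat:[HASHALG:]HASH[@HOST[:PORT]]
_URI_RE = re.compile(
    r"talkat:(?:(?P<alg>[A-Za-z0-9-]+):)?(?P<hash>(?:[0-9A-Fa-f]{2})+)"
    r"(?:@(?P<host>[^:@]+)(?::(?P<port>[0-9]+))?)?"
)


def parse_talkat_uri(uri: str) -> TalkatURI:
    m = _URI_RE.fullmatch(uri)
    if m is None:
        raise ValueError(f"not a talkat URI: {uri!r}")
    host = m["host"]
    if m["port"]:
        port = int(m["port"])
    else:
        port = DEFAULT_PORT if host else None
    return TalkatURI((m["alg"] or "sha256").lower(), m["hash"].lower(), host, port)
```

Any truncation is then recoverable as len(hash_hex) // 2 bytes.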

Further details of authentication are left up to the user agent, but the intended primary mode of operation is as follows. A single key pair is generated by a user and self-signed certificates generated from it are used both for the server certificate when running a server and for the client certificate when connecting to another server. Users use out-of-band means to verify key hashes. It does not make sense to limit the validity of these certificates, so it is recommended to create them with undefined validity (notAfter value of 99991231235959Z, as per RFC5280). Since including the user's name in a server certificate provided to anyone connecting to the appropriate port could be a privacy/security concern, it is recommended to use a server certificate with empty Common Name, and one or more client certificates each with the Common Name set to an appropriate name for the user, all generated from the same public key.

Handshake

The first character sent in each direction is the 1-byte handshake character 'T'. This is to be sent once the sender is ready for conversation to begin. The TLS client MUST NOT send the handshake character until it has received the handshake character from the TLS server.
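
The ordering rule above can be illustrated with two small helpers. This is only a sketch: the names send_handshake and expect_handshake are hypothetical, and conn stands for any connected socket-like object with sendall and recv, which in practice would be the TLS-wrapped socket.

```python
import socket

HANDSHAKE = b"T"


def send_handshake(conn) -> None:
    """Announce that this side is ready for the conversation to begin."""
    conn.sendall(HANDSHAKE)


def expect_handshake(conn) -> None:
    """Read the peer's first byte and check that it is the handshake character."""
    first = conn.recv(1)
    if first != HANDSHAKE:
        raise ConnectionError(f"expected handshake character, got {first!r}")
```

A TLS server would call send_handshake when ready and then expect_handshake. A TLS client would call expect_handshake first and only then send_handshake.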

Timed character streams

Each character stream is parsed as a sequence of lines terminated by \n or \r\n. Each line consists of a sequence of Unicode characters, erasures, and pauses, encoded as follows:

'\b' (0x08, ^H) is interpreted as erasing the last character of the current line. If the current line is empty, this is a no-op.
'\NAK' (0x15, ^U) is interpreted as erasing the entire current line.
'\n' is interpreted as terminating the current line and beginning a new empty line.
'~' followed by a 12-bit integer N encoded as a big-endian 2-character base64 sequence (using the character mapping of RFC4648) is interpreted as a pause of N milliseconds, except that the maximum value '~//' is interpreted as a pause of 4095ms or more.
"~~" is interpreted as the character '~'.
Null bytes are ignored; they can be used for padding.
All other Unicode characters are interpreted as themselves.
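
The rules above can be sketched as a small decoder. This is an illustration under stated assumptions rather than a reference implementation: the name decode_stream is hypothetical, a "character" is taken to be one Unicode code point, a '\r' immediately before '\n' is dropped, and the handshake character is assumed to have been consumed already.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode_stream(text: str):
    """Return (completed_lines, current_line, pauses_ms) for a decoded stream."""
    lines, current, pauses = [], [], []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch == "\x00":                 # padding, ignored
            continue
        if ch == "\b":                   # erase last character, no-op if empty
            if current:
                current.pop()
        elif ch == "\x15":               # NAK: erase the entire current line
            current.clear()
        elif ch == "\n":                 # terminate line (accepting \r\n too)
            if current and current[-1] == "\r":
                current.pop()
            lines.append("".join(current))
            current = []
        elif ch == "~":
            if text[i:i + 1] == "~":     # "~~" is a literal tilde
                current.append("~")
                i += 1
            else:                        # "~XY": 12-bit big-endian pause
                pair = text[i:i + 2]
                if len(pair) < 2 or pair[0] not in B64 or pair[1] not in B64:
                    raise ValueError("malformed pause")
                pauses.append(B64.index(pair[0]) * 64 + B64.index(pair[1]))
                i += 2
        else:
            current.append(ch)
    return lines, "".join(current), pauses
```

Applied to the string in the Example section, decode_stream("今~A+日わ!~DI\b\bは~~\n") returns (["今日は~"], "", [62, 200]). A value of 4095 in the pause list means "4095ms or more".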

Example

"今~A+日わ!~DI\b\bは~~\n" denotes "今" followed by a 62ms pause, then "日わ!", then a 200ms pause, then the erasure of the last two characters ("わ!") followed by "は~" and a newline. The final resulting string is "今日は~".

Chunking and padding

When transmitting over a network which might be surveilled by an adversary, such as the open internet, appropriate means MUST be used to mitigate the ability to fingerprint a user, and/or gain information on communication contents, through timing information (see RFC6973 for context on surveillance and traffic analysis). In particular, the naive approach of sending a packet immediately after each keystroke of a typing user MUST NOT be used on such a network. Instead, multiple quickly typed keystrokes are to be sent together, with the timing of the keystrokes indicated by the encoded pauses. Furthermore, packets MUST be padded to obscure the length of their contents; this can be done either with TLS record padding or by inserting null bytes in the timed character streams.

As a rough guide (non-normative), the stream could be chunked into 300ms intervals, with each chunk sent (if any character/erasure is typed in the interval) padded to have length divisible by 24, with no chunk ending with a pause. Note that regular chunking is necessary even if the Nagle algorithm is in effect (i.e. if TCP_NODELAY is not set), to ensure that typing speed is not revealed by packet sizes.
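
The rough guide above could be sketched as follows (also non-normative). The names encode_pause, pad and chunk_keystrokes are hypothetical, keystrokes are assumed to arrive as (time in ms, text) pairs sorted by time, and padding is done with null bytes in the stream rather than with TLS record padding.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_pause(ms: int) -> str:
    """Encode a pause as '~' plus two base64 characters, capped at 4095ms."""
    n = max(0, min(int(ms), 4095))
    return "~" + B64[n >> 6] + B64[n & 0x3F]


def pad(payload: bytes, block: int = 24) -> bytes:
    """Append null bytes until the length is divisible by `block`."""
    return payload + b"\x00" * (-len(payload) % block)


def chunk_keystrokes(keystrokes, interval_ms: int = 300, block: int = 24):
    """Group keystrokes into fixed intervals.

    Returns a list of (send_time_ms, padded_bytes), one entry per interval in
    which something was typed. Each pause is written before the keystroke that
    follows it, so no chunk ends with a pause.
    """
    bodies = {}                              # interval index -> encoded text
    prev_time = None
    for t, text in keystrokes:
        piece = text.replace("~", "~~")      # escape literal tildes
        if prev_time is not None and t > prev_time:
            piece = encode_pause(t - prev_time) + piece
        slot = t // interval_ms
        bodies[slot] = bodies.get(slot, "") + piece
        prev_time = t
    return [
        ((slot + 1) * interval_ms, pad(body.encode("utf-8"), block))
        for slot, body in sorted(bodies.items())
    ]
```

Each chunk is sent at the end of its interval, so an observer sees only which 300ms intervals contained typing and the padded sizes.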

Recommendations for display

This subsection is not normative.

A client attempting to display in real-time an incoming stream which includes pauses has to deal with the problem that network delays may exceed the pauses specified in the stream. There are two conflicting goals:

1. Minimise the delay between data being received and being displayed.
2. Faithfully reproduce the pauses specified in the stream.

Goal 1 can be achieved at the expense of goal 2 by immediately beginning display of the data in a packet when we receive it, compressing to zero any pauses in the part of the stream we were in the process of rendering. Alternatively, we can get as close as possible to achieving goal 2, at the expense of goal 1, by always rendering all pauses as specified, except when pauses are unavoidably lengthened or introduced due to a network delay in receiving the next part of the stream.

The latter strategy is likely to be unacceptable in the context of real-time communication, since it means that the delay between a packet being sent and being displayed is the maximum of the network delays for all packets sent so far. The former strategy might be acceptable, but an implementation may prefer a compromise strategy which handles small variations in network delay more smoothly.
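
To make the trade-off concrete, the two strategies can be compared with a small simulation. This is a simplified model under stated assumptions: the name schedule is hypothetical, each packet is given as (arrival time in ms, list of events) in arrival order, an event is either an int (a pause in ms) or a str (one character to display), and rendering itself takes no time.

```python
from typing import List, Tuple, Union

Event = Union[int, str]            # int = pause in ms, str = displayed character
Packet = Tuple[int, List[Event]]   # (arrival time in ms, decoded events)


def schedule(packets: List[Packet], faithful: bool) -> List[Tuple[int, str]]:
    """Return (display_time_ms, character) pairs for the chosen strategy."""
    shown = []
    clock = 0
    for i, (arrival, events) in enumerate(packets):
        clock = max(clock, arrival)          # cannot render before arrival
        deadline = packets[i + 1][0] if i + 1 < len(packets) else None
        for ev in events:
            if isinstance(ev, str):
                shown.append((clock, ev))
                continue
            clock += ev                      # honour the pause...
            if not faithful and deadline is not None:
                # ...unless the next packet has arrived, in which case the
                # rest of this packet's pauses are compressed to zero.
                clock = min(clock, deadline)
    return shown
```

For packets [(0, ["a", 200, "b"]), (100, ["c", 50, "d"])], the immediate strategy shows "b" and "c" together at 100ms and "d" at 150ms, whereas the faithful strategy shows "b" at 200ms, "c" at 200ms and "d" at 250ms.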

Other real-time text protocols

RFC4103: lossy protocol using RTP over UDP, for use with SIP.
XEP-0301: XMPP extension. Doesn't seem to be used.