Some SIP forwarders and clients are so limited or broken that they only allow the ASCII digits 0-9 in the SIP URI username. It is therefore useful to have an encoding that maps arbitrary text to ASCII digits only, using as few resulting characters as possible in the average case.

Full ASCII goes to 127, so it takes three digits to encode each code point directly. However, if we subtract 30 to skip the nonprinting characters at the front of ASCII, every remaining code point fits within two digits. For example, space (32) becomes 02 and "~" (126) becomes 96.

The encoding:

bytes = text encoded in utf8

if all (0 <= byte - 30 <= 97) then
    concat of each byte { base 10 encode of (byte - 30), zero-padded to two digits each }
else
    "99" + concat of each byte { base 10 encode of byte, zero-padded to three digits each }
end
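
A minimal Python sketch of the encoding above, for illustration only. The names `encode_base10`, `OFFSET`, `MAX_PAIR` and `FALLBACK_PREFIX` are hypothetical and not part of the proposal. The sketch also assumes that bytes below 30 (control characters such as newline) should take the three-digit fallback, since they cannot be written as a non-negative two-digit value.

```python
OFFSET = 30             # subtracted from each byte in the two-digit form
MAX_PAIR = 97           # 127 - 30, the largest value the two-digit form uses
FALLBACK_PREFIX = "99"  # marks the three-digit form


def encode_base10(text: str) -> str:
    """Encode text to ASCII digits 0-9 only (illustrative sketch)."""
    data = text.encode("utf-8")
    # Two-digit form: only possible if every byte lands in 00..97 after the shift.
    # Assumption: bytes below 30 force the fallback rather than being an error.
    if all(0 <= b - OFFSET <= MAX_PAIR for b in data):
        return "".join(f"{b - OFFSET:02d}" for b in data)
    # Fallback: "99" followed by every raw UTF-8 byte as three decimal digits.
    return FALLBACK_PREFIX + "".join(f"{b:03d}" for b in data)


print(encode_base10("Hi"))  # 4275
print(encode_base10("é"))   # 99195169
```

"Hi" is bytes 72 and 105, which become 42 and 75. "é" is the UTF-8 bytes 195 and 169, so the whole string uses the three-digit form.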

(ossguy) The "99" above was "00" in the original draft. I feel that "99" is better for a few reasons. First, it is unlikely to be mistaken for null by a naive decoder implementation (which would be bad in the first position, as it would result in an empty string). Second, it is easier to parse, including the UTF-8 fallback. For example, instead of explicitly checking the first pair, you can start decoding on the assumption that the input is all ASCII and stop at the first pair greater than "98". If that is the first pair, proceed with the UTF-8 fallback. If it is not the first pair, the input is invalid (at least in the current encoding; we could later add an option to start with ASCII and switch to UTF-8 partway through by putting a "99" in the middle). Third, "99" is clearly not a valid pair (it is greater than "97", the last valid converted ASCII value) and not a valid start of a triple (since a base 10 UTF-8 triple must start with 0, 1, or 2).
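
The decoding strategy described in this comment could be sketched roughly as follows in Python. The names `decode_base10` and `_decode_triples` are hypothetical. The sketch assumes that the reserved pair "98", and a "99" anywhere other than the start, are rejected as invalid, which matches the current encoding but is not spelled out above.

```python
OFFSET = 30             # added back to each two-digit value
MAX_PAIR = 97           # largest valid two-digit value (byte 127)
FALLBACK_PREFIX = "99"  # marks the three-digit UTF-8 form


def _decode_triples(digits: str) -> str:
    """Decode the fallback body: one UTF-8 byte per three decimal digits."""
    if len(digits) % 3 != 0:
        raise ValueError("fallback body length must be a multiple of 3")
    values = [int(digits[i:i + 3]) for i in range(0, len(digits), 3)]
    if any(v > 255 for v in values):
        raise ValueError("triple out of byte range")
    return bytes(values).decode("utf-8")


def decode_base10(digits: str) -> str:
    """Decode optimistically as two-digit ASCII, falling back on a leading 99."""
    if digits == "":
        return ""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("input must contain only ASCII digits 0-9")
    out = bytearray()
    for i in range(0, len(digits), 2):
        pair = digits[i:i + 2]
        if len(pair) < 2:
            raise ValueError("truncated two-digit pair")
        value = int(pair)
        if value > MAX_PAIR:
            # Only a "99" in the very first position switches to the fallback.
            if i == 0 and pair == FALLBACK_PREFIX:
                return _decode_triples(digits[2:])
            raise ValueError(f"invalid pair {pair!r} at offset {i}")
        out.append(value + OFFSET)
    return out.decode("ascii")


print(decode_base10("4275"))      # Hi
print(decode_base10("99195169"))  # é
```

No separate check of the first pair is needed: the loop reaches the "99" on its first iteration and hands the rest of the input to the fallback decoder.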

(ossguy) I generally like the above, and particularly the use of "30": for full ASCII (code points up to 127) it leaves two reserved numbers at the end ("98" and "99") plus two at the start, before the printable characters ("00" and "01").

TextEncodingBase10 (last edited 2020-250 23:57:34 by d50-92-76-47)