An introduction to decoding

Frank Pilhofer's hopefully humorous introduction to how mail and news came to be encoded, along with some hints on how to make the best of the bad situation we always seem to find ourselves in.

 

Bits of History

This text is fiction. There's some truth in it, but its main purpose is to show, entertainingly, how the problems arose. It's also a little technical, and it isn't essential: if you don't want to know, skip this section. The other sections explain the different methods of encoding and give some guidelines for encoding files. If you are only interested in decoding, read on anyway; understanding encoding will also help you understand the decoding process.

Once upon a time, in a not so far away land, hardware engineers thought of a basic calculation unit for their computers, packaged 8 bits together and called it a byte. It would store 256 different values that could be used for machine instructions. Well, this definition was broken from time to time with 7 or 9 bit machines, but the 8 bits were eventually accepted as a good choice. Note that at first, nobody thought of storing text strings in computer memory, which was far too precious to be filled with stupid messages like The Result is or Hello World. People were content with receiving a punch card with an equivalent of 3.14157354 coded into it after letting the machine calculate for 5 hours.

It was years later that memory was getting so cheap that some computers were equipped with a kilobyte or more, and results were printed on a small computer screen, when finally executives decided that a machine that cost millions to build should be able to express itself in a more human-readable form. It really quite probably was the executives that made this decision. Computer scientists didn't need such a feature; they could all read the punch cards and were busy fixing the fifth decimal of their pi approximation. Then the executives fixed pi to be 4, and relieved the scientists of their quest for more accuracy. (Look at the Guinness Book of Records. It was in a different context, but mentions the definition of pi=4 by some court).

Everything really took off when the first terminals were hooked up to the computer. Just imagine, no more punching the right unlabeled buttons; now the computer could read commands in more or less plain English. The characters now had to be transferred from the keyboard into the machine, and then from the machine back onto the display. For this transmission the letters and digits had to be encoded. But how? The 256 values of a byte are obviously not needed for this task: there are far fewer than 256 usable characters. The engineers decided that there was no need to waste a full 8 bits on a single character, so they used only 7 bits with 128 values to encode a letter. Even this was too much, so they invented lots of special characters to tell the machine on one end or the human on the other end what's going on, like characters for end of input or ring bell. You can guess the executives were enthusiastic when the machine accepted their inputs with a gentle but firm ping!

This solution was perfect for a couple of years, yes, even for more than a decade, until the price of computers dropped below a couple of hundred thousand dollars, and some institutions found they had some money left for a second computer. Wow, to have two computers, that was power! Then students with too much time on their hands started to connect the machines together, and wrote software to communicate from one to the other and send messages over that link. For that, they adopted the established protocol of a machine talking with its terminal. Only they did not use 7 bits for encoding but 8 bits; the new bit was used for parity, as a checksum to see if the transmission was successful. But still, they used all the special characters that were invented for the terminal. For years, they were happy sending little pieces of mail from here to there, and eventually, after discovering the advantages of a telephone line, across the country.

The trouble started when computer companies managed to sell more than one computer of the same model, and the users of both machines got to know each other. Both machines were compatible, so programs written on the one could be copied and run on the other. And it wasn't long before the first user boasted to the second about what a neat program he'd written. Then the second one, reading this message by mail thousands of miles away, asked the first user to send the program to him. The user shivered, "but, but ... it's a program! You see, this assembler instruction here is the terminal character for end of transmission, it would tell your mail receiving program to terminate the connection and you wouldn't get the rest of it. I can't possibly send it to you!"

Of course, this situation is unsatisfactory. What's the purpose of linking computers together if you can't share programs, or any other data you like? That's a dilemma. The communication links are standardized, and we do not want to start all over again. We can transfer plain text but not binary files. But what if we encode binary files into plain text on the one end, and decode them from plain text into the original binary representation on the other? That'll work, so that's what we'll do!

So programs were written for encoding and decoding, until the next obstacle was hit: some mail transfer programs, which the programmers had just avoided rewriting, read or wrote mail in fixed-size buffers. If a sender sent more than the recipient's buffer size, the end of the message got lost. Because computer users usually don't write novels, this limit had previously gone unnoticed; but images or movie clips just failed to fit and arrived, if at all, only in pieces. In pieces! That's a splendid idea; if we split up large files, we can then encode and transmit the pieces individually and let the recipient put them together, and then we can transfer everything we want!

Yes, once again the programmers had successfully avoided going back to the drawing board. They were now free of their problems - because they had loaded them all onto the user, who now has to ask himself, "How do I split the original file up?" "How do I encode the pieces?" "How do I decode incoming pieces?" "How do I put the decoded pieces together?"

 

The Problem

If you didn't read the above text, or didn't catch the point, here's the major problem we have to face:

  • Binary data, that includes software, images, audio, video etc., uses all 256 possible values of a byte.
     
  • Some of these values represent control characters that would have undesired effects upon transmission, or wouldn't transfer correctly.
     
  • So binary data is encoded into a set of characters that can be safely transferred. This encoded data is then decoded into its original form by the recipient.
     
  • Some transfer programs can only transfer messages of limited size, so large files must be split into pieces.

Note that the word mailing, wherever it is used, can be replaced by posting. Posting messages to the Usenet news system is similar to mailing because the same methods of communication, with all of the above problems, are employed. When transferring binary files to and from newsgroups, the files must also be split up, encoded and decoded.
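
The first three points can be made concrete with a small Python sketch. It uses Base64 from the standard library purely as an example of a "safe" encoding; the sample bytes and the helper is_safe_text are made up for illustration and are not part of any mail program.

```python
import base64
import string

# Hypothetical sample: bytes such as a program or image might contain.
# 0x04 is the ASCII "end of transmission" control character, 0x07 is "bell".
binary_data = bytes([0x4D, 0x5A, 0x04, 0x00, 0x07, 0xFF, 0x80, 0x0A])

# The characters Base64 output is built from (letters, digits, "+", "/", "=").
SAFE = set((string.ascii_letters + string.digits + "+/=").encode("ascii"))

def is_safe_text(data: bytes) -> bool:
    """Return True if every byte is one of the 'safe' characters above."""
    return all(b in SAFE for b in data)

encoded = base64.b64encode(binary_data)  # sender: binary -> plain text
decoded = base64.b64decode(encoded)      # recipient: plain text -> binary

print(is_safe_text(binary_data))  # False
print(is_safe_text(encoded))      # True
print(decoded == binary_data)     # True
```

The encoded form is longer than the original, which is the price paid for passing unharmed through links that only expect text.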

You will find that the UUDeview package will help you through all the steps of encoding files and decoding incoming messages. However, the rest of this text is not specific to that program. What follows is a discussion of four different methods of encoding, and a few guidelines.

 

Four Methods

Everything becomes a little more complicated because there isn't just one way to accomplish all that. In particular, there are four different, incompatible methods to encode binary data into plain text. Sender and recipient must agree on the same method. The four methods are:

  • uu-encoding. This was historically the first method invented. The encoding was quite simple and caused frequent trouble. In particular, the first implementation used the space character for encoding. But some mail gateways stripped spaces at the end of a line, so what the recipient got was invalid. Later, a special case was introduced to avoid the problem.
     
  • xx-encoding. This rarely used method appeared after the initial problems with uuencoding, but before they were fixed. It also avoided using the space character by using a different character set for encoding.
     
  • Base64. This method was introduced by the MIME standard, avoiding some rarely encountered problems with the other two methods (which used some characters not available on some machines). Together with the other benefits of MIME, this makes it the safest method.
     
  • BinHex. A method conceived for transferring files among Macintosh systems. Files on a Macintosh consist of two parts, the "data fork" and the "resource fork". This encoding first adds a third header part, and composes the three parts into a single data stream, which is then slightly compressed and encoded.

Actually, there are a couple of other, less widely used encodings, like ship or btoa. They are not covered here, since they are rarely seen in "real life".
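
To see the space problem of uuencoding, and how Base64 sidesteps it, here is a minimal sketch using Python's standard binascii and base64 modules. The three zero bytes are a made-up worst case; xxencoding and BinHex are left out because the sketch sticks to what the standard library offers.

```python
import base64
import binascii

data = b"\x00\x00\x00"  # three zero bytes: the worst case for the space problem

# One uuencoded line: a length character, then the encoded data.
uu_space = binascii.b2a_uu(data)                 # original style: zero -> space
uu_grave = binascii.b2a_uu(data, backtick=True)  # later fix: zero -> "`"
b64 = base64.b64encode(data)

print(uu_space)  # b'#    \n'
print(uu_grave)  # b'#````\n'
print(b64)       # b'AAAA'

# Both uuencoded forms decode to the same bytes as long as nothing was stripped.
print(binascii.a2b_uu(uu_space) == data)  # True
print(binascii.a2b_uu(uu_grave) == data)  # True
```

A gateway that strips trailing spaces would shorten the first line to a lone "#", which a decoder expecting the full line could then reject. The second and third forms contain no spaces at all.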

 

Encoding Guidelines

Which method to use?

Of course, the question is, "what method shall I use?" The first rule is to avoid xxencoding, which is obsolete. And while BinHex is a "must" for Macintosh systems, it is usually not a wise choice on other systems, because decoders are not widespread elsewhere. True, BinHex claims to also compress the file and make the transferred data smaller, but don't believe this argument. It does not hold for already-compressed data.

Compression is another important issue that needs to be mentioned. Never send an uncompressed file; this just wastes valuable bandwidth. GIF and JPEG images are already well compressed, but other types of data should always be compressed into a ZIP file or something similar.
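
If you want to convince yourself, the following rough sketch compares the two cases with Python's zlib module. The sample data is invented: repetitive text stands in for an uncompressed file, and random bytes stand in for something that is already compressed, such as a JPEG or ZIP file.

```python
import os
import zlib

text_like = b"The Result is 3.14\n" * 500  # repetitive, compresses well
random_like = os.urandom(len(text_like))    # stand-in for already-compressed data

for label, payload in (("text-like", text_like), ("random", random_like)):
    packed = zlib.compress(payload, 9)      # 9 = best compression level
    print(f"{label}: {len(payload)} -> {len(packed)} bytes")
```

The text-like sample shrinks to a small fraction of its size, while the random sample stays about as large as it was. Compressing a second time gains nothing.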

This leaves uuencoding and Base64. The choice between them should depend on your mail software. If it is MIME-compliant and offers to "attach" files encoded in Base64, use this encoding. Base64 is the preferred method for MIME messages.

Otherwise, use uuencoding, which is still the most common encoding method. Because more and more software is becoming MIME-compliant, uuencoding is expected to be completely replaced by Base64 eventually, but as long as there is the possibility that the recipient still uses old software, uuencoding is the safest method.

Note that MIME-compliance is something only the mail software can handle. For example, if you encode data to Base64 using UUEnview and then paste the encoded data into your message, the resulting message will not be MIME-compliant! This is important to realize. If your mail software does not allow "attachments" on its own and you have to use an external encoder, always use uuencoding.
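
To illustrate what "handled by the mail software" means, here is a minimal sketch using Python's standard email package as a stand-in for a MIME-compliant mailer. The addresses, the file name and the payload are placeholders. The point is that the software, not the user, adds the MIME headers around the Base64 data.

```python
from email.message import EmailMessage

payload = bytes(range(256))  # hypothetical binary file contents

msg = EmailMessage()
msg["From"] = "sender@example.org"     # placeholder addresses
msg["To"] = "recipient@example.org"
msg["Subject"] = "Test file - test.bin (001/001)"
msg.set_content("A short description of the attached file.")

# The library adds the MIME headers and Base64-encodes the attachment.
msg.add_attachment(payload, maintype="application",
                   subtype="octet-stream", filename="test.bin")

attachment = next(msg.iter_attachments())
print(msg["MIME-Version"])                      # 1.0
print(msg.get_content_type())                   # multipart/mixed
print(attachment["Content-Transfer-Encoding"])  # base64
print(attachment.get_content() == payload)      # True
```

Pasting the same Base64 text into the body of an ordinary message would carry none of these headers, so the recipient's software would have no way to know what it is looking at.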

To Split or not to Split

Then there's another question that applies equally to all methods, the question of splitting files. Sending everything in a single mail or posting is easiest, both for the sender and the recipient, because you don't have to fuss around with splitting and putting parts together. But you must make sure that the mails or postings aren't truncated somewhere on their way. This isn't much of a problem with email any more; usually you can send megabytes or more at once. I suggest sending the full file at first, and then asking the recipient if (s)he got all of it. If not, you can experiment to find the maximum size and then stick to it.

But news is a different matter. There are still some gateways around that allow no more than a fixed size. The semi-accepted limit is to send only a thousand lines of encoding per post, and to split large files into parts of a thousand lines each.
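
The splitting itself is simple bookkeeping. The sketch below shows one possible way to do it in Python with uuencoded lines from the standard binascii module; the function names and the sample data are invented for illustration, and the "begin"/"end" framing lines of a real uuencoded file are left out.

```python
import binascii

LINES_PER_PART = 1000  # the semi-accepted limit mentioned above
BYTES_PER_LINE = 45    # a full uuencoded line carries 45 bytes

def uuencode_lines(data: bytes) -> list[bytes]:
    """Encode data into uuencoded lines (without 'begin'/'end' framing)."""
    return [binascii.b2a_uu(data[i:i + BYTES_PER_LINE])
            for i in range(0, len(data), BYTES_PER_LINE)]

def split_parts(lines: list[bytes], size: int = LINES_PER_PART) -> list[list[bytes]]:
    """Group encoded lines into parts of at most `size` lines each."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def join_and_decode(parts: list[list[bytes]]) -> bytes:
    """Recipient side: put the parts back in order and decode every line."""
    return b"".join(binascii.a2b_uu(line) for part in parts for line in part)

data = bytes(range(256)) * 400  # 102,400 bytes of sample data
lines = uuencode_lines(data)
parts = split_parts(lines)

print(len(lines), len(parts))          # 2276 3
print(join_and_decode(parts) == data)  # True
```

Note that the recipient has to put the parts back in the right order before decoding, which is why the part numbering in the subject line matters.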

Include Information

When you mail or post a file, people will want to know what they can expect from it before having to decode it. Also, most people will have to download the encoded file before they can even decode it. These people can get quite annoyed if they discover they've downloaded something entirely inappropriate. To avoid annoying people, you should always send a short message saying what the file is all about. Don't just say, "you must have this!"; make the message informative.

For small files with fewer than a thousand lines of encoding, you can include this message with the encoding, but if it's more, you should send a separate mail or post.

Composing a Subject Line

The last problem we have to face is how to build a subject line for the mail or the posting, so that people can easily spot the file and, in the case of multiple parts, know which postings belong to the same file. Here's an example:

UUDeview 0.5a for Windows - uudvw05a.zip (001/004)

First on the subject line is a short description of the file, less than 40 characters. Then, separated by a dash, comes the original filename. Last, enclosed in parentheses, is the number of this part and the total number of parts. In this case, the reader will know that (s)he also has to get parts two to four to decode the file. A subject line like this includes all necessary information.

The informative message from above is usually sent as the zeroth part. This zeroth part should only be a textual description and should not include any encoded data. If a part number zero is present, people will read it, and only this part, to see whether they want to decode the rest of the file or not.

BTW, you should also include the part numbering if there's only a single part. This should then read (001/001). Then people will know they don't have to search for more.
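
The format is regular enough to be built and taken apart by a program. The following Python sketch shows one way to do both. The helper names and the regular expression are my own illustration of the layout described above; they are not how UUDeview or any other decoder actually parses subject lines.

```python
import re

def make_subject(description: str, filename: str, part: int, total: int) -> str:
    """Build 'description - filename (NNN/MMM)' with zero-padded numbers."""
    return f"{description} - {filename} ({part:03d}/{total:03d})"

# Hypothetical pattern for the layout above; assumes no spaces in the filename.
SUBJECT_RE = re.compile(
    r"^(?P<description>.+) - (?P<filename>\S+) \((?P<part>\d+)/(?P<total>\d+)\)$"
)

def parse_subject(subject: str):
    """Return (description, filename, part, total), or None if it doesn't fit."""
    m = SUBJECT_RE.match(subject)
    if m is None:
        return None
    return m["description"], m["filename"], int(m["part"]), int(m["total"])

print(make_subject("UUDeview 0.5a for Windows", "uudvw05a.zip", 1, 4))
# UUDeview 0.5a for Windows - uudvw05a.zip (001/004)
print(parse_subject("UUDeview 0.5a for Windows - uudvw05a.zip (001/004)"))
# ('UUDeview 0.5a for Windows', 'uudvw05a.zip', 1, 4)
print(parse_subject("you must have this!"))
# None
```

A subject line that doesn't follow the layout simply fails to parse, which is roughly the position a human reader is in, too.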

 

Summary

To summarize our discussion ...

  • If your mail or news software is MIME-compliant and allows you to directly "attach" binary files, use Base64. If you use an external encoder to include encoded data in your messages, use uuencoding.
     
  • There's no real need to split encoded files for email transmission; but restrict yourself to no more than 1000 lines per part when posting into a newsgroup (unless it's a really huge file; don't post more than 100 parts).
     
  • Create a subject line with all necessary information for the reader and the decoder. UUDeview will need a proper subject line to combine the parts.
     
  • Send a part number zero with a description of the file.

Well, I hope you've enjoyed this little introduction, and learned a little from it. In fact, I hope you'll stick to the guidelines mentioned here. Lots of other people and I are sick of trying to decode files that were sent without any guidelines in mind.