Nice Clean Example

Encoding and Decoding. Hello Unicode, goodbye ASCIIZ strings

In this article I'm going to talk about string and character processing in the .NET environment, and how the Encoding class and its Decoder counterpart seem to be ever-present when working with strings. I'll talk about Unicode characters and the essential role of these two classes.

There are two classes that hang around you like mosquitos whenever you are working with strings in .NET and you want to do I/O.

  1. Encoding class (System.Text.Encoding)
  2. Decoder class (System.Text.Decoder, the decoding counterpart)

If you've ever wanted to write a string value directly to a stream and discovered that it needed to be converted to an array of bytes, then you've run across this code:

byte[] byteList = Encoding.UTF8.GetBytes("string to write out");

or perhaps

byte[] byteList = Encoding.ASCII.GetBytes("string to write out");
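To make the connection to I/O concrete, here is a minimal sketch of writing a string to a stream and reading it back. The names StreamRoundTrip and WriteAndReadBack are made up for illustration, and a MemoryStream stands in for whatever file or socket you would really be using.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRoundTrip
{
    // Hypothetical helper: encode text, push the bytes through a stream,
    // then decode what comes out the other side.
    public static string WriteAndReadBack(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            // A Stream only understands bytes, so the string is encoded first.
            byte[] byteList = encoding.GetBytes(text);
            stream.Write(byteList, 0, byteList.Length);

            // Decode the raw bytes back to a string with the same encoding.
            byte[] received = stream.ToArray();
            return encoding.GetString(received);
        }
    }

    public static void Main()
    {
        Console.WriteLine(WriteAndReadBack("string to write out", Encoding.UTF8));
    }
}
```

The point to notice is that the same Encoding has to be used on both sides of the stream.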

Depending on what programming languages you have used before, it may or may not be obvious why there would be a class called "Encoding" that would be necessary to convert a string into an array of bytes. While it's handy to have this utility function, you'd think it could be just something like this:

byte[] byteList = System.Text.Utility.GetBytes("string to write out");

When I started programming in the 80's I used FORTRAN in Physics, and I will never forget the confusion that accompanied learning that the letters, numbers and punctuation were represented as numbers internally. Of course, until you start programming it would never occur to you to think that way -- the letter 'A' was the letter 'A', it did not need further decomposition or any other representation.

Each character had a numeric value and a hexadecimal representation consisting of two hex digits; 'A' was '41'. There was not a small amount of spinning as we all tried to keep it straight -- 'x41' meant the capital letter 'A', even though '41' consisted of two digits, '4' and '1', each of which had a value, 0x34 and 0x31 in this case, which of course led to more spinning. We learned that there was a name for this mapping of number to character, "ASCII", for the American Standard Code for Information Interchange.

The C programming language taught us a model of simplicity for character and string processing. The 'char' datatype was an 8 bit value (signed on most compilers of the day, though the language leaves that up to the implementation), and a quoted string consisted of 8 bit byte values encoded in ASCII, with a trailing 0, also referred to as ASCIIZ.

char *str = "This is a string"; // Encoded as ASCII with a trailing null

When I worked on IBM machines I learned that there was an entirely different way to encode the letter 'A', this time as xC1, in a scheme called EBCDIC. When PCs were introduced and evolved we were flooded with "code pages": different maps of values that represented different kinds of characters. There were API calls to set and get these code pages, and changing them would alter what glyph displayed when certain values were plunked into screen memory at, I believe, segment 0xB800, which is physical address 0xB8000 (won't swear to that).

<memory_lane>
Remember attribute bytes for color and blinking?
</memory_lane>

I think the most important development in the brief history of character encoding is the development of Unicode, which is, yes, another scheme for mapping numbers to letters, but one which established a very lofty goal of cleaning up the mess, best articulated on the Unicode Consortium's web site.

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Zowie! I'm sold. It's no wonder that the .NET platform made it core to string and character processing. If you're running Visual Studio (2005 anyway) and you mouse over the 'string' type in C# and the 'char' type you will see that they are defined as Unicode types.

string hw =  "Hello World";  // Represents text as a series of Unicode characters
char c = 'a';                // Represents a 16 bit Unicode character

One of the reasons that Unicode isn't obvious with most of our work as programmers on .NET is that the Unicode characters in the range 0 to 127 (called "Basic Latin") are the same as ASCII characters in that range. The designers of Unicode were not stupid people ;-)

char c = 'A'; // has the value 0x0041, ASCII is 0x41
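A quick way to convince yourself of the overlap is to compare the bytes directly. This is only a sketch; BasicLatinCheck and SameBytesInAsciiAndUtf8 are names I invented for the example.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BasicLatinCheck
{
    // Hypothetical helper: true when ASCII and UTF8 produce identical bytes.
    public static bool SameBytesInAsciiAndUtf8(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return ascii.SequenceEqual(utf8);
    }

    public static void Main()
    {
        char c = 'A';
        Console.WriteLine(((int)c).ToString("X4"));                // 0041
        Console.WriteLine(SameBytesInAsciiAndUtf8("Hello World")); // True
    }
}
```

For text that stays inside Basic Latin, the ASCII and UTF8 byte arrays come out the same, which is why the difference is so easy to overlook.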

How many bytes a Unicode character occupies depends on how it is encoded (anywhere from 1 to 4), but that's not the most important point about Unicode in getting to the bottom of why encoding and decoding are ever present. Unicode doesn't just define the fact that 'A' has the numerical value 0x0041. The Unicode standard has a say in how the binary values are laid out in memory when you string them together.
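If you want to see how the storage cost of a single character varies, counting bytes per encoding is a simple experiment. The sketch below assumes nothing beyond the standard System.Text encodings; ByteCounts and CountBytes are illustrative names of my own.

```csharp
using System;
using System.Text;

public static class ByteCounts
{
    // Hypothetical helper: byte counts for UTF8, UTF16 and UTF32, in that order.
    public static int[] CountBytes(string text)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text),
            Encoding.UTF32.GetByteCount(text)
        };
    }

    public static void Main()
    {
        // "\u00E9" is e-acute, "\u20AC" is the euro sign, and
        // "\U0001F600" is an emoji that does not fit in 16 bits.
        foreach (string s in new[] { "A", "\u00E9", "\u20AC", "\U0001F600" })
        {
            int[] n = CountBytes(s);
            Console.WriteLine("UTF8={0} UTF16={1} UTF32={2}", n[0], n[1], n[2]);
        }
    }
}
```

The four lines printed are 1/2/4, 2/2/4, 3/2/4 and 4/4/4, so the same character can cost anywhere from one to four bytes depending on the form you pick.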

There are three ways to represent a Unicode string in memory: an array of bytes (UTF8), an array of shorts (UTF16), and an array of 4 byte words (UTF32).

The fact that there are three ways to store a Unicode string that the standard defines makes Unicode very different from ASCII or EBCDIC or a code page -- in those schemes it takes one or two bytes to hold the value and there is nothing about the standard which dictates how that's represented in memory if you string them together -- the layout is directly inferred from the storage requirement. An array of ASCII chars is necessarily an array of 8 bit bytes. Duh!

This layout issue matters because it's part of the handshaking that two sides of a transfer go through when exchanging a Unicode string. When the sender side ships the string off to the brave world the receiving side has to know if it's got an 8, 16 or 32 bit storage mechanism on its hands. The Unicode standard defines how that layout occurs, even to the level of whether something is little-endian or big-endian.
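The endian question is easy to see in .NET, because the framework ships both byte orders of UTF16: Encoding.Unicode is little-endian and Encoding.BigEndianUnicode is big-endian. A small sketch (ByteOrderDemo and Hex are hypothetical names):

```csharp
using System;
using System.Text;

public static class ByteOrderDemo
{
    // Hypothetical helper: encode text and format the bytes as hex.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(Hex(Encoding.Unicode, "Hi!"));          // 48-00-69-00-21-00
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, "Hi!")); // 00-48-00-69-00-21
    }
}
```

Same string, same 16 bit form, and still two different byte sequences on the wire.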

This brings us to Encoding and Decoding. The process by which a 16 bit Unicode string is delivered unto the world involves Encoding it into one of the three different forms, UTF8, UTF16 or UTF32. The inverse process, going from an array of bytes into Unicode text, is called Decoding.
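For the decoding direction, here is a hedged sketch of the two usual routes in .NET. When you have the whole buffer, Encoding.GetString is enough. When bytes arrive in pieces, a Decoder obtained from Encoding.GetDecoder() remembers a partial character between calls. DecodingDemo and DecodeChunks are names invented for this example.

```csharp
using System;
using System.Text;

public static class DecodingDemo
{
    // Hypothetical helper: decode bytes that arrive in separate chunks.
    public static string DecodeChunks(Encoding encoding, params byte[][] chunks)
    {
        Decoder decoder = encoding.GetDecoder();
        var result = new StringBuilder();
        foreach (byte[] chunk in chunks)
        {
            // Size the output buffer for the worst case, then keep only
            // the characters the decoder actually produced.
            char[] chars = new char[encoding.GetMaxCharCount(chunk.Length)];
            int count = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            result.Append(chars, 0, count);
        }
        return result.ToString();
    }

    public static void Main()
    {
        // Whole buffer in hand: Encoding.GetString does the job.
        byte[] bytes = { 0x48, 0x69, 0x21 };
        Console.WriteLine(Encoding.UTF8.GetString(bytes)); // Hi!

        // The euro sign is E2 82 AC in UTF8; here it is split across two reads.
        string euro = DecodeChunks(Encoding.UTF8,
            new byte[] { 0xE2, 0x82 }, new byte[] { 0xAC });
        Console.WriteLine(euro == "\u20AC"); // True
    }
}
```

The first chunk on its own produces no characters at all; the Decoder holds the two bytes until the third one shows up.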

We can see this clearly with this next example. A short string is encoded into all three forms, with the results dumped in hex. You see that the binary data that results is very different, and you can see that if whoever gets this data doesn't know which form it's in, it won't make any sense.

 string hello = "Hi!";
 byte[] bytelist = Encoding.UTF8.GetBytes(hello);
 // Result is 48 69 21

 bytelist = Encoding.Unicode.GetBytes(hello);   // UTF16, little-endian
 // Result is 48 00 69 00 21 00

 bytelist = Encoding.UTF32.GetBytes(hello);     // UTF32, little-endian
 // Result is 48 00 00 00 69 00 00 00 21 00 00 00
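To see what "won't make any sense" looks like in practice, the sketch below encodes with one form and decodes with another. MismatchDemo and DecodeAs are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

public static class MismatchDemo
{
    // Hypothetical helper: the sender encodes one way, the receiver guesses another.
    public static string DecodeAs(Encoding sender, Encoding receiver, string text)
    {
        return receiver.GetString(sender.GetBytes(text));
    }

    public static void Main()
    {
        // UTF32 bytes read as UTF16: every other character is a stray null.
        string garbled = DecodeAs(Encoding.UTF32, Encoding.Unicode, "Hi!");
        Console.WriteLine(garbled.Length);   // 6
        Console.WriteLine(garbled == "Hi!"); // False
    }
}
```

Nothing throws here. The receiver simply ends up with a different string, which is what makes this kind of mistake hard to spot.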

Unicode doesn't enforce any sort of "packet"; if you are reading Unicode data and don't know how it was encoded you're up the creek. However, there is a byte marker, the byte order mark (BOM), which .NET calls a "preamble", that can be prepended to a data stream to provide information about the type of encoding, if needed.
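In .NET the preamble bytes are available from Encoding.GetPreamble(). The sketch below just dumps them for the built-in Unicode encodings; PreambleDemo and PreambleHex are names I made up.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Hypothetical helper: the preamble of an encoding, formatted as hex.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));             // EF-BB-BF
        Console.WriteLine(PreambleHex(Encoding.Unicode));          // FF-FE
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode)); // FE-FF
        Console.WriteLine(PreambleHex(Encoding.UTF32));            // FF-FE-00-00
    }
}
```

Note that GetBytes does not add the preamble for you. If you are building a byte array by hand and want the marker, you have to write it out yourself ahead of the encoded text.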

So, the purpose of the Encoding class and its Decoder counterpart is to implement transfer of Unicode strings to and from the outside world. .NET uses UTF16 as its native type for storing characters and strings. In order to go from a string or character type to a file, a socket, a web page, anywhere, the string/character data must be Encoded into a format. There are three Unicode formats: UTF8 to encode to an array of bytes, UTF16 for encoding to an array of shorts, and UTF32 to encode to an array of 4 byte words.

As you dig into this you will also see other encodings, such as ASCII. These are provided by .NET as a way of allowing different kinds of encoding. However, they are not defined by the Unicode standard; they are supplementary.
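One practical consequence of the supplementary encodings is that they can't hold everything a .NET string can. As a rough sketch (AsciiLossDemo and ThroughAscii are illustrative names), here is what the default Encoding.ASCII does with a character outside its range:

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // Hypothetical helper: push text through ASCII and back again.
    public static string ThroughAscii(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        // "\u00E9" (e-acute) has no ASCII value, so it is replaced with '?'.
        Console.WriteLine(ThroughAscii("caf\u00E9")); // caf?
        Console.WriteLine(ThroughAscii("cafe"));      // cafe
    }
}
```

The three Unicode forms will round trip any string; ASCII only does so for the Basic Latin range.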
