In this article I'm going to talk about string and character processing in
the .NET environment, and how the Encoding and Decoder classes seem to be ever-present when working with strings. I'll talk about Unicode characters and the essential role of these two classes.
There are two classes that hang around you
like mosquitoes whenever you are working with strings in .NET and you want to
do I/O:
- Encoding class
- Decoder class
If you've ever wanted to write a string value directly to a stream and
discovered that it needed to be
converted to an array of bytes, then you've run across this code:
byte[] byteList = Encoding.UTF8.GetBytes("string to write out");
or perhaps
byte[] byteList = Encoding.ASCII.GetBytes("string to write out");
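Here is a minimal sketch of the situation described above, where a string has to become bytes before a stream will take it. The MemoryStream is my stand-in for a file or socket, and the variable names are just for illustration.

```csharp
// Top-level statements (C# 9+); wrap in a Main method on older compilers.
using System;
using System.IO;
using System.Text;

string text = "string to write out";

// A Stream only accepts bytes, so the string has to be encoded first.
byte[] byteList = Encoding.UTF8.GetBytes(text);

using var stream = new MemoryStream();   // stand-in for a file or socket
stream.Write(byteList, 0, byteList.Length);

// Reading back is the reverse trip: bytes in, string out.
string roundTrip = Encoding.UTF8.GetString(stream.ToArray());
Console.WriteLine(roundTrip);
Console.WriteLine(stream.Length);        // number of bytes written
```

The same encoding has to be used in both directions, which is the point the rest of this article works toward.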
Depending on what programming languages you have used before, it may or may not be obvious why
there would be a class called "Encoding" that would be necessary to
convert a string into an array of bytes. While it's handy to have this utility function,
you'd think it could be just something like this:
byte[] byteList = System.Text.Utility.GetBytes("string to write out"); // wishful thinking: no such class exists in .NET
When I started programming in the 80's I used FORTRAN in Physics, and I will
never forget the confusion that
accompanied learning that letters, numbers and
punctuation were represented as numbers internally. Of course, until you
start programming it would never occur to you to think that way -- the letter 'A' was
the letter 'A', it did not need further decomposition or any other representation.
Each character had a numeric value and
a hexadecimal representation consisting of two hex digits; 'A' was '41'. There was
not a small amount of spinning as we all tried to keep it straight -- 'x41' meant
the capital letter 'A', even though '41' consisted of two digits, '4' and '1',
each of which had a value, 0x34 and 0x31 in this case, which of course led to more spinning.
We learned that there was a name for this mapping of number to character, "ASCII", for the American Standard Code for Information Interchange.
The C programming language taught us a model of simplicity
for character and string processing. The 'char' datatype
was an 8 bit value (signed or unsigned depending on the compiler), and a quoted string consisted of 8 bit byte
values encoded in ASCII, with a trailing 0, also referred to as ASCIIZ.
const char *str = "This is a string"; // Encoded as ASCII with a trailing null
When I worked on IBMs I learned that there was an entirely
different way to encode the letter 'A', this time as xC1, in a scheme called EBCDIC. When
PCs were introduced and evolved we were flooded with "code pages": different maps of values that
represented different kinds of characters. There were API calls to set and get these
code pages, and changing them would alter what glyph displayed when certain values were plunked
into screen memory at segment 0xB800 (physical address 0xB8000) in color text mode.
<memory_lane>
Remember attribute bytes for color and blinking?
</memory_lane>
I think the most important development in the brief history of character encoding
is the development of Unicode, which is, yes,
another scheme for mapping numbers to letters, but one which established a
very lofty goal of cleaning
up the mess, best articulated on their web site:
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
Zowie! I'm sold. It's no wonder that the .NET platform made it core to string and character processing.
If you're running Visual Studio (2005 anyway) and you mouse over the 'string' type in C# and the
'char' type you will see that they are defined as Unicode types.
string hw = "Hello World"; // Represents text as a series of Unicode characters
char c = 'a'; // Represents a 16 bit Unicode character
One of the reasons that Unicode isn't obvious with most of our work as programmers on
.NET is that the Unicode characters in the range 0 to 127 (called "Basic Latin")
are the same as ASCII characters in that range. The designers of Unicode were
not stupid people ;-)
char c = 'A'; // has the value 0x0041, ASCII is 0x41
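If you want to convince yourself of the overlap, here is a small sketch that compares every value from 0 to 127 against what the ASCII encoder produces. The variable names are mine; the calls are the stock System.Text ones.

```csharp
// Top-level statements (C# 9+); wrap in a Main method on older compilers.
using System;
using System.Text;

char c = 'A';
int numeric = c;                                 // implicit char -> int conversion
Console.WriteLine(numeric.ToString("X4"));       // prints 0041

// Check every value in the Basic Latin range against the ASCII encoder.
bool allMatch = true;
for (int i = 0; i <= 127; i++)
{
    byte[] ascii = Encoding.ASCII.GetBytes(((char)i).ToString());
    if (ascii.Length != 1 || ascii[0] != i)
    {
        allMatch = false;
    }
}
Console.WriteLine(allMatch);                     // prints True
```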
Unicode characters range from U+0000 to U+10FFFF and take anywhere from 1 to 4 bytes each,
depending on how they are stored, but that's not the
most important point about Unicode in getting to the bottom of
why the Encoding and Decoder classes are ever-present. Unicode doesn't just
define the fact that 'A' has the numerical value 0x0041. The Unicode standard
has a say in how the binary values are laid out in memory when you string them together.
There are three ways to represent a Unicode string in memory: an array of bytes (UTF8),
an array of shorts (UTF16), and an array of 4 byte words (UTF32).
The fact that there are three ways to store a Unicode string that the standard
defines makes Unicode very different
from ASCII or EBCDIC or a code page -- in those schemes it takes one or two bytes to hold the
value and there is nothing about the standard which dictates how that's represented in
memory if you string them together -- the layout is directly inferred from the storage
requirement. An array of ASCII chars is necessarily an array of 8 bit bytes. Duh!
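To make the three storage forms concrete, here is a rough sketch that asks each encoder how many bytes it needs for a handful of characters. The four sample characters are my own picks (a Basic Latin letter, an accented letter, the euro sign, and an emoji outside the 16 bit range).

```csharp
// Top-level statements (C# 9+); wrap in a Main method on older compilers.
using System;
using System.Text;

// 'A', e-acute, the euro sign, and an emoji (U+1F600), written as escapes.
string[] samples = { "A", "\u00E9", "\u20AC", "\U0001F600" };

foreach (string s in samples)
{
    Console.WriteLine(
        "U+{0:X4}: UTF8={1} UTF16={2} UTF32={3} bytes",
        char.ConvertToUtf32(s, 0),            // the character's Unicode number
        Encoding.UTF8.GetByteCount(s),
        Encoding.Unicode.GetByteCount(s),     // Encoding.Unicode is UTF16
        Encoding.UTF32.GetByteCount(s));
}
```

The UTF8 column grows from 1 to 4 bytes as the character numbers get larger, UTF16 needs 2 bytes until the emoji forces it to 4, and UTF32 is 4 bytes throughout.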
This layout issue matters because it's part of the handshaking that two sides of a transfer
go through when exchanging a Unicode string. When the sender side ships the string off
to the brave world, the receiving side has to
know if it's got an 8, 16 or 32 bit storage mechanism on its hands. The Unicode standard
defines how that layout occurs, even to the level of whether something is little-endian or
big-endian.
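The byte order point is easy to see with a single letter. This is just a sketch using the encodings that ship with .NET; the variable names are mine.

```csharp
// Top-level statements (C# 9+); wrap in a Main method on older compilers.
using System;
using System.Text;

string a = "A";

// Same character, same 16 bit form, opposite byte order.
string little = BitConverter.ToString(Encoding.Unicode.GetBytes(a));
string big = BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(a));

// The 32 bit form has the same choice: (bigEndian: true, byteOrderMark: false).
string big32 = BitConverter.ToString(new UTF32Encoding(true, false).GetBytes(a));

Console.WriteLine(little);   // 41-00
Console.WriteLine(big);      // 00-41
Console.WriteLine(big32);    // 00-00-00-41
```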
This brings us to Encoding and Decoding. The process by which a 16 bit Unicode string
is delivered unto the world involves Encoding it into one of the three different
forms, UTF8, UTF16 or UTF32. The inverse process, going from an array of bytes into Unicode text,
is called Decoding.
We can see this clearly with this next example. A short string is encoded into all
three forms, with the results dumped in hex. You see that the
binary data that results is very different, and you can see that if whoever gets this
data doesn't know which form it's in, it won't make any sense.
string hello = "Hi!";
byte[] bytelist = Encoding.UTF8.GetBytes(hello);
// Result is 48 69 21
bytelist = Encoding.Unicode.GetBytes(hello);
// Result is 48 00 69 00 21 00
bytelist = Encoding.UTF32.GetBytes(hello);
// Result is 48 00 00 00 69 00 00 00 21 00 00 00
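To see what "won't make any sense" looks like in practice, here is a sketch that decodes the 16 bit bytes twice, once with the matching encoding and once with the wrong one. Nothing here throws; the wrong decoder simply hands back a different string.

```csharp
// Top-level statements (C# 9+); wrap in a Main method on older compilers.
using System;
using System.Text;

string hello = "Hi!";
byte[] utf16 = Encoding.Unicode.GetBytes(hello);   // 48 00 69 00 21 00

string right = Encoding.Unicode.GetString(utf16);  // decoded with the matching form
string wrong = Encoding.UTF8.GetString(utf16);     // every 00 byte becomes a NUL char

Console.WriteLine(right == hello);   // True
Console.WriteLine(wrong == hello);   // False
Console.WriteLine(wrong.Length);     // 6
```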
Unicode doesn't enforce any sort of "packet". If you are reading Unicode data
and don't know how it was encoded, you're up the creek. However, there is
a marker, which Unicode calls a "byte order mark" and .NET calls a "preamble", that can be
prepended to a data stream to identify the type of encoding, if needed.
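Here is a sketch of the preamble at work. It prepends the 16 bit preamble by hand, then lets a StreamReader figure out the encoding even though I deliberately tell it UTF8. The MemoryStream and variable names are illustrative only.

```csharp
// Top-level statements (C# 9+); wrap in a Main method on older compilers.
using System;
using System.IO;
using System.Text;

// Each encoding reports its own marker bytes.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble()));     // EF-BB-BF
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));  // FF-FE

// GetBytes does not add the preamble, so write it ahead of the payload.
byte[] preamble = Encoding.Unicode.GetPreamble();
byte[] payload = Encoding.Unicode.GetBytes("Hi!");

using var ms = new MemoryStream();
ms.Write(preamble, 0, preamble.Length);
ms.Write(payload, 0, payload.Length);
ms.Position = 0;

// Third argument: detect the encoding from the byte order mark.
using var reader = new StreamReader(ms, Encoding.UTF8, true);
string decoded = reader.ReadToEnd();
Console.WriteLine(decoded);          // Hi!
```

Without the preamble the reader would have trusted the UTF8 hint and produced the garbled result instead.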
So, the purpose of the Encoding and Decoder classes is to
implement transfer of Unicode strings to and from the outside world. .NET uses
UTF16 as its native type for storing characters and strings. In order to go from
a string or character type to a file, a socket, a web page, anywhere, the string/character data
must be Encoded into a format. There are three Unicode formats: UTF8 to encode
to an array of bytes, UTF16 for encoding to an array of shorts, and UTF32 to encode to
an array of 4 byte words.
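One place a separate decoder object earns its keep is when bytes arrive in pieces, as they do from a socket. The sketch below assumes a three byte character (the euro sign) that gets split across two reads; the chunk sizes and names are made up for the illustration.

```csharp
// Top-level statements (C# 9+); wrap in a Main method on older compilers.
using System;
using System.Text;

byte[] euro = Encoding.UTF8.GetBytes("\u20AC");   // E2 82 AC
Decoder decoder = Encoding.UTF8.GetDecoder();     // remembers partial characters
char[] buffer = new char[4];

// First chunk: only two of the three bytes, so no character can be produced yet.
int first = decoder.GetChars(euro, 0, 2, buffer, 0);

// Second chunk: the final byte completes the character.
int second = decoder.GetChars(euro, 2, 1, buffer, first);

Console.WriteLine(first);                 // 0
Console.WriteLine(second);                // 1
Console.WriteLine(buffer[0] == '\u20AC'); // True
```

Calling Encoding.UTF8.GetString on each chunk separately would not work here, because it has no memory of the leftover bytes from the previous call.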
As you dig into this you will also see other encodings, such as ASCII. These are provided by
.NET as a way of allowing different kinds of encoding. However, they are
not defined by the Unicode standard; they are supplementary.
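One practical difference with these supplementary encodings is that they can't hold every Unicode character. A quick sketch, using a word I picked because it has one character outside the ASCII range:

```csharp
// Top-level statements (C# 9+); wrap in a Main method on older compilers.
using System;
using System.Text;

string cafe = "caf\u00E9";   // "cafe" with an acute accent on the e

byte[] ascii = Encoding.ASCII.GetBytes(cafe);
byte[] latin1 = Encoding.GetEncoding("iso-8859-1").GetBytes(cafe);
byte[] utf8 = Encoding.UTF8.GetBytes(cafe);

Console.WriteLine(BitConverter.ToString(ascii));    // 63-61-66-3F
Console.WriteLine(BitConverter.ToString(latin1));   // 63-61-66-E9
Console.WriteLine(BitConverter.ToString(utf8));     // 63-61-66-C3-A9

// The ASCII encoder silently replaced the accented letter with '?'.
Console.WriteLine(Encoding.ASCII.GetString(ascii)); // caf?
```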