Computers work entirely with numbers. To represent letters and other symbols, the computer uses a number internally for each different character. Computer memory is organized into bytes, each of which can hold a number between 0 and 255. When we store text rather than numbers in the computer, each character is stored in one or more bytes as the number that represents that particular character.
There are two standard character encoding methods supported by web browsers. The oldest of these is ASCII (American Standard Code for Information Interchange). ASCII defines which text characters the byte values between 0 and 127 represent (for example, the value 65 represents the letter 'A'). ASCII does not define what the values between 128 and 255 represent, so these can stand for different characters depending on which extension of ASCII a particular web page uses. The most commonly used such extension in web pages is ISO-8859-1, which allocates specific characters to the values that ASCII leaves undefined. This gives us 256 different characters, each of which can be stored in a single byte.
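A short sketch of the mapping between characters and byte values, using Python's built-in `ord`, `chr`, and `encode` (the sample word "Année" is just an illustration chosen because it contains an accented character that ISO-8859-1 can represent):

```python
# Characters are stored as numbers: ord() gives the code, chr() reverses it.
print(ord('A'))   # 65
print(chr(65))    # 'A'

# Encoding text as ISO-8859-1 (also called "latin-1") yields exactly
# one byte per character; 'é' becomes the single byte value 233.
data = "Année".encode("iso-8859-1")
print(len(data))   # 5 characters -> 5 bytes
print(list(data))  # [65, 110, 110, 233, 101]
```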
ASCII defines 33 control characters (values 0 to 31, plus 127) and 95 printable characters (values 32 to 126). Here is a complete list of the printable ASCII characters and their decimal and hexadecimal values.
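If you want to generate such a list yourself, a few lines of Python will print every printable ASCII character alongside its decimal and hexadecimal value:

```python
# Codes 32-126 are the printable ASCII characters;
# 0-31 and 127 (DEL) are control characters.
for code in range(32, 127):
    print(f"{code:3d}  0x{code:02X}  {chr(code)}")
```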
There are many different languages in the world. When you add up all the characters they use, plus the special symbols used in mathematics and other specialist fields, 256 different values is nowhere near enough to allocate a single value to each symbol. To remedy this, an extended multi-byte encoding is needed. Unicode uses up to four bytes per character, allowing over a million different characters to each be assigned a unique value. The characters represented by values up to 127 in Unicode are exactly the same as their ASCII equivalents, making translation between the two as straightforward as possible.
Unicode can be used in web pages in any of three different forms, the most common being UTF-8, which uses a single byte for each ASCII character and two, three, or four bytes only for characters that can't be represented in a single byte. UTF-16 and UTF-32 represent exactly the same characters but use a minimum of two or four bytes per character respectively, which just makes the web page source bigger.
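The size difference between the three forms is easy to demonstrate by encoding the same two-character string each way (the big-endian variants are used here so that no byte-order mark is added to the output):

```python
text = "A€"  # 'A' is ASCII; '€' (U+20AC) is not

for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(encoding)
    print(encoding, len(data), list(data))

# utf-8:     1 byte for 'A' + 3 bytes for '€' -> 4 bytes total
# utf-16-be: 2 bytes per character            -> 4 bytes total
# utf-32-be: 4 bytes per character            -> 8 bytes total
```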
This tutorial first appeared on www.felgall.com and is reproduced here with the permission of the author.