 |
Character: A symbol, such as a letter of the
alphabet, a digit, a simple musical note, etc.
|
 |
Character encoding: Mapping a set of natural language
characters to a set of octets (or bits or bytes). In the simplest form, each
character in a repertoire is mapped to an integer in the range 0 - 255. For example, as defined by the Extended ASCII (DOS)
character code, the sequence “162” represents an “ó”, “164”
represents an “ñ”, “142” represents an “Ä”, etc. (See table below.) Several different character codes support English; some
of these are ASCII, ISO 8859-1, Windows Code Page 1252, and EBCDIC.
(Note: ASCII stands for "American Standard Code for Information
Interchange",
EBCDIC for "Extended Binary Coded Decimal Interchange Code" and
ISO for "International Organization for Standardization".)
|
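As a rough illustration of the mapping described above, the sketch below looks up the integer that several single-byte encodings assign to a character. The codec names (`cp437` for Extended ASCII (DOS), `latin-1` for ISO 8859-1, `cp1252` for Windows Code Page 1252, `cp037` for one EBCDIC variant) are Python's labels and are assumptions of this example, not names used in this article.

```python
# Sketch: look up the integer several single-byte encodings assign to a character.
# Codec names are Python's; cp037 is only one of several EBCDIC variants.
ENCODINGS = ["ascii", "cp437", "latin-1", "cp1252", "cp037"]

def code_positions(char):
    """Return {encoding: integer, or None if the character is not in that set}."""
    result = {}
    for name in ENCODINGS:
        try:
            result[name] = char.encode(name)[0]
        except UnicodeEncodeError:
            result[name] = None  # character is outside this repertoire
    return result

print(code_positions("A"))  # same value in the ASCII-based sets, different in EBCDIC
print(code_positions("ó"))  # absent from ASCII, 162 in cp437
```

The same character can therefore have different numbers in different character codes, which is why the encoding in use always has to be known.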
 |
Character code:
Computers work with numbers; therefore, in the computer world, characters are
represented with numbers. A character code is a mapping of
a set of natural language characters to unique numerical sequences.
|
 |
Character set: A collection of characters, such as
“a”, “b”, “ó”, “ñ”, “Ä”, and so on.
|
 |
Charset: Used to refer to an encoding,
such as the one often declared in the meta tags of HTML pages. For
example: <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">,
or <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">.
(Note: Meta tags are HTML code placed in the header of a web page,
enclosed in "<" and ">" brackets, that provides information that is not
displayed in the browser.)
Single, Double and Multi-Byte Character Sets
When computers “spoke” mainly English, 128 code positions or code elements were
enough to represent all the letters, punctuation marks and control characters
required to write English:
| Character | Code positions |
| Capital letters A-Z | 26 |
| Lowercase letters a-z | 26 |
| Digits (0 to 9) | 10 |
| Punctuation marks (. , + “ ‘ and so on) | 32 |
| Space | 1 |
| Control characters (CR, Tab, LF, etc.) | 33 |
| Total code positions | 128 |
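The counts in the table above can be checked with a short sketch that sorts the 128 ASCII code positions into the same categories. It relies on the constants of Python's standard `string` module; the category names are chosen here only for illustration.

```python
import string

# Sketch: classify the 128 ASCII code positions (0-127) into the table's categories.
counts = {
    "capital letters": 0,
    "lowercase letters": 0,
    "digits": 0,
    "punctuation": 0,
    "space": 0,
    "control characters": 0,
}
for code in range(128):
    ch = chr(code)
    if ch in string.ascii_uppercase:
        counts["capital letters"] += 1
    elif ch in string.ascii_lowercase:
        counts["lowercase letters"] += 1
    elif ch in string.digits:
        counts["digits"] += 1
    elif ch in string.punctuation:
        counts["punctuation"] += 1
    elif ch == " ":
        counts["space"] += 1
    else:
        counts["control characters"] += 1  # codes 0-31 plus 127 (DEL)

print(counts)
print(sum(counts.values()))
```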
Single and Double Byte Character Sets
A byte is the amount of storage needed for one character in a single-byte
character set. (The name is often explained as a short form of "binary term",
although it was coined as a respelling of "bite".) Also known as an octet, a byte
consists of 8 bits and can represent numeric values ranging from 0 (zero) to 255.
Bit stands for
"binary digit", the smallest unit of information in a
computer. A bit holds either a 1 or a 0. Therefore, for more meaningful
information, bits are combined into larger units, for example a byte, which is made up of 8 consecutive bits.
The number of unique characters in a natural language determines the amount of
storage needed for each character in a character set.
For example:
Single Byte Character Sets
A single byte can store values in the range from 0 to 255; therefore, a single
byte can identify 256 unique characters. Since most European languages have
fewer than 256 characters, their character sets consist of single-byte
characters, and applications written for European languages assume that
each character requires one byte of storage.
The ASCII code set (see table below) is a 7-bit system, i.e., the 128
configurations listed in the "Standard ASCII" column of the table below require 7 bits (2^7 = 128). However, since computers
manipulate data in 8-bit bytes, and the number of symbols required by several
languages exceeds the available 128 characters, many extensions use the 128
remaining available codes by using all 8 bits of each byte (2^8 =
256), as shown in the "Extended ASCII (DOS)" column of the table.
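The arithmetic behind the 7-bit and 8-bit ranges can be sketched as follows. The variable names are illustrative only.

```python
# Sketch: how many distinct values fit in 7 bits and in 8 bits.
seven_bit = 2 ** 7   # 128 code positions: Standard ASCII
eight_bit = 2 ** 8   # 256 code positions: one full byte

# A byte written out as its 8 bits.
bits_of_A = format(ord("A"), "08b")   # "A" is 65, so the highest bit is 0
bits_of_162 = format(162, "08b")      # 162 needs the eighth (highest) bit

print(seven_bit, eight_bit)
print(bits_of_A, bits_of_162)
print(162 >= seven_bit)  # True: 162 lies in the extended half of the table
```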

Double and Multi-byte Character Sets
Languages such as Chinese, Japanese and Korean have thousands of characters.
For example, the Chinese script includes more than ten thousand basic
ideographic characters, known as “hanzi” (also romanized “han-tzu”) in Chinese, “kanji” in Japanese and
“hanja” in Korean. Obviously, 256 characters
or code positions are not enough to handle these languages.
The solution: Use double-byte or 2-byte values to expand the code set. Some operating systems,
such as UNIX, use as many as 4 bytes to represent these characters. Character
sets that mix single-byte and 2-, 3- and even 4-byte characters are “Multibyte Character
Sets”.
(Note: The Japanese language presented a particular challenge to encoders,
because it mixes single and double-byte characters. The Japanese “katakana”, or phonetic syllabary often used to
represent foreign words, can be represented by single-byte characters, but kanji
(the Chinese ideograms) are two-byte characters.) (Note: A syllabary is a series
or set of written characters, each of which is used to represent a
syllable, as distinguished from an alphabet. Reference: Merriam Webster
Unabridged Dictionary.)
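The mix of single-byte and multi-byte characters can be observed with a small sketch that counts the bytes a character occupies. The codec names `shift_jis` and `utf-8` are Python's labels and are used here only as examples of multibyte character sets. The helper `byte_length` is hypothetical.

```python
# Sketch: count the bytes one character occupies in a given encoding.
def byte_length(char, encoding):
    return len(char.encode(encoding))

samples = {
    "A": "Latin letter",
    "\uff71": "half-width katakana",  # U+FF71
    "漢": "kanji / hanzi",
}
for char, label in samples.items():
    print(label, byte_length(char, "shift_jis"), byte_length(char, "utf-8"))

# A character beyond the first 65,536 Unicode code points needs 4 bytes in UTF-8.
print(byte_length("\U00020000", "utf-8"))
```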
Input Methods
Originally, computer keyboards were designed for 7-bit ASCII. They
included all the keys necessary to write English. Therefore, if you
are using a standard English keyboard, you will notice that you do not
have the keys to input the accented and other special characters required to
write other languages, even those that use the Latin alphabet, like French or Spanish.
And you definitely do not have sufficient keys to input the thousands
of characters
required for CJKV (Chinese-Japanese-Korean-Vietnamese).
For European languages, there are international keyboard drivers, available upon installation
of Microsoft applications, that facilitate the input of the appropriate
characters for a particular language. But if you want to write in Spanish using
your standard English keyboard, you can also press and hold the Alt key while you key
in the number sequence for a particular character of a
character set. For example, as per the Extended ASCII codes in the example
above, by pressing and holding the Alt key and keying in the number sequence
1-6-2, you can input an "ó" (an o with an acute accent).
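The Alt-key number for a character can be looked up with a sketch like the one below. It assumes the keyboard uses the Extended ASCII (DOS) table, which Python exposes under the codec name `cp437`. The helper `alt_code` is hypothetical.

```python
# Sketch: find the number to key in while holding Alt, assuming the
# Extended ASCII (DOS) table (Python codec "cp437") is in effect.
def alt_code(char):
    """Return the Alt-key number for char, or None if the table lacks it."""
    try:
        return char.encode("cp437")[0]
    except UnicodeEncodeError:
        return None

for ch in "óñÄ¿":
    print(ch, alt_code(ch))
```

Characters outside the 256-entry table, such as Chinese ideograms, return `None` here, which is why a different input method is needed for them.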
It is also possible to use your
standard English keyboard to
input Asian characters, but it requires two steps. First, the user enters a
character using Latin characters, and a conversion dictionary displays a list of
candidate characters. Then, the user selects the most appropriate choice
or requests more choices.
The following article on the Microsoft website explains how
this is done and what is required. http://www.microsoft.com/windows/ie/downloads/recommended/ime/default.mspx
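The two-step input process described above can be sketched with a toy conversion dictionary. The dictionary contents, the function names and the page size are all made up for illustration and do not reflect how any real input method editor stores its data.

```python
# Sketch of the two-step input method: type Latin text, then pick a candidate.
# CONVERSION_DICTIONARY is a tiny invented sample, not data from a real IME.
CONVERSION_DICTIONARY = {
    "han": ["汉", "韩", "寒", "含"],
    "zi": ["字", "子", "自"],
}

def candidates(latin_input, page=0, page_size=3):
    """Step 1: return one page of candidate characters for the typed Latin text."""
    matches = CONVERSION_DICTIONARY.get(latin_input, [])
    start = page * page_size
    return matches[start:start + page_size]

def select(latin_input, choice, page=0, page_size=3):
    """Step 2: return the candidate the user picked from the displayed page."""
    return candidates(latin_input, page, page_size)[choice]

print(candidates("han"))          # first page of choices
print(candidates("han", page=1))  # the user asked for more choices
print(select("han", 0) + select("zi", 0))
```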
What is Unicode?
The Unicode Standard, and the availability of
tools that support it, are among the most significant recent global software
technology trends.
“Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language.” (http://www.unicode.org/standard/WhatIsUnicode.html)
Before Unicode was invented, hundreds of
different encoding methods were used to assign unique number sequences to
character sets,
and no single encoding could contain enough characters. For example, the
European Union requires several different encodings to cover all its
languages, and no single encoding was adequate for all the letters, punctuation,
and technical symbols in common use even for a single language like English. The
proliferation of encoding systems caused problems, because different number
sequences can be used to represent the same character in different encoding systems,
and the same number sequence can represent different characters.
Therefore, computers
such as servers that need to support many different encodings run the risk of
data corruption when data is passed between these different encodings or platforms.
(Note: Platform refers to the underlying hardware or software for a system.)
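The kind of data corruption described above can be reproduced with a minimal sketch: text is written in one encoding and read back in another. The codec names are Python's, and the sample word is arbitrary.

```python
# Sketch: the same bytes read under the wrong encoding yield wrong characters.
text = "señor"

dos_bytes = text.encode("cp437")           # written with Extended ASCII (DOS)
misread_dos = dos_bytes.decode("cp1252")   # read as Windows Code Page 1252

utf8_bytes = text.encode("utf-8")          # written as UTF-8
misread_utf8 = utf8_bytes.decode("cp1252") # read as Windows Code Page 1252

print(misread_dos)   # the "ñ" has turned into a different symbol
print(misread_utf8)  # the "ñ" has turned into two symbols
```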
The Unicode Standard assigns a unique number
to each character, regardless of platform, program or language; therefore,
data can be transported through many
different systems without corruption. The
Unicode Standard has been adopted by Apple, HP, IBM, Microsoft,
Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by
modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0,
WML, etc. It is the official way to implement ISO/IEC 10646. It is also supported
by many operating systems, all modern browsers, and many other applications.
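As a small illustration of the "unique number" idea, the sketch below prints the Unicode number of a few characters and checks that the text survives a round trip through several Unicode encoding forms. It uses Python's standard `unicodedata` module; the sample text is arbitrary.

```python
import unicodedata

# Sketch: a character's Unicode number does not depend on how it is stored.
text = "Aó汉"
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Different Unicode encoding forms produce different bytes but the same text.
round_trips = [text.encode(enc).decode(enc) for enc in ("utf-8", "utf-16", "utf-32")]
print(all(result == text for result in round_trips))
```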
A Brief History of the Chinese Character Set Standards
“GB” means “Guo Biao” (国标) in Chinese, and it is the short form
of “Guojia Biaozhun” (国家标准),
“National Standard”. All character set standards issued by the People’s Republic
of China (PRC) begin with the designation “GB”.
In 1993, the PRC published GB 13000.1-93, a Chinese national standard that is
code- and character-compatible with ISO 10646-1/Unicode 2.1, thus signaling
their intention to support the
efforts of the International Organization for Standardization (ISO)/
International Electrotechnical Commission (IEC) and the Unicode Consortium. GB
13000 was updated as required to maintain compatibility with the standards of
the ISO/IEC and the Unicode Consortium.
When GB 13000 was implemented, there was already a widely used Chinese national standard
for “Chinese simplified” characters.
This standard, named GB
2312-1980, consists of 7,445 characters, including 6,763 Hanzi
and 682 non-Hanzi characters.
Therefore, to maintain compatibility
with this standard, another character set was created: "GBK",
short for “Guojia biaozhun kuozhan”. The official title of this
character set is “Hanzi neima kuozhan guifan” (汉字内码扩展规范), or
“Rules/Specifications defining the extensions of internal codes for Chinese
characters (ideograms)”.
In addition to maintaining compatibility with the existing GB 2312-1980
character set, the purpose of the GBK character set was to make available the complete
Unicode Unified Han character set.
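The relationship between GB 2312-1980 and GBK can be probed with a short sketch. It uses Python's `gb2312` and `gbk` codecs as stand-ins for the two character sets, and it assumes that the traditional form “漢” is absent from GB 2312-1980; the helper `can_encode` is hypothetical.

```python
# Sketch: GBK keeps GB 2312's byte values and adds characters GB 2312 lacks.
def can_encode(char, encoding):
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

simplified = "汉"   # simplified form, part of GB 2312-1980
traditional = "漢"  # traditional form, assumed absent from GB 2312-1980

print(6763 + 682)  # Hanzi plus non-Hanzi characters in GB 2312-1980
print(simplified.encode("gb2312") == simplified.encode("gbk"))
print(can_encode(traditional, "gb2312"), can_encode(traditional, "gbk"))
```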
 |
Han characters are the Chinese ideograms, or characters whose origin can
be traced back to pictographic characters. |
 |
“Han Unification” (Unihan) was the Unicode
Consortium’s effort to consolidate the Chinese, Japanese and Korean (CJK)
versions of Han characters in a common code set by eliminating duplication.
As defined in the Unicode Standard, Unified CJK is a range of 20,902 ideographic
characters shared by Chinese, Japanese and Korean. |
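The figure of 20,902 characters corresponds to a contiguous range of Unicode numbers. The sketch below assumes the range runs from U+4E00 to U+9FA5; these end points are supplied for this example and are not quoted from the text above. The helper `is_unified_cjk` is hypothetical.

```python
# Sketch: the Unified CJK range expressed as Unicode numbers.
# FIRST and LAST are assumptions of this example (U+4E00 and U+9FA5).
FIRST, LAST = 0x4E00, 0x9FA5

def is_unified_cjk(char):
    """Return True if char falls inside the assumed Unified CJK range."""
    return FIRST <= ord(char) <= LAST

print(LAST - FIRST + 1)  # number of code positions in the range
print(is_unified_cjk("汉"), is_unified_cjk("漢"), is_unified_cjk("A"))
```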
What is GB 18030-2000?
GB 18030-2000 can be thought of as the Chinese counterpart of UTF-8: a variable-width
encoding, using one, two or four bytes per character, that covers the Unicode character set. It is also the new
“compulsory” Chinese national standard. It was published by the China Standard
Press, Beijing, with the official title “Information technology - Chinese ideograms coded character set for information
interchange - Extension for the basic set” on
March 17, 2000, and updated on November 20, 2000. As of September 1, 2001, all
computer operating systems sold in China must support this character set.
GB 18030-2000 was designed because the code space defined in GBK was almost full,
so it could not accommodate a major addition like Unihan Extension A, which
defines 6,582 additional characters in Unicode 3.0. GB 18030-2000 combines Unihan
Extension A (Unicode 3.0's extended CJK character set) with previous
Chinese national standards. It is fully backward-compatible with GB 2312-1980
and replaces (代替 daiti) GBK (the “specification”). It is also compatible with CN-GB
(the 8-bit character encoding used to write e-mails in Chinese simplified
characters).
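Both properties described above, the room for Unihan Extension A and the backward compatibility, can be checked with a short sketch. It uses Python's `gb18030` and `gb2312` codecs and assumes that Extension A occupies U+3400 to U+4DB5; that range is supplied for this example and is not quoted from the text above.

```python
# Sketch: Unihan Extension A and backward compatibility in GB 18030.
# The range U+3400-U+4DB5 is an assumption of this example.
EXT_A_FIRST, EXT_A_LAST = 0x3400, 0x4DB5
ext_a_count = EXT_A_LAST - EXT_A_FIRST + 1
print(ext_a_count)

ext_a_char = chr(EXT_A_FIRST)
ext_a_bytes = ext_a_char.encode("gb18030")
print(len(ext_a_bytes))  # bytes used for this Extension A character

# Characters from the older standard keep their byte values.
same_bytes = [ch.encode("gb2312") == ch.encode("gb18030") for ch in "A汉"]
print(same_bytes)
```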
Historically, the GB 18030-2000 standard is significant because it is the first widely used
character set that includes characters with Universal numerical codes (code
positions) that exceed 65,535, the largest number that can be
represented with two bytes of computer memory. In other words, GB 18030 includes characters that cannot be represented in any two-byte fixed-width (or
double-byte) character set.
The easiest way to support GB 18030-2000 is to develop Unicode-enabled products.
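A character whose Unicode number exceeds 65,535 can be encoded as follows. This is only a sketch using Python's `gb18030` codec; the sample character U+20000 is chosen for illustration and is not named in the text above.

```python
# Sketch: a character whose Unicode number exceeds 65,535.
char = "\U00020000"  # a CJK ideograph outside the first 65,536 code points

print(ord(char) > 65535)
encoded = char.encode("gb18030")
print(len(encoded))                       # more than two bytes are needed
print(encoded.decode("gb18030") == char)  # round trip through GB 18030
print(len(char.encode("utf-8")))          # UTF-8 also needs more than two bytes
```

Because the text is held as Unicode internally and converted only at the boundary, the same program can read and write GB 18030 data without handling its byte layout itself.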
References
http://www.nationmaster.com/encyclopedia/GB-18030
http://www.webopedia.com
CJKV Information Processing, by Ken Lunde
http://www.monotypeimaging.com
http://www-106.ibm.com/developerworks/library/u-china.html
http://www.cesi.ac.cn/default.htm
http://www.unicode.org/
http://www.cs.tut.fi/~jkorpela/chars.html