

 

The Global Advisor Newsletter -  Tips for improving the process and reducing the cost of website localization. Bringing Medical Devices to Market - Useful links. Celebrating notable anniversaries...

Features articles of interest on language translation and localization, culture, language technology and other related topics. The goal of the Global Advisor Newsletter is to inform and entertain.


Forty-Eighth Edition - Unicode Standard for Chinese Characters GB18030-2000

The People's Republic of China has established a character set standard for Simplified Chinese. This is GB18030-2000, a standard that specifies an extended codepage and a mapping table to Unicode.

(Note: Mainland China and Singapore use the Simplified Chinese script. Hong Kong, Taiwan and Macao use the Traditional Chinese script.)

Why is this important?

  1. The Chinese economy is currently the strongest market in the world, and China is a huge market for information technology products.
  2. The China Electronics Standardization Institute (CESI) is authorized by the Chinese Government to set standards for electronic information technology. CESI's objective is to become the authority on information technology standardization and compliance assessment.
  3. The CESI center responsible for the validation of information technology products evaluates all IT products, including Chinese operating systems, for compliance with the national standard, GB 18030-2000. Products that do not comply cannot be sold in mainland China. (For more information about CESI certification, please refer to http://www.cesi.ac.cn/en/MAIN%20BUSINESS/IT.htm.)

Characters and character sets - Sorting out the terminology

 
Character: A symbol, such as a letter of the alphabet, a digit or a simple musical note.

Character encoding: A mapping of a set of natural language characters to a set of octets (bytes). In the simplest form, each character in a repertoire is mapped to an integer in the range 0 - 255. For example, as defined by the Extended ASCII (DOS) character code, the sequence “162” represents an “ó”, “164” represents an “ñ” and “142” represents an “Ä”. (See table below.) Several different character codes support English, among them ASCII, ISO 8859-1, Windows Code Page 1252 and EBCDIC. (Note: ASCII stands for "American Standard Code for Information Interchange", EBCDIC for "Extended Binary Coded Decimal Interchange Code" and ISO for "International Organization for Standardization".)

Character code: Computers work with numbers, so in the computer world characters are represented with numbers. A character code is therefore a mapping of a set of natural language characters to unique numerical sequences, one per character.

Character set: A collection of characters, such as “a”, “b”, “ó”, “ñ”, “Ä”, and so on.

Charset: A term used to refer to an encoding, as in the name often found in the meta tags of HTML pages. For example: <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">, or <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">. (Note: Meta tags are HTML code placed in the header of a web page, enclosed in "<" and ">" brackets, that provides information that is not displayed in the browser.)
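The difference between a character and the number a particular character code assigns to it can be illustrated with a short Python sketch. It uses only codecs that ship with Python. Treating `cp437` as the "Extended ASCII (DOS)" code and `cp500` as a representative EBCDIC code page is an assumption made for this illustration, and `code_of` is a hypothetical helper name.

```python
# One character, several character codes: the number depends on the encoding.

def code_of(char: str, encoding: str) -> int:
    """Return the single-byte code that `encoding` assigns to `char`."""
    data = char.encode(encoding)
    if len(data) != 1:
        raise ValueError(f"{char!r} is not a single-byte character in {encoding}")
    return data[0]

# The letter "A" has the same code in ASCII, ISO 8859-1 and Windows-1252,
# but a different one in EBCDIC (cp500 is assumed here as the EBCDIC variant).
letter_a = {enc: code_of("A", enc) for enc in ("ascii", "iso8859-1", "cp1252", "cp500")}

# "\u00f3" is the letter o with an acute accent. Its code differs between
# DOS code page 437 and Windows-1252.
o_acute = {enc: code_of("\u00f3", enc) for enc in ("cp437", "cp1252")}

print(letter_a)  # {'ascii': 65, 'iso8859-1': 65, 'cp1252': 65, 'cp500': 193}
print(o_acute)   # {'cp437': 162, 'cp1252': 243}
```

The same character therefore has to be stored together with (or under an agreed) charset name, which is what the meta tag declares for a web page.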

Single, Double and Multi-Byte Character Sets

When computers “spoke” mainly English, 128 code positions or code elements were enough to represent all the letters, punctuation marks and control characters required to write English:
 

Character                                  Code positions
Capital letters (A-Z)                       26
Lowercase letters (a-z)                     26
Digits (0 to 9)                             10
Punctuation marks (. , + " ' and so on)     32
Space                                        1
Control characters (CR, Tab, LF, etc.)      33
Total                                      128
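The tally in the table can be checked against Python's own view of the 128 ASCII code positions. This is only a sketch: it assumes that the table's "punctuation marks" correspond to `string.punctuation` and that the control characters are codes 0 to 31 plus 127 (DEL).

```python
import string

# Each class of ASCII character, as Python defines it.
classes = {
    "capital letters": list(string.ascii_uppercase),
    "lowercase letters": list(string.ascii_lowercase),
    "digits": list(string.digits),
    "punctuation marks": list(string.punctuation),
    "space": [" "],
    # Assumption: control characters are codes 0-31 plus DEL (127).
    "control characters": [chr(n) for n in range(128) if n < 32 or n == 127],
}

counts = {name: len(chars) for name, chars in classes.items()}
total = sum(counts.values())

# Every one of the 128 code positions is used exactly once.
all_chars = {ch for chars in classes.values() for ch in chars}

print(counts)
print(total, len(all_chars))  # 128 128
```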

 
Single and Double Byte Character Sets

A byte is the amount of storage needed for one character in a single-byte character set. Also known as an octet, a byte consists of 8 bits and can represent numeric values ranging from 0 (zero) to 255.

Bit stands for "binary digit", the smallest unit of information in a computer. A bit holds either a 1 or a 0. To carry more meaningful information, bits are combined into larger units such as the byte, which is made up of 8 consecutive bits.

The number of unique characters in a natural language determines the amount of storage needed for each character in a character set. For example:

Single Byte Character Sets

A single byte can store values in the range from 0 to 255; therefore, a single byte can identify 256 unique characters. Since most European languages have fewer than 256 characters, their character sets consist of single-byte characters, and applications written for European languages assume that each character requires one byte of storage.

The ASCII code set (see table below) is a 7-bit system, i.e., the 128 configurations listed in the "Standard ASCII" column of the table below require 7 bits (2^7 = 128). However, since computers manipulate data in 8-bit bytes, and the number of symbols required by several languages exceeds the available 128 characters, many extensions use the 128 remaining available codes by using all 8 bits of each byte (2^8 = 256), as shown in the "Extended ASCII (DOS)" column of the table.
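The arithmetic behind the 7-bit and 8-bit limits is easy to reproduce. The following sketch (with a hypothetical helper, `code_positions`) computes the number of code positions for a given number of bits and shows which codes need the eighth bit.

```python
def code_positions(bits: int) -> int:
    """Number of distinct values that `bits` binary digits can represent."""
    return 2 ** bits

seven_bit = code_positions(7)   # Standard ASCII
eight_bit = code_positions(8)   # one full byte

# "A" (code 65) fits in 7 bits, so the top bit of its byte is 0.
# Code 162 lies in the upper half of the byte range and needs the eighth bit.
bits_a = format(65, "08b")
bits_162 = format(162, "08b")

print(seven_bit, eight_bit)  # 128 256
print(bits_a, bits_162)      # 01000001 10100010
```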


 

Double and Multi-byte Character Sets

Languages such as Chinese, Japanese and Korean have thousands of characters. For example, the Chinese script includes more than ten thousand basic ideographic characters known as “hantsu” or “hanzi” in Chinese, “kanji” in Japanese and “hanja” in Korean. Obviously, 256 characters or code positions are not enough to handle these languages.

The solution: Use double-byte or 2-byte values to expand the code set. Some operating systems, such as UNIX, use as many as 4 bytes to represent these characters. Character sets that mix single-byte characters with 2-, 3- and even 4-byte characters are “Multibyte Character Sets”.

(Note: The Japanese language presented a particular challenge to encoders because it mixes single- and double-byte characters. The Japanese “katakana”, a phonetic syllabary often used to represent foreign words, can be represented by single-byte characters, but kanji (the Chinese ideograms) are two-byte characters. A syllabary is a set of written characters, each of which represents a syllable, as distinguished from an alphabet. Reference: Merriam Webster Unabridged Dictionary.)
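How single- and double-byte characters mix in practice can be seen by counting the bytes each character takes. The sketch below assumes Python's built-in `shift_jis` and `gb2312` codecs as examples of a Japanese and a Simplified Chinese multibyte character set; the text above does not name specific encodings.

```python
# (character, encoding) pairs; escapes are used so the source stays ASCII-safe.
samples = {
    "latin letter A": ("A", "shift_jis"),
    "half-width katakana": ("\uff71", "shift_jis"),
    "kanji": ("\u6f22", "shift_jis"),
    "simplified hanzi": ("\u6c49", "gb2312"),
}

# Number of bytes needed to store each character.
byte_lengths = {name: len(ch.encode(enc)) for name, (ch, enc) in samples.items()}

# A mixed string: 3 Latin letters + 2 kanji = 5 characters but 7 bytes.
mixed = "ABC\u6f22\u5b57"
mixed_chars = len(mixed)
mixed_bytes = len(mixed.encode("shift_jis"))

print(byte_lengths)
print(mixed_chars, mixed_bytes)  # 5 7
```

Because the number of characters and the number of bytes no longer agree, software that assumes one byte per character miscounts or splits such text.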
 

Input Methods

Originally, computer keyboards were designed for 7-bit ASCII. They included all the keys necessary to write English. Therefore, if you are using a standard English keyboard, you will notice that you do not have the keys to input the accented and other special characters required to write other languages, even those that use the Latin alphabet, such as French or Spanish. And you definitely do not have sufficient keys to input the thousands of characters required for CJKV (Chinese-Japanese-Korean-Vietnamese).

For European languages, international keyboard drivers are available upon installation of Microsoft applications to facilitate the input of the appropriate characters for a particular language. If you want to write in Spanish using your standard English keyboard, you can also press and hold the Alt key while you key in the numeric sequence that corresponds to a particular character of a character set. For example, as per the Extended ASCII codes in the example above, pressing and holding the Alt key and keying in the sequence 1-6-2 inputs an "ó" (an o with an acute accent).
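The Alt-key sequences are simply code numbers looked up in a character code. The sketch below mimics that lookup in Python, assuming the Extended ASCII (DOS) codes correspond to Python's `cp437` codec; `alt_code_to_char` is a hypothetical helper, not part of any keyboard driver.

```python
def alt_code_to_char(code: int, code_page: str = "cp437") -> str:
    """Return the character that a single-byte code stands for in `code_page`."""
    if not 0 <= code <= 255:
        raise ValueError("a single-byte code must be in the range 0-255")
    return bytes([code]).decode(code_page)

# The three codes mentioned earlier in the article: 162, 164 and 142.
typed = "".join(alt_code_to_char(n) for n in (162, 164, 142))

print(typed)  # óñÄ
```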

It is also possible to use your standard English keyboard to input Asian characters, but it requires two steps. First, the user types the reading of a character using Latin letters, and a conversion dictionary displays a list of candidate characters. Then, the user selects the most appropriate choice or requests more choices.

The following article on the Microsoft website explains how this is done and what is required: http://www.microsoft.com/windows/ie/downloads/recommended/ime/default.mspx
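The two-step input process can be sketched as a dictionary lookup followed by a selection. Everything in this example is hypothetical: the tiny `CANDIDATES` dictionary, the function names and the page size are illustrative only, and a real input method editor uses a far larger dictionary and ranks candidates by context.

```python
from typing import List

# Hypothetical, tiny conversion dictionary: Latin-letter reading -> candidates.
CANDIDATES = {
    "han": ["\u6c49", "\u97e9", "\u5bd2", "\u542b"],   # 汉 韩 寒 含
    "zi": ["\u5b57", "\u5b50", "\u81ea"],              # 字 子 自
}

def lookup(reading: str, page: int = 0, page_size: int = 3) -> List[str]:
    """Step 1: return one page of candidate characters for a reading."""
    found = CANDIDATES.get(reading.lower(), [])
    start = page * page_size
    return found[start:start + page_size]

def select(reading: str, choice: int, page: int = 0, page_size: int = 3) -> str:
    """Step 2: pick a candidate from the page that is currently displayed."""
    return lookup(reading, page, page_size)[choice]

first_page = lookup("han")            # the first three candidates
more_choices = lookup("han", page=1)  # the user asked for more choices
word = select("han", 0) + select("zi", 0)

print(first_page, more_choices, word)
```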

What is Unicode?

The Unicode Standard, and the availability of tools that support it, are among the most significant recent global software technology trends.

“Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.” (http://www.unicode.org/standard/WhatIsUnicode.html)

Before Unicode was invented, hundreds of different encoding systems were used to assign numbers to characters, and no single encoding could contain enough characters. For example, the European Union requires several different encodings to cover all its languages, and no single encoding was adequate for all the letters, punctuation and technical symbols in common use even for a single language like English. The proliferation of encoding systems caused problems, because two encodings can use the same number for different characters, or different numbers for the same character. Therefore, computers such as servers that need to support many different encodings run the risk of data corruption when data is passed between these different encodings or platforms. (Note: Platform refers to the underlying hardware or software for a system.)
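The kind of corruption described here is easy to reproduce. In this sketch a character is written under one encoding and read back under another; the choice of `cp437` and `cp1252` as the two conflicting encodings is an assumption made for the example.

```python
original = "\u00f3"                   # ó, an o with an acute accent

stored = original.encode("cp437")     # written by a program that uses code page 437
misread = stored.decode("cp1252")     # read by a program that assumes Windows-1252
recovered = stored.decode("cp437")    # read with the encoding it was written in

print(stored)     # b'\xa2'
print(misread)    # ¢  (the same number, 162, means a cent sign in Windows-1252)
print(recovered)  # ó
```

The byte itself never changed. Only the assumption about which character code it belongs to did, which is why the encoding must travel with the data.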

The Unicode Standard assigns a unique number to each character, regardless of platform, program or language, so data can be transported through many different systems without corruption. The Unicode Standard has been adopted by Apple, HP, IBM, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc. It is the official way to implement ISO/IEC 10646, and it is supported by many operating systems, all modern browsers and many other applications.
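In Python, where strings are sequences of Unicode characters, the unique number of a character is available through `ord`. The following sketch prints the numbers for three sample characters and checks that each survives a round trip through several Unicode encoding forms; the sample characters and the helper name `survives` are arbitrary choices for this illustration.

```python
samples = ["A", "\u00f3", "\u6c49"]   # A, ó, 汉

# The Unicode number (code point) of each character, in the usual U+ notation.
code_points = {ch: f"U+{ord(ch):04X}" for ch in samples}

def survives(ch: str) -> bool:
    """True if `ch` comes back unchanged from each Unicode encoding form."""
    forms = ("utf-8", "utf-16-le", "utf-32-le")
    return all(ch.encode(form).decode(form) == ch for form in forms)

round_trips = all(survives(ch) for ch in samples)

print(code_points)  # {'A': 'U+0041', 'ó': 'U+00F3', '汉': 'U+6C49'}
print(round_trips)  # True
```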

A Brief History of the Chinese Character Set Standards

“GB” stands for “Guo Biao” (国标) in Chinese, the short form of “Guojia Biaozhun” (国家标准), or “National Standard”. All character set standards issued by the People’s Republic of China (PRC) begin with the designation “GB”.

In 1993, the PRC published GB 13000.1-93, a Chinese national standard that is code- and character-compatible with ISO 10646-1/Unicode 2.1, thus signaling its intention to support the efforts of the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) and the Unicode Consortium. GB 13000 was updated as required to maintain compatibility with the standards of the ISO/IEC and the Unicode Consortium.

When GB 13000 was implemented, there was already a widely used Chinese national standard for Simplified Chinese characters. This standard, named GB 2312-1980, consists of 7,445 characters, including 6,763 Hanzi and 682 non-Hanzi characters. Therefore, to maintain compatibility with this standard, another character set was created: "GBK", short for “Guojia biaozhun kuozhan”. The official title of this character set is “Hanzi neima kuozhan guifan” (汉字内码扩展规范), or “Rules/Specifications defining the extensions of internal codes for Chinese characters (ideograms)”.

In addition to maintaining compatibility with the existing GB 2312-1980 character set, the purpose of the GBK character set was to make available the complete Unicode Unified Han character set.

Han characters are the Chinese ideograms, or characters whose origin can be traced back to pictographic characters.

“Han Unification” (Unihan) was the Unicode Consortium’s effort to consolidate the Chinese, Japanese and Korean (CJK) versions of Han characters in a common code set by eliminating duplication. As defined in the Unicode Standard, Unified CJK is a range of 20,902 ideographic characters shared by Chinese, Japanese and Korean.
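The relationship between GB 2312-1980 and GBK described above can be checked with Python's built-in `gb2312` and `gbk` codecs. This is a sketch under the assumption that those codecs follow the respective character sets; `in_charset` is a hypothetical helper.

```python
def in_charset(char: str, encoding: str) -> bool:
    """True if `char` can be represented in the given encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

simplified = "\u6c49"    # 汉, part of GB 2312-1980
traditional = "\u6f22"   # 漢, a Unified Han character outside GB 2312-1980

# Compatibility: a GB 2312 character keeps exactly the same bytes in GBK.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk")

# Extension: GBK adds Unified Han characters that GB 2312 lacks.
coverage = {
    "gb2312": in_charset(traditional, "gb2312"),
    "gbk": in_charset(traditional, "gbk"),
}

print(same_bytes)  # True
print(coverage)    # {'gb2312': False, 'gbk': True}
```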

What is GB 18030-2000?

GB 18030-2000 is the Chinese counterpart of UTF-8: like UTF-8, it is a multi-byte encoding of Unicode. It is also the new “compulsory” Chinese national standard. It was published by the China Standard Press, Beijing, with the official title “Information technology - Chinese ideograms coded character set for information interchange - Extension for the basic set” on March 17, 2000, and updated on November 20, 2000. As of September 1, 2001, all computer operating systems sold in China must support this character set.

GB 18030-2000 was designed because the code space defined in GBK was nearly full and could not accommodate a major addition such as Unihan Extension A, which defines 6,582 additional characters in Unicode 3.0. GB 18030-2000 combines Unihan Extension A (Unicode 3.0's extended CJK character set) with previous Chinese national standards. It is fully backward-compatible with GB 2312-1980 and replaces (代替 daiti) GBK (the “specification”). It is also compatible with CN-GB (the 8-bit character encoding used to write e-mails in Simplified Chinese characters).

Historically, the GB 18030-2000 standard is significant because it is the first widely used character set that includes characters with Universal numerical codes (code positions) that exceed 65,535, the largest number that can be represented with two bytes of computer memory. In other words, GB 18030 includes characters that cannot be represented in any fixed-width two-byte (double-byte) character set.
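These properties can be observed with Python's built-in `gb18030` codec. The sketch below encodes four sample characters and reports how many bytes each one takes; the sample characters are arbitrary picks (a Latin letter, a GB 2312 hanzi, the first Extension A character and a character whose code position exceeds 65,535).

```python
samples = {
    "latin letter": "A",
    "gb2312 hanzi": "\u6c49",            # 汉
    "extension a": "\u3400",             # first character of Unihan Extension A
    "beyond 65,535": "\U00020000",       # code position 131,072
}

# Bytes needed for each character in GB 18030.
byte_lengths = {name: len(ch.encode("gb18030")) for name, ch in samples.items()}

# Backward compatibility: the GB 2312 bytes are unchanged in GB 18030.
compatible = samples["gb2312 hanzi"].encode("gb18030") == samples["gb2312 hanzi"].encode("gb2312")

# A code position that does not fit in two bytes still round-trips.
big = samples["beyond 65,535"]
exceeds_two_bytes = ord(big) > 65535
round_trip = big.encode("gb18030").decode("gb18030") == big

print(byte_lengths)
print(compatible, exceeds_two_bytes, round_trip)  # True True True
```

The one-, two- and four-byte lengths show why GB 18030 is a multibyte character set rather than a double-byte one.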


The easiest way to support GB 18030-2000 is to develop Unicode-enabled products.
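One way to read this advice, offered here as a sketch rather than a prescribed design, is to keep all text as Unicode inside the product and to convert to or from GB 18030 only where data enters or leaves it. The function name `transcode` and its defaults are hypothetical.

```python
def transcode(data: bytes, source: str = "gb18030", target: str = "utf-8") -> bytes:
    """Decode bytes to Unicode text at the boundary, then re-encode for output."""
    text = data.decode(source)   # inside the product, text is plain Unicode
    return text.encode(target)

# 汉字 followed by a character whose code position exceeds 65,535.
text = "\u6c49\u5b57\U00020000"

incoming = text.encode("gb18030")                 # data as a GB 18030 system sends it
as_utf8 = transcode(incoming)                     # GB 18030 -> UTF-8
back = transcode(as_utf8, "utf-8", "gb18030")     # UTF-8 -> GB 18030

print(as_utf8 == text.encode("utf-8"))  # True
print(back == incoming)                 # True
```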
 

References

http://www.nationmaster.com/encyclopedia/GB-18030
http://www.webopedia.com
CJKV Information Processing, by Ken Lunde
http://www.monotypeimaging.com
http://www-106.ibm.com/developerworks/library/u-china.html
http://www.cesi.ac.cn/default.htm
http://www.unicode.org/
http://www.cs.tut.fi/~jkorpela/chars.html


Copyright 1996 - 2008 InterSol, Inc. All Rights Reserved