MySQL 5.1 Reference Manual :: 10 Character Set Support :: 10.1 Character Sets and Collations in General

  • MySQL 5.1 Reference Manual

  • 10 Character Set Support
  • 10.1 Character Sets and Collations in General
  • 10.2 Character Sets and Collations in MySQL
  • 10.3 Specifying Character Sets and Collations
  • 10.4 Connection Character Sets and Collations
  • 10.5 Collation Issues
  • 10.6 Operations Affected by Character Set Support
  • 10.7 Unicode Support
  • 10.8 UTF-8 for Metadata
  • 10.9 Character Sets and Collations That MySQL Supports

10.1. Character Sets and Collations in General

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: ‘A’, ‘B’, ‘a’, ‘b’. We give each letter a number: ‘A’ = 0, ‘B’ = 1, ‘a’ = 2, ‘b’ = 3. The letter ‘A’ is a symbol, the number 0 is the encoding for ‘A’, and the combination of all four letters and their encodings is a character set.

字母`A' 是一个符号(symbo)l,而数字0是 `A' 的 encoding(编码),而这四个字母和他们的编码合起来就是一个字符集(character set)。

Suppose that we want to compare two string values, ‘A’ and ‘B’. The simplest way to do this is to look at the encodings: 0 for ‘A’ and 1 for ‘B’. Because 0 is less than 1, we say ‘A’ is less than ‘B’. What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): “compare the encodings.” We call this simplest of all possible collations a binary collation.

假设我们有四个字母: ‘A’, ‘B’, ‘a’, ‘b’.我们给每个字母一个号码 : ‘A’ = 0, ‘B’ = 1, ‘a’ = 2, ‘b’ = 3. 假设我们要比较二个字符串 A(0) 就会小于 B(1)............ 之类
现在,我们就算已经对我们的字符集使用了一个collation,
collation 是一组规则(在这个例子里只有一条规则): "比较编码compare the encodings".我们把所有可能的 collation 中最简单的这种叫做binary collation(二進位排序)

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters ‘a’ and ‘b’ as equivalent to ‘A’ and ‘B’; (2) then compare the encodings. We call this a case-insensitive collation(不区分大小写的 collation). It's a little more complex than a binary collation.

In real life, most character sets have many characters: not just ‘A’ and ‘B’ but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German ‘Ö’), and for multiple-character mappings (such as the rule that ‘Ö’ = ‘OE’ in one of the two German collations).

在实际生活中,大多数字符集都包含很多字符:不是仅仅`A'和`B' 而是整个字母表.......有一点复杂....bla bla bla .......

中文字型collation(排序)由笔划多少决定 例如右表 : 周 林 都是8笔 , 再比第2字........ 由SELECT语句的ORDER BY和GROUP BY子句 排序

右表是在 big5,但我在 utf8 比较collation(排序)就怪怪的, You may try it





SELECT * FROM clascore ORDER BY IDnumber ; 是比较英文字 collation(排序)

MySQL can do these things for you(MySQL可以为你做以下事) :

  • Store strings using a variety of character sets
    使用各种字符集存储字符串strings

  • Compare strings using a variety of collations
    使用各种collation 比较字符串strings

  • Mix strings with different character sets or collations in the same server, the same database, or even the same table
    在同一台服务器server上或者同一个数据库database 甚至同一个表中table 使用不同的字符集和collation混合

  • Allow specification of character set and collation at any level
    允许在任何级别上指明字符集和collation

In these respects, MySQL is far ahead of most other database management systems. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.

在这些方面,MySQL 比其他database管理系统 先进很多。不过要想 有效的使用这些特性features,你需要知道 哪些字符集和collation是可用的(available),怎样把他们改成 默认(defaults)


User Comments

Posted by David L Lambert on December 28 2004 4:33pm[Delete] [Edit]

Note that a MySQL database is only one part of a system for storing data that includes extended characters, and your client tools and languages will also need to support the characters you want in your application. For instance, the MySQL comand line client [1] does not allow direct entry of latin-1 characters, let alone Unicode. Furthermore, it doesn't automatically detect the language-environment, so there are some cases where properly entered characters get displayed as soething else; for instance, in US Win95, the command-prompt boxes use CP437, and a lot of printers use that same character set by default.

The MySQL Control Center under MS Windows and Microsoft Access are pretty good about representing international characters faithfully. On RedHat Linux 9, the Control Center displays each non-ASCII character as two boxes (presuably because it internally converts to UTF8, but the X server expects latin1).

JDBC [2] does a good job of handling international characters, because Java uses Unicode internally and the MySQL driver picks the server character set automatically.

Perl supports Unicode internally, thus DBI and CGI [3] would seem like a good set of tools to get at multilingual data. However, DBD::MySQL doesn't actually do any conversion, so in perl 5.6 you might need to do something like

($name) = $dbh->selectrow_array( ... );
$name = pack("U*",unpack("C*",$name));
print "$name\n";

to get the characters in the proper format, or you could add an explicit

$dbh->do("SET character_set_results="utf8"');

before doing anything else.

Note also that CGI.pm (most recent version) sets the output encoding to iso-8859-1 by default, so a CGI script will need to convert back from UTF on output. Unicode::Lite might prove useful in some cases. Perl 5.8 [4] has a slightly different model for handling Unicode that might (untested) require a line similar to the following when the server has been set to respond with Unicode:

use Encode 'decode_utf8';
$name = decode_utf8($name);

References:

1. http://bugs.mysql.com/bug.php?id=7491
2. http://java.sun.com/j2se/1.4.2/docs/api/java/sql/package-summary.html
3. http://search.cpan.org/~lds/CGI.pm-3.05/CGI.pm
4. http://www.perldoc.com/perl5.8.4/pod/perluniintro.html

Add your own comment.

10.2. Character Sets and Collations in MySQL

The MySQL server can support multiple character sets. To list the available character sets, use the SHOW CHARACTER SET statement.
MySQL 服务器可支持多个字符集。要列出可用的字符集,使用SHOW CHARACTER SET 语句:
A partial listing follows. For more complete information, see Section 10.9, “Character Sets and Collations That MySQL Supports”.

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
...

Any given character set always has at least one collation. It may have several collations. To list the collations for a character set, use the SHOW COLLATION statement. For example, to see the collations for the latin1 (cp1252 West European) character set, use this statement to find those collation names that begin with latin1:

任何字符集至少有一个Collation, 它可以有很多Collation
要列出一个字符集的 collations , 使用 SHOW COLLATION 语句(statement). 例如, 要看latin1 (cp1252 West European'')的collations, 使用这个语句来找到哪些名字以latin1 开头的collation

mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci   | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci   | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci    | latin1  | 15 |         |          |       0 |
| latin1_german2_ci   | latin1  | 31 |         | Yes      |       2 |
| latin1_bin          | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci   | latin1  | 48 |         |          |       0 |
| latin1_general_cs   | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci   | latin1  | 94 |         |          |       0 |
+---------------------+---------+----+---------+----------+---------+

The latin1 collations have the following meanings:

Collation Meaning
latin1_german1_ci German DIN-1
latin1_swedish_ci Swedish/Finnish
latin1_danish_ci Danish/Norwegian
latin1_german2_ci German DIN-2
latin1_bin Binary according to latin1 encoding
latin1_general_ci Multilingual (Western European)
latin1_general_cs Multilingual (ISO Western European), case sensitive
latin1_spanish_ci Modern Spanish

Collations have these general characteristics:

  • Two different character sets cannot have the same collation.
    两个不同字符集没法拥有同一个collation

  • Each character set has one collation that is the default collation. For example, the default collation for latin1 is latin1_swedish_ci.
    The output for SHOW CHARACTER SET indicates which collation is the default for each displayed character set.
    每个字符集有一个默认 collation. 例如, latin1 的默认 collation 是 latin1_swedish_ci.

  • mysql> SHOW CHARACTER SET ;
    +----------+-----------------------------+---------------------+--------+
    | Charset  | Description                 | Default collation   | Maxlen |
    +----------+-----------------------------+---------------------+--------+
    | big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
    
    | latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
    | gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
    | greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
    | cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
    | gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
    | utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
    
  • There is a convention for collation names: They start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary).
    collation 的命名有个约定: 他们由所关联的字符集的名字打头,他们通常包含一个 语言名, 并以 _ci (case insensitive大小写不敏感), 或者 _cs (case sensitive大小写敏感), 或者 _bin (binary二进制).




real_vine@hotmail.com