A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters:
‘A
’,
‘B
’,
‘a
’,
‘b
’. We give each letter a number:
‘A
’ = 0,
‘B
’ = 1,
‘a
’ = 2,
‘b
’ = 3. The letter
‘A
’ is a symbol, the number 0 is
the encoding for
‘A
’, and the combination of all
four letters and their encodings is a
character set.
字母`A' 是一个符号(symbo)l,而数字0是 `A' 的 encoding(编码),而这四个字母和他们的编码合起来就是一个字符集(character set)。
Suppose that we want to compare two string values,
‘A
’ and
‘B
’. The simplest way to do this is
to look at the encodings: 0 for ‘A
’
and 1 for ‘B
’. Because 0 is less
than 1, we say ‘A
’ is less than
‘B
’. What we've just done is apply
a collation to our character set. The collation is a set of rules
(only one rule in this case): “compare the
encodings.” We call this simplest of all possible
collations a binary collation.
假设我们有四个字母:
‘A
’,
‘B
’,
‘a
’,
‘b
’.我们给每个字母一个号码 :
‘A
’ = 0,
‘B
’ = 1,
‘a
’ = 2,
‘b
’ = 3. 假设我们要比较二个字符串 A
(0) 就会小于 B
(1)............ 之类
现在,我们就算已经对我们的字符集使用了一个collation,
collation 是一组规则(在这个例子里只有一条规则):
"比较编码compare the
encodings".我们把所有可能的 collation 中最简单的这种叫做binary collation(二進位排序)
But what if we want to say that the lowercase and uppercase
letters are equivalent? Then we would have at least two rules: (1)
treat the lowercase letters ‘a
’ and
‘b
’ as equivalent to
‘A
’ and
‘B
’; (2) then compare the
encodings. We call this a case-insensitive
collation(不区分大小写的 collation). It's a little more complex than a binary collation.
In real life, most character sets have many characters: not just
‘A
’ and
‘B
’ but whole alphabets, sometimes
multiple alphabets or eastern writing systems with thousands of
characters, along with many special symbols and punctuation marks.
Also in real life, most collations have many rules, not just for
whether to distinguish lettercase, but also for whether to
distinguish accents (an “accent” is a mark attached
to a character as in German ‘Ö
’),
and for multiple-character mappings (such as the rule that
‘Ö
’ =
‘OE
’ in one of the two German
collations).
在实际生活中,大多数字符集都包含很多字符:不是仅仅`A'和`B' 而是整个字母表.......有一点复杂....bla bla bla .......
中文字型collation(排序)由笔划多少决定 例如右表 : 周 林 都是8笔 , 再比第2字........ 由SELECT语句的ORDER BY和GROUP BY子句 排序
右表是在 big5,但我在 utf8 比较collation(排序)就怪怪的, You may try it
SELECT * FROM clascore ORDER BY IDnumber ;
是比较英文字 collation(排序)
MySQL can do these things for you(MySQL可以为你做以下事) :
Store strings using a variety of character sets
使用各种字符集存储字符串strings
Compare strings using a variety of collations
使用各种collation 比较字符串strings
Mix strings with different character sets or collations in the
same server, the same database, or even the same table
在同一台服务器server上或者同一个数据库database 甚至同一个表中table 使用不同的字符集和collation混合
Allow specification of character set and collation at any
level
允许在任何级别上指明字符集和collation
In these respects, MySQL is far ahead of most other database management systems. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.
在这些方面,MySQL 比其他database管理系统 先进很多。不过要想 有效的使用这些特性features,你需要知道 哪些字符集和collation是可用的(available),怎样把他们改成 默认(defaults)
The MySQL server can support multiple character sets. To list the
available character sets, use the SHOW CHARACTER
SET
statement.
MySQL 服务器可支持多个字符集。要列出可用的字符集,使用SHOW CHARACTER
SET
语句:
A partial listing follows. For more
complete information, see Section 10.9, “Character Sets and Collations That MySQL Supports”.
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
...
Any given character set always has at least one collation. It may
have several collations. To list the collations for a character
set, use the SHOW COLLATION
statement. For
example, to see the collations for the latin1
(cp1252 West European) character set, use this statement to find
those collation names that begin with latin1
:
任何字符集至少有一个Collation, 它可以有很多Collation
要列出一个字符集的 collations , 使用 SHOW COLLATION
语句(statement). 例如, 要看latin1
(cp1252 West European'')的collations, 使用这个语句来找到哪些名字以latin1
开头的collation
mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 1 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+---------------------+---------+----+---------+----------+---------+
The latin1
collations have the following
meanings:
Collation | Meaning |
latin1_german1_ci |
German DIN-1 |
latin1_swedish_ci |
Swedish/Finnish |
latin1_danish_ci |
Danish/Norwegian |
latin1_german2_ci |
German DIN-2 |
latin1_bin |
Binary according to latin1 encoding |
latin1_general_ci |
Multilingual (Western European) |
latin1_general_cs |
Multilingual (ISO Western European), case sensitive |
latin1_spanish_ci |
Modern Spanish |
Collations have these general characteristics:
Two different character sets cannot have the same collation.
两个不同字符集没法拥有同一个collation
Each character set has one collation that is the
default collation. For example, the
default collation for latin1
is
latin1_swedish_ci
.
The output for
SHOW CHARACTER SET
indicates which
collation is the default for each displayed character set.
每个字符集有一个默认 collation. 例如, latin1 的默认 collation 是
latin1_swedish_ci.
mysql> SHOW CHARACTER SET ; +----------+-----------------------------+---------------------+--------+ | Charset | Description | Default collation | Maxlen | +----------+-----------------------------+---------------------+--------+ | big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 | | latin1 | cp1252 West European | latin1_swedish_ci | 1 | | gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 | | greek | ISO 8859-7 Greek | greek_general_ci | 1 | | cp1250 | Windows Central European | cp1250_general_ci | 1 | | gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 | | utf8 | UTF-8 Unicode | utf8_general_ci | 3 |
There is a convention for collation names: They start with the
name of the character set with which they are associated, they
usually include a language name, and they end with
_ci
(case insensitive),
_cs
(case sensitive), or
_bin
(binary).
collation 的命名有个约定: 他们由所关联的字符集的名字打头,他们通常包含一个
语言名, 并以 _ci (case insensitive大小写不敏感),
或者 _cs (case sensitive大小写敏感), 或者 _bin (binary二进制).
User Comments
Note that a MySQL database is only one part of a system for storing data that includes extended characters, and your client tools and languages will also need to support the characters you want in your application. For instance, the MySQL comand line client [1] does not allow direct entry of latin-1 characters, let alone Unicode. Furthermore, it doesn't automatically detect the language-environment, so there are some cases where properly entered characters get displayed as soething else; for instance, in US Win95, the command-prompt boxes use CP437, and a lot of printers use that same character set by default.
The MySQL Control Center under MS Windows and Microsoft Access are pretty good about representing international characters faithfully. On RedHat Linux 9, the Control Center displays each non-ASCII character as two boxes (presuably because it internally converts to UTF8, but the X server expects latin1).
JDBC [2] does a good job of handling international characters, because Java uses Unicode internally and the MySQL driver picks the server character set automatically.
Perl supports Unicode internally, thus DBI and CGI [3] would seem like a good set of tools to get at multilingual data. However, DBD::MySQL doesn't actually do any conversion, so in perl 5.6 you might need to do something like
($name) = $dbh->selectrow_array( ... );
$name = pack("U*",unpack("C*",$name));
print "$name\n";
to get the characters in the proper format, or you could add an explicit
$dbh->do("SET character_set_results="utf8"');
before doing anything else.
Note also that CGI.pm (most recent version) sets the output encoding to iso-8859-1 by default, so a CGI script will need to convert back from UTF on output. Unicode::Lite might prove useful in some cases. Perl 5.8 [4] has a slightly different model for handling Unicode that might (untested) require a line similar to the following when the server has been set to respond with Unicode:
use Encode 'decode_utf8';
$name = decode_utf8($name);
References:
1. http://bugs.mysql.com/bug.php?id=7491
2. http://java.sun.com/j2se/1.4.2/docs/api/java/sql/package-summary.html
3. http://search.cpan.org/~lds/CGI.pm-3.05/CGI.pm
4. http://www.perldoc.com/perl5.8.4/pod/perluniintro.html
Add your own comment.