Character Encoding

There’s a lot to learn about character encoding. The good news is that unless you really want to, you don’t need to bother. If you’re looking for a recommendation, just use UTF-8. It’s rapidly becoming the new ASCII because it’s reasonably efficient, supported by everything, and can comfortably handle transmitting every character a modern font might support.
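To make "reasonably efficient" a little more concrete, here is a small Python sketch (the sample characters are my own picks, not from any particular dataset). It shows that ASCII characters stay at one byte each in UTF-8, while other characters take two to four bytes:

```python
# Sample characters written as escapes so the file's own encoding cannot change them.
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s): {encoded.hex(' ')}")

# ASCII text is byte-for-byte identical in UTF-8 and in ASCII.
ascii_compatible = "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")
print("ASCII compatible:", ascii_compatible)
```

That byte-for-byte compatibility with ASCII is a large part of why old tools tend to cope with UTF-8 files without modification.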

Your goal should be to do absolutely everything using this encoding. That applies to source-code files you save on your personal machine that might have literal text in them, configuration properties that tell your database how to sort strings, you name it. Rather than a long explanation, this will be a quick reference to tell you how to set this in many of the places you’ll need to:

HTML

Web pages

Assuming that you’re correctly authoring your webpages (static or dynamic) in UTF-8, you need to tell the world to expect that encoding. The best way to do that is to include an HTTP header such as:

Content-Type: text/html; charset=UTF-8

The first part of the Content-Type should reflect what you’re actually sending, which tells the receiver how to use the file; the second part tells it how to read the file. You can also declare the encoding within the file, for example in a meta tag, but that only works with XML or HTML files and will break the first time you find yourself needing to push something like a CSV, so get the headers right and don’t worry about it again.
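As a rough illustration of the CSV case, here is a minimal WSGI sketch in Python using only the standard library. The function name csv_app and the sample rows are made up for the example; the point is simply that the body is encoded as UTF-8 and the header says so:

```python
import csv
import io


def csv_app(environ, start_response):
    """Minimal WSGI app that serves a small CSV file encoded as UTF-8."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "city"])
    writer.writerow(["Zo\u00eb", "Krak\u00f3w"])  # non-ASCII sample data

    # Encode the body explicitly so it matches the charset named in the header.
    body = buffer.getvalue().encode("utf-8")
    headers = [
        ("Content-Type", "text/csv; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, something like wsgiref.simple_server.make_server("", 8000, csv_app).serve_forever() should be enough. A CSV file has no place for a meta tag, so the header is the only thing telling the client how to decode it.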

Emails

Emails carry headers as well. The appropriate header to set is:

Content-Type: text/html; charset="UTF-8"

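Unlike an HTML meta tag, this header is read before the body is interpreted, and each MIME part carries its own Content-Type, so a plain-text version can be declared as UTF-8 in the same way. Non-ASCII subject lines are handled separately: header values are sent as RFC 2047 encoded-words, which most mail libraries generate for you.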
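Here is a hedged sketch of what that looks like with Python’s standard email package. The addresses and message text are placeholders, and a real application would hand the result to smtplib or a mail service rather than just building it:

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"      # placeholder address
msg["To"] = "recipient@example.com"     # placeholder address
msg["Subject"] = "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"  # non-ASCII subject

# Plain-text part first, then an HTML alternative; both are declared as UTF-8.
msg.set_content("Cze\u015b\u0107! This is the plain-text version.", charset="utf-8")
msg.add_alternative(
    "<p>Cze\u015b\u0107! This is the <b>HTML</b> version.</p>",
    subtype="html",
    charset="utf-8",
)

# Serialising the message encodes the subject as an RFC 2047 encoded-word.
raw = msg.as_bytes()
html_part = msg.get_body(preferencelist=("html",))
text_part = msg.get_body(preferencelist=("plain",))
print(html_part["Content-Type"])
```

Each part ends up with its own Content-Type header carrying the charset, which is why the text version is covered too.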

Databases

MySQL

If you have access to your my.cnf file, set the following parameters:

[client]
default-character-set=utf8mb4

[mysqld]
# MySQL's "utf8" is a 3-byte subset; utf8mb4 is full UTF-8.
# default-character-set and default-collation are no longer accepted
# under [mysqld] (removed in MySQL 5.5), so only the -server options remain.
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

Otherwise, when you create your database, use the following syntax:

CREATE DATABASE db_name
 DEFAULT CHARACTER SET utf8mb4
 DEFAULT COLLATE utf8mb4_unicode_ci
 ;
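One caveat worth knowing: in MySQL the name utf8 is an alias for utf8mb3, which stores at most three bytes per character, so characters outside the Basic Multilingual Plane (most emoji, for instance) need utf8mb4 instead. The following Python sketch uses a hypothetical helper, fits_in_utf8mb3, to check whether a string would survive a three-byte column:

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character encodes to at most three UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Accented Latin letters and the euro sign fit in three bytes or fewer.
print(fits_in_utf8mb3("Krak\u00f3w \u20ac"))

# U+1F44D (thumbs up) needs four bytes, so it does not fit.
print(fits_in_utf8mb3("nice \U0001F44D"))
```

If the second case matters to you, utf8mb4 with a utf8mb4_* collation is the safer choice.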

Postgres

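When you create your database, use the following syntax:

-- template0 avoids an error when template1 uses a different encoding or locale
CREATE DATABASE db_name
 WITH ENCODING 'UTF8'
 LC_COLLATE = 'en_US.UTF-8'
 LC_CTYPE = 'en_US.UTF-8'
 TEMPLATE template0
 ;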

Oracle

When you create your database, specify the AL32UTF8 character set:

CREATE DATABASE db_name
 CHARACTER SET AL32UTF8
 NATIONAL CHARACTER SET AL16UTF16
 ;

SQL Server

SQL Server natively uses UTF-16 for its Unicode types (nchar, nvarchar). If you’re using a Microsoft stack, you’re probably running in UTF-16 anyway, so at least you won’t have to worry about communicating with the database. If you’re not, make sure that you convert to and from UTF-16 in your DB wrapper, whichever one you choose.
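Most drivers do this conversion for you, but if you ever need to do it by hand, it amounts to a decode followed by an encode. A minimal Python sketch, with helper names that are my own and an assumption that the UTF-16 side is little-endian without a byte-order mark:

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16 little-endian (no byte-order mark)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 little-endian bytes as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


original = "Gr\u00fc\u00dfe \U0001F600".encode("utf-8")
as_utf16 = utf8_to_utf16le(original)
round_tripped = utf16le_to_utf8(as_utf16)
print(len(original), len(as_utf16), round_tripped == original)
```

Both encodings cover the whole of Unicode, so the round trip is lossless; only the byte layout changes.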

Couchbase

The bad news is that Couchbase doesn’t support different collation or encoding settings. The good news is that it uses UTF-8 as its fixed default, so everything’s fine here.

MongoDB

MongoDB uses UTF-8 as its native representation too, as do most other modern platforms.

Operating Systems

Linux

Make sure that these variables are set in any environment (interactive or automatic):

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
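For the character encoding, the lookup order is LC_ALL, then LC_CTYPE, then LANG (LANGUAGE only influences which message translations are chosen). The Python sketch below mimics that precedence with a hypothetical helper, effective_codeset, so you can sanity-check an environment; it is a simplification and not how libc actually resolves locales:

```python
from typing import Mapping, Optional


def effective_codeset(env: Mapping[str, str]) -> Optional[str]:
    """Return the codeset named by the first non-empty of LC_ALL, LC_CTYPE, LANG."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            # Locale names look like language_TERRITORY.codeset@modifier.
            _, dot, rest = value.partition(".")
            if not dot:
                return None  # e.g. "C" or "en_US": no codeset given
            return rest.split("@", 1)[0]
    return None


print(effective_codeset({"LANG": "en_US.UTF-8"}))
print(effective_codeset({"LC_ALL": "C", "LANG": "en_US.UTF-8"}))
```

Passing os.environ to the helper checks the current process. The second call shows why a stray LC_ALL=C in a cron job or init script can silently override an otherwise correct LANG.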

Windows

Windows and its standard libraries inherently “think” in UTF-16. Make sure that every text editing application you use is manually set to expect (and write) UTF-8. Expect to change everything you use for development (IDEs, app servers, etc.) and be suspicious if you can’t find a way to do so.
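The same advice applies to your own code: name the encoding whenever you open a text file instead of relying on the platform default, which on Windows is often a legacy code page rather than UTF-8. A small Python sketch, with a made-up file name and sample text:

```python
import os
import tempfile

text = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Always name the encoding; the platform default may not be UTF-8.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

    with open(path, "r", encoding="utf-8") as handle:
        round_tripped = handle.read()

    # Reading in binary mode shows the exact bytes that hit the disk.
    with open(path, "rb") as handle:
        raw_bytes = handle.read()

print(round_tripped == text, raw_bytes == text.encode("utf-8"))
```

With the encoding spelled out, the file contents are the same bytes on every operating system.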

OSX

Go to Terminal > Preferences > Advanced and make sure that UTF-8 is set as your character encoding. Also make sure that “Set locale environment variables on startup” is checked.

Also, many applications seem to follow the Safari default settings, so go to Safari > Advanced > Default Encoding and make sure that UTF-8 is set there as well.

bash

If you use bash, add the following to your ~/.inputrc file:

set meta-flag on
set input-meta on
set convert-meta off
set output-meta on

App Servers

Apache

Add the following to your httpd.conf file:

AddDefaultCharset utf-8

JBoss/Tomcat

Add this as an attribute to the connector element in your server.xml file:

URIEncoding="UTF-8"

PHP

Add this to your php.ini file:

default_charset = "utf-8"

.NET

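Set the following in your Web.config file:

  <system.web>
    <globalization requestEncoding="utf-8" responseEncoding="utf-8" fileEncoding="utf-8" />
  </system.web>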

Ruby on Rails

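Ruby 2.0 and above default to UTF-8 for source files (on Ruby 1.9, add a # encoding: utf-8 magic comment at the top of each file), and Rails sets config.encoding to UTF-8 by default. You’re good to go!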

Node.js

By default, it appears that Node.js and Express communicate externally as UTF-8. JavaScript engines expose strings as sequences of UTF-16 code units, at least if they’re following the specification, so additional care should be taken when passing data to and from other sources such as your database.
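The practical consequence is that string lengths and offsets can disagree between layers. The Python sketch below (with a helper name of my own, utf16_length) compares the number of code points, the number of UTF-16 code units that JavaScript’s String.length would report, and the number of UTF-8 bytes for the same text:

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's String.length reports."""
    return len(text.encode("utf-16-le")) // 2


samples = ["abc", "\u00e9", "\U0001F600"]

for s in samples:
    code_points = len(s)
    code_units = utf16_length(s)
    utf8_bytes = len(s.encode("utf-8"))
    print(f"{s!r}: {code_points} code point(s), {code_units} UTF-16 unit(s), {utf8_bytes} UTF-8 byte(s)")
```

A column limit expressed in characters or bytes on the database side may therefore not match what a length check in JavaScript sees, especially for emoji and other characters outside the Basic Multilingual Plane.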

Development Tools

Hudson

In Hudson’s /configure page, go to the Global Properties section and check the Environment Variables checkbox. Add a variable pair with the name JAVA_TOOL_OPTIONS and the value -Dfile.encoding=UTF-8.

Eclipse

In the configuration properties dialog, go to General > Workspace and set the Text file encoding to Other: UTF-8.

IntelliJ IDEA

In the settings dialog, go to Template Project Settings > File Encodings and set the IDE encoding to UTF-8.
