What is ICU? Our guide for ICU message formatting and syntax.

Kinga Pomykała
Last updated: July 04, 2024 · 18 min read
ICU message format is one of the most widely used formats for translatable messages in software localization. It is a syntax for writing messages that contain variable elements, such as numbers, dates, plural forms, and selectable options, so that they can be displayed correctly in every language. It improves the i18n process and helps to keep translations accurate and maintainable in multilingual software.

In this blog post, we will look at what ICU is, why it matters, and how it is used in practice. We will go through ICU formatting of different elements, like numbers, plurals, dates and times, and explain what a locale is. Knowing how ICU formatting works will help you understand how to use it and how it can improve your software localization workflow.

What is ICU?

International Components for Unicode (ICU) is an open source, cross-platform set of libraries that provides Unicode and Globalization support for software applications. It is widely used in many applications to provide support for the internationalization of text, numbers, dates, times, currencies, and other locale-sensitive data.

ICU was born out of a need to provide the same level of internationalization and localization support across programming languages, initially for Java, then for C++ and C. Its roots go back to the company Taligent, whose internationalization team later became part of IBM as the Unicode group. The project was released as open source in 1999 under the name IBM Classes for Unicode, and was later renamed to International Components for Unicode (ICU). The purpose of creating ICU was to address the challenges associated with developing software applications for different languages, cultures, and regions.

It has since become an important tool for developers looking to create localized versions of their applications. ICU has been used by many major companies such as Google, Microsoft, Amazon, Apple, Oracle, Adobe, and many others.

How does it work?

ICU provides an extensive open-source set of libraries, APIs and tools to help developers convert, process, search and compare strings in different languages.

ICU is built on top of the Unicode standard, so it can handle all the languages and scripts that Unicode covers. The goals of the ICU project are to provide libraries for processing Unicode text, and to provide locale data that other software can use when dealing with Unicode text.

Unicode is a universal character encoding system that allows computers to display and store text in multiple languages. It is the foundation of most modern computing systems, making it possible to use different languages on the same computer. By combining the power of Unicode with the convenience of ICU, developers can create applications that are capable of working with virtually any language or script imaginable.

You can download ICU from its GitHub repository and learn how to use it from the ICU documentation.

What issues does it cover?

ICU helps developers to easily handle complex text processing tasks such as sorting, searching and formatting in multiple languages. It also provides APIs to convert between different character encodings and provides support for various writing systems, including right-to-left scripts like Arabic and Hebrew. With ICU, developers can quickly develop applications that can work with different languages without having to worry about the language-specific issues.

ICU covers a wide range of text- and language-related issues, such as:

  • date and time formatting
  • number formatting
  • currency formatting
  • pluralization
  • gender-specific forms
  • custom variables
  • sorting and searching text
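
As a small illustration of locale-aware sorting, here is a sketch that uses the JavaScript Intl API, which in Node.js and the major browsers is typically implemented on top of ICU. The word list is made up for this example, and the snippet shows the general idea rather than the ICU C++ or Java API itself.

```typescript
// Sort the same words using the collation rules of two different locales.
const words: string[] = ["zebra", "äpple", "apple"];

// German treats "ä" as a variant of "a".
const german = words.slice().sort(new Intl.Collator("de").compare);

// Swedish treats "ä" as a separate letter that comes after "z".
const swedish = words.slice().sort(new Intl.Collator("sv").compare);

console.log(german.join(", "));
console.log(swedish.join(", "));
```

The same input produces a different order for each locale, which is why sorting should be done with a locale-aware collator rather than by comparing raw character codes.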

What is Locale?

A locale is a set of parameters that defines the user's language, region, and cultural conventions. It is used to format and display text, dates, numbers, and other data in a way that is appropriate for a specific region or culture. For example, the date format and currency symbol used in the United States are different from those used in Spain.

A Locale object in ICU contains information such as the language, country, and variant codes, as well as data for formatting numbers, dates, and other data. It also provides access to ICU's services for text and data internationalization and localization, such as message formatting, text boundary analysis, and collation (string sorting).
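
To make the idea of a locale identifier more concrete, here is a minimal sketch using the JavaScript Intl.Locale class. It is not the ICU Locale class itself, only a comparable built-in API. Note that ICU locale IDs are usually written with underscores (for example pl_PL), while the tags below use the hyphenated BCP 47 form.

```typescript
// Split locale identifiers into their components.
const polishLocale = new Intl.Locale("pl-PL");
const serbianLatin = new Intl.Locale("sr-Latn-RS");

console.log(polishLocale.language); // language code
console.log(polishLocale.region);   // country/region code
console.log(serbianLatin.script);   // optional script code
```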

For more details, check the ICU documentation for Locale, which lists all the relevant standards and explains the Locale concept and its usage.

ICU Messages Formatting

Messages are texts that are visible to users in the final product. Often they consist not only of fixed text but also of variable elements, like names, dates, or numbers. Such a variable can appear in different places in the text depending on the language. Instead of splitting the content into separate fragments, the variable stays inside a single translatable string. This way the translation process is consistent, and it is easier for translators to understand the full content of the translated text.

The ICU MessageFormat class offers a pattern syntax for messages with variable elements (called arguments) that are placed in curly braces { }. Each argument can include formatting details that customize how its value is displayed. If none are specified, the default format is used.

ICU variables

In ICU messages, variables (arguments) are placeholders that are replaced with actual values, such as names, numbers, or dates, when the message is displayed. They can be customized with the MessageFormat class, with the goal of streamlining the translation process.

If we need to translate a simple message without any variables, the translation process is simple: we translate strings without any special syntax. But we can also add information that changes under different conditions by inserting it using a variable.

However, as grammar and syntax differ between languages, simple variable substitution may not be enough. Let's take an example of a text with a variable element that a translator needs to translate. A text which looks simple at first sight becomes problematic, as there is no place to specify plural forms for the number of rooms.

// Message
"Hi, thank you for booking {roomsNumber} rooms with us!"

// Result
"Hi, thank you for booking 1 rooms with us!"
// or
"Hi, thank you for booking 3 rooms with us!"

This message is correct only when more than one room is booked. It is wrong if someone books just 1 room, and it can be even more complicated in other languages.

To make all texts and variable insertions display correctly for different languages, we use ICU message formatting. It allows you to specify the format for dates, time, numbers, and others to comply with the language rules and user's country.

Below you can see an example of a message with a variable that specifies the ICU format for plurals and thus covers all translation options of our text.

// Message
"Hi, thank you for booking {bookedRooms, plural, one {# room} other {# rooms}} with us!"

// Result
"Hi, thank you for booking 1 room with us!"
// or
"Hi, thank you for booking 3 rooms with us!"
// etc.

Plurals

Pluralization rules vary depending on the language and its grammar. The rules are different for English, Polish, or Arabic, that is, for different locales. For example, English has the "one" and "other" forms, while other languages have additional forms (such as "few" or "many") to cover all number cases. You can find out more about each language's rules in the Unicode CLDR specs.

ICU message format utilizes a plural argument to choose sub-messages based on a numerical value, along with the language's plural rules. This helps in efficiently selecting the right messages for a specified language.

The other form is always required in ICU plural formatting.

The syntax for plural arguments is {variable, plural, forms}, where forms is one or more plural forms for the phrase. Inside a form, the # symbol is replaced with the formatted number. The forms will differ depending on the language. Let's take an example of the same content in English and Polish. As those two languages have different pluralization rules, the forms are different too.

// Message EN
"You have booked {bookedRooms, plural, "
                                    "one {one room}"
                                    "other {# rooms}} "
                                    "for {stayDate, date, medium}"
// Message PL
"Zarezerwowałeś {bookedRooms, plural, "
                                    "one {jeden pokój}"
                                    "few {# pokoje}"
                                    "many {# pokoi}"
                                    "other {# pokoju}} "
                                    "na {stayDate, date, medium}"

This formatting allows translators to cover all plural forms for a given language and thus provide more accurate translations.
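
The way a plural form is chosen can be sketched with the JavaScript Intl.PluralRules class, which exposes the CLDR plural categories. The pluralize helper and the form tables below are hypothetical names made up for this example, and the code only imitates what an ICU message formatter does internally. CLDR defines four categories for Polish (one, few, many, other), so the Polish table has four entries.

```typescript
// Plural forms keyed by CLDR plural category; "other" is mandatory.
type PluralForms = { [category: string]: string } & { other: string };

const roomsEn: PluralForms = { one: "# room", other: "# rooms" };
const roomsPl: PluralForms = {
  one: "# pokój",
  few: "# pokoje",
  many: "# pokoi",
  other: "# pokoju", // used for fractions, e.g. 1.5
};

// Hypothetical helper: pick the form for the locale and insert the number.
function pluralize(locale: string, count: number, forms: PluralForms): string {
  const category = new Intl.PluralRules(locale).select(count);
  // Fall back to "other" when the category has no dedicated form.
  const form = forms[category] || forms.other;
  return form.replace("#", new Intl.NumberFormat(locale).format(count));
}

console.log(pluralize("en", 1, roomsEn));
console.log(pluralize("en", 3, roomsEn));
console.log(pluralize("pl", 2, roomsPl));
console.log(pluralize("pl", 5, roomsPl));
console.log(pluralize("pl", 22, roomsPl));
```

Note that 22 uses the same form as 2 in Polish, which is exactly the kind of rule that is hard to handle without locale data.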

Select

The same as for plurals, instead of repeating messages for different variants, we have an option to select between different argument forms. In ICU, select argument lets us select between multiple fixed options in our variable. This way we can show all variable options within one message and cover different variants in a single translation.

The syntax for select arguments is {variable, select, forms}, and in forms we put all options for our argument. As with plurals, the other form is required and is used when no other option matches.

// Message
"Thank you for booking {bookedRoom, select, "
                                        "dorm {a bed in dorm room}"
                                        "private {a private room}"
                                        "other {a room}} in our hostel"
// Result
"Thank you for booking a private room in our hostel"
// and
"Thank you for booking a bed in dorm room in our hostel"

Select provides translators with the full content of the translated message, which improves translation quality and accuracy. Instead of creating separate messages for each variable option, all options are kept in one place, giving the translator as much context as possible.
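
The selection logic itself is simple and can be sketched in a few lines. The selectForm and bookingMessage helpers below are hypothetical and only imitate how a select argument is resolved: the exact key is used when it exists, and the other form is used in all remaining cases.

```typescript
// Select forms keyed by option name; "other" is the mandatory fallback.
type SelectForms = { [key: string]: string } & { other: string };

// Hypothetical helper: resolve a select argument for the given value.
function selectForm(value: string, forms: SelectForms): string {
  const hasKey = Object.prototype.hasOwnProperty.call(forms, value);
  return hasKey ? forms[value] : forms.other;
}

const roomForms: SelectForms = {
  dorm: "a bed in dorm room",
  private: "a private room",
  other: "a room",
};

function bookingMessage(bookedRoom: string): string {
  return "Thank you for booking " + selectForm(bookedRoom, roomForms) + " in our hostel";
}

console.log(bookingMessage("private"));
console.log(bookingMessage("suite")); // unknown option, falls back to "other"
```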

Complex argument types

We can build more complex structures in ICU message format by combining plural and select arguments. In such cases, we nest one argument inside the other so that the structure stays readable. Writing full sentences inside each form helps to keep the text clear and consistent.

For complex arguments which use both plural and select, the ICU docs recommend using select first and then nesting the plural argument inside it.

"{gender, select, "
  "female {"
    "{num_guests, plural, offset:1 "
      "=0 {{host} doesn't invite guests to her new hotel opening.}"
      "=1 {{host} invites {guest} to her new hotel opening.}"
      "=2 {{host} invites {guest} and one other guest to her new hotel opening.}"
      "other {{host} invites {guest} and # other guests to her new hotel opening.}}}"
  "male {"
    "{num_guests, plural, offset:1 "
      "=0 {{host} doesn't invite guests to his new hotel opening.}"
      "=1 {{host} invites {guest} to his new hotel opening.}"
      "=2 {{host} invites {guest} and one other guest to his new hotel opening.}"
      "other {{host} invites {guest} and # other guests to his new hotel opening.}}}"
  "other {"
    "{num_guests, plural, offset:1 "
      "=0 {{host} doesn't invite guests to their new hotel opening.}"
      "=1 {{host} invites {guest} to their new hotel opening.}"
      "=2 {{host} invites {guest} and one other guest to their new hotel opening.}"
      "other {{host} invites {guest} and # other guests to their new hotel opening.}}}}"

In the example above, we have a sentence with several variables: {host} invites {guest} to {her/his/their} new hotel opening. The host can be of any gender, so first we need to cover all possessive forms (her/his/their). We use select here because we choose from a predefined list of options.

Next, we need to cover the cases where a different number of guests is invited. We do that using plural and creating a specific message for each form. Because of offset:1, the first guest is named explicitly and # shows the number of remaining guests. Such a structure gives a clear and comprehensible view of the text, even in more complicated messages.
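
The effect of offset:1 is easy to miss, so here is a rough sketch of how a plural argument with an offset is resolved. The PluralArg type and the formatPlural helper are hypothetical, and the host and guest names are hard-coded to keep the example short. Exact selectors such as =0 or =2 are compared with the original value, while the plural category and the # symbol use the value minus the offset.

```typescript
interface PluralArg {
  offset: number;
  exact: { [value: number]: string }; // forms for "=0", "=1", ...
  categories: { [category: string]: string } & { other: string };
}

// Hypothetical helper imitating plural resolution with an offset.
function formatPlural(locale: string, value: number, arg: PluralArg): string {
  // 1. Exact matches are tested against the original value.
  const exactForm = arg.exact[value];
  // 2. The plural category is chosen from the value minus the offset.
  const shifted = value - arg.offset;
  const category = new Intl.PluralRules(locale).select(shifted);
  const form =
    exactForm !== undefined ? exactForm : arg.categories[category] || arg.categories.other;
  // 3. "#" is replaced with the offset-adjusted number.
  return form.replace("#", new Intl.NumberFormat(locale).format(shifted));
}

const guestForms: PluralArg = {
  offset: 1,
  exact: {
    0: "Anna doesn't invite guests.",
    1: "Anna invites Ben.",
    2: "Anna invites Ben and one other guest.",
  },
  categories: { other: "Anna invites Ben and # other guests." },
};

console.log(formatPlural("en", 2, guestForms));
console.log(formatPlural("en", 4, guestForms)); // 4 guests: Ben plus 3 others
```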

Number formatting

The way numbers are formatted can vary depending on the language and country. It applies to different aspects of displaying numerical values, like:

  • decimal formatting (thousand separator, rounding)
  • currencies
  • measurement units
  • percentages (placement of the % symbol)
  • scientific notations
  • compact notation

For example, in the USA the character used as a thousand separator is a comma (,), in Poland it is a space ( ), and in Spain it can be either a space or a dot (.). The number 12999.99 is formatted as 12,999.99 in the USA, while in Poland or Spain we would write it as 12 999,99.

12999.99        // Decimal number
12,999.99       // US formatting
12 999,99       // PL formatting

The recommended ICU class for number formatting is NumberFormat, which helps in formatting numbers, currencies, and units for any locale.

The syntax for number formatting is {variable, number, format}, where format is the selected number format.

The default NumberFormat formats available in ICU messages are integer (a number), currency and percent.

12999.99        // Decimal number
$12,999.99      // US currency
12 999,99 zł    // Polish currency
25%             // Percent (for the value 0.25)
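
The same locale differences can be observed with the JavaScript Intl.NumberFormat class, used here as a stand-in for the ICU NumberFormat class. This is only a sketch; the exact output may vary slightly between ICU versions, and the Polish grouping separator is a non-breaking space rather than a regular one.

```typescript
const amount = 12999.99;

// Decimal formatting for different locales.
const usNumber = new Intl.NumberFormat("en-US").format(amount); // comma groups, dot decimals
const deNumber = new Intl.NumberFormat("de-DE").format(amount); // dot groups, comma decimals
const plNumber = new Intl.NumberFormat("pl-PL").format(amount); // space groups, comma decimals

// Percent formatting multiplies the value by 100.
const percent = new Intl.NumberFormat("en-US", { style: "percent" }).format(0.25);

console.log(usNumber, deNumber, plNumber, percent);
```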

For more examples, details, and tips regarding number formatting, check Jakub's blog post about number formatting in JavaScript.

Currencies

Currency requires a specific format as it consists of two elements, a number and a currency symbol or name. By default, the currency is set from the locale data, but we can specify the currency formatting using different methods under NumberFormat class.

Here are different ways of formatting currencies using NumberFormat according to ICU docs:

  • ICU4C (C++) NumberFormat.setCurrency() which takes a Unicode string with the 3-letter code.
  • ICU4C (C API) unum_setTextAttribute() with UNUM_CURRENCY_CODE selector.
  • ICU4J NumberFormat.setCurrency() takes an ICU Currency object which encapsulates the 3-letter code.
  • JDK's NumberFormat.setCurrency() takes a JDK Currency object which encapsulates the 3-letter code.
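
For comparison, here is how the same idea looks in the JavaScript Intl API, where the 3-letter ISO 4217 code is passed as the currency option. This is a sketch using Intl.NumberFormat, not one of the ICU4C or ICU4J calls listed above.

```typescript
const price = 12999.99;

// The locale decides the number layout; the currency code decides the symbol.
const usd = new Intl.NumberFormat("en-US", { style: "currency", currency: "USD" }).format(price);
const eurInUs = new Intl.NumberFormat("en-US", { style: "currency", currency: "EUR" }).format(price);
const pln = new Intl.NumberFormat("pl-PL", { style: "currency", currency: "PLN" }).format(price);

console.log(usd);
console.log(eurInUs);
console.log(pln); // symbol placed after the amount, e.g. 12 999,99 zł
```

Keeping the locale and the currency code separate makes it possible to show a price in euros to a user who prefers US number conventions.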

You can find the current list of 3-letter ISO 4217 currency codes on six-group.com.

Number skeletons

In ICU, a number skeleton is a string that defines a pattern for formatting numbers. A skeleton defines the overall structure of the number format, such as the number of decimal places, the presence of a currency symbol, and other such characteristics. Skeletons are used with the Unicode CLDR (Common Locale Data Repository) to provide a flexible and powerful way to format numbers for a wide variety of locales and cultures.

Unicode CLDR (Common Locale Data Repository) is a database of locale-specific data that provides information such as date, time, and number formats, as well as translations for names of languages, countries, and time zones, and other locale-specific data. It is used to enable software to be adapted to meet the linguistic and cultural requirements of users in different regions of the world.

A skeleton is a string made up of space-separated tokens that represent different elements of a number format, such as the unit, the precision, or the notation. For example, the skeleton "currency/USD .00" represents a currency format in US dollars with exactly two decimal places. Skeletons should not be confused with older decimal format patterns such as "#,##0.00", which describe a thousands separator and two decimal places using # and 0 characters.
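
ICU number skeletons are written as space-separated tokens such as currency/USD or .00. To show how such tokens map to formatting settings, here is a deliberately small sketch that translates three token types into Intl.NumberFormat options. The skeletonToOptions function is hypothetical, it is not part of ICU or of the Intl API, and it ignores most of the real skeleton syntax.

```typescript
// Hypothetical helper: supports only "currency/XXX", ".00"-style precision
// and "group-off". Real ICU skeletons have many more tokens.
function skeletonToOptions(skeleton: string): Intl.NumberFormatOptions {
  const options: Intl.NumberFormatOptions = {};
  for (const token of skeleton.trim().split(/\s+/)) {
    if (token.indexOf("currency/") === 0) {
      options.style = "currency";
      options.currency = token.slice("currency/".length);
    } else if (/^\.0+$/.test(token)) {
      // ".00" means exactly two fraction digits.
      const digits = token.length - 1;
      options.minimumFractionDigits = digits;
      options.maximumFractionDigits = digits;
    } else if (token === "group-off") {
      options.useGrouping = false;
    } else {
      throw new Error("Unsupported skeleton token: " + token);
    }
  }
  return options;
}

const grouped = new Intl.NumberFormat("en-US", skeletonToOptions("currency/USD .00")).format(1234.5);
const ungrouped = new Intl.NumberFormat("en-US", skeletonToOptions("currency/EUR .00 group-off")).format(1234.5);

console.log(grouped);
console.log(ungrouped);
```

In an ICU message, a skeleton is written after a double colon, for example {price, number, ::currency/USD .00}.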

Date and Time Formatting

Date and time are formatted differently in different countries. Formats can differ in separators, in the number of digits used for days, months and years, and in the order of these elements.

For example, in the majority of European countries dates are written as DD/MM/YYYY, while in the US or Canada a widely used date format is MM/DD/YYYY. September 2nd of 2023 can be written as 02/09/2023 or 09/02/2023, which can be confusing for users if the date format is not adjusted to the format they are used to. That is why it is important to localize your software properly when it is used by different users across the world.

Check ICU date symbols table to see all date patterns and syntax.

In ICU, date and time formatting is used to format date and time values according to the conventions of a specific locale. For that, we use DateFormat class, which is an abstract base class for date and time formatting.

The syntax for date formatting is {variable, date, format}. The variable is the name of the argument that is replaced with the actual date in the final text. It is the same with time formatting: {variable, time, format}. In place of format, we put the selected ICU date or time format.

There are four default options for date formats (examples for US English; the exact output depends on the locale and ICU version):

  • short, e.g., 9/2/23
  • medium, e.g., Sep 2, 2023
  • long, e.g., September 2, 2023
  • full, e.g., Saturday, September 2, 2023

ICU default formats for time (examples for US English; the time zone name depends on the locale and the user's time zone):

  • short: 9:30 AM
  • medium: 9:30:28 AM
  • long: 9:30:28 AM CET
  • full: 9:30:28 AM Central European Standard Time

Here you can see an example of date and time formatting:

// Message
"Your room {roomNumber} is ready for your check-in on {checkinDate, date, medium}."

// Result
"Your room 5 is ready for your check-in on Mar 2, 2023."

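A comparable result can be produced with the JavaScript Intl.DateTimeFormat class. The sketch below spells out the date fields explicitly instead of using the ICU style names, and it fixes the time zone to UTC so that the output does not depend on the machine it runs on. The variable names are made up for this example.

```typescript
// 2 March 2023 (months are zero-based in JavaScript).
const checkinDate = new Date(Date.UTC(2023, 2, 2));

// Roughly equivalent to the "medium" date style in US English.
const mediumUs = new Intl.DateTimeFormat("en-US", {
  year: "numeric",
  month: "short",
  day: "numeric",
  timeZone: "UTC",
}).format(checkinDate);

// Long month names in US English and in German.
const longOptions: Intl.DateTimeFormatOptions = {
  year: "numeric",
  month: "long",
  day: "numeric",
  timeZone: "UTC",
};
const longUs = new Intl.DateTimeFormat("en-US", longOptions).format(checkinDate);
const longDe = new Intl.DateTimeFormat("de-DE", longOptions).format(checkinDate);

console.log("Your room 5 is ready for your check-in on " + mediumUs + ".");
console.log(longUs);
console.log(longDe);
```
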
ICU Message Syntax Tester

To test ICU message syntax, you can use the ICU Message Syntax Tester. You can enter a message with ICU syntax and see how it will be formatted for different locales. This helps to ensure that your ICU messages are valid and will display correctly in different languages.

Try our ICU Message Syntax Tester

Conclusion

To sum up, ICU is a powerful library that provides support for the world's languages, scripts, and locales. It provides a set of APIs and data files that enable applications to work with any language. ICU helps applications to display text correctly in different languages, format dates and times according to local customs, and sort text according to local conventions.

ICU is important in the localization process for several reasons:

  • Unicode Support: ICU provides robust support for the Unicode standard, which is the most widely used character encoding standard for representing text in a wide range of languages. This allows software developers to handle text in many languages and scripts consistently.
  • Extensive Data and APIs: ICU includes a large set of data and APIs for working with different locales and cultures. This data includes information such as date and time formats, number formats, and translations for names of languages, countries, and time zones. This data is provided in a way that is easy to access and use, which makes it easy for software developers to localize their software.
  • Platform-Independent: ICU is platform-independent and available for a wide range of platforms and languages. This makes it easy to use in any development environment and easy to integrate with other software.
  • Robust and Reliable: ICU is widely used and has a large user base, which means that it is well-tested and robust. Additionally, it's maintained and developed by the Unicode Consortium, which is an industry standards organization that specializes in character encoding and software internationalization.
  • Customizable: ICU allows for customization of the data to suit specific needs. This can be done by creating a custom locale data file and using it along with the CLDR data.

All of these factors make ICU a powerful tool for software developers who need to internationalize and localize their software. It provides a comprehensive set of features and data that makes it easy to handle text and data in many different languages and scripts in a consistent way, and it's available for a wide range of platforms and languages, which makes it easy to integrate with other software.

Working with i18n content

Software localization brings many challenges for all involved teams, from developers, to translators and managers. The main issue is usually localization quality.

Providing multilanguage content using ICU messages is a great way of managing i18n software translations. It provides maximum information for translators to cover all language varieties and forms. Thanks to that, translation quality can greatly improve and thus increase your customers' satisfaction with your software.

SimpleLocalize can help you with your software translation management. Our tools and integrations create a developer-friendly environment with a simple and intuitive integration process. It can automate your translation workflow and give all team members one centralized place for localization management. Get started now for free and import your translation files to begin managing your translations.

ICU official homepage or GitHub repository

ICU documentation

Unicode CLDR specs

Number formatting in JavaScript

Kinga Pomykała
Content creator of SimpleLocalize