Thursday, May 2, 2024

Unicode Technical Committee (UTC) Updates from Meeting #179

by Peter Constable, UTC Chair

The Unicode Technical Committee (UTC) met last week (April 23 to 25) in San Jose, California. Thanks to Unicode member company Adobe for hosting. Here are some highlights from the large number of items that were covered.

Preparing Unicode 16.0 Beta

An important objective was to cover all technical decisions that would be needed for the Unicode 16.0 Beta preview. The Beta will be available for public review and comment on May 21, 2024, and will include all charts, data and annexes for The Unicode Standard as well as other synchronized standards, including UTS 10, Unicode Collation Algorithm, and UTS 51, Unicode Emoji. Also, for the first time, the Beta release will include a complete draft of the core text of the standard.

The character repertoire for Unicode 16.0 was slightly adjusted, with the removal of two characters: U+0CDC KANNADA ARCHAIC SHRII and U+0C5C TELUGU ARCHAIC SHRII. These characters were first approved in January 2022 (UTC #170) and assigned for addition in Unicode 16.0 in April 2023 (UTC #175). However, in the ISO process for Amendment 2 of ISO/IEC 10646:2022 (which is to be synchronized with Unicode 16.0), the India national body requested more time for review by experts in India. To avoid a risk of Unicode 16.0 and Amendment 2 of 10646 not being in sync, UTC decided to delay these two characters for a later version.

Various character property (UCD) and algorithm changes were made based on issues reported during the Alpha review or found while the UTC Properties and Algorithms Working Group prepared data files for 16.0. Two notable areas for changes are grapheme cluster segmentation (UAX #29) and line breaking (UAX #14):
  • For grapheme clusters, some changes will be made to extended grapheme cluster segmentation for improved handling of orthographic syllables in Indic scripts.
  • For line breaking, several changes will be made to data and rules to fix various edge cases, and to incorporate behaviour for hyphens that has already been implemented in CLDR and ICU for several years.
Also related to properties, the organization of the ScriptExtensions.txt file will be changing. Previously, lines of data were grouped by characters that had the same script extension property values. Going forward, lines will be ordered by code point. (This is only a change in the order the data is listed; the parsing of lines is unchanged.) This will make it much easier to compare changes in property values between different Unicode versions.
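Because only the order of lines changes, a parser that does not depend on ordering should give identical results for both layouts. The following is a minimal sketch of that idea, not official tooling: the function names `parse_script_extensions` and `diff_versions` are hypothetical, and the sample lines are illustrative rather than copied from any released data file. It assumes the usual UCD line shape of a code point or range, a semicolon, a space-separated value list, and an optional `#` comment.

```python
from typing import Dict, FrozenSet, Tuple

ScriptMap = Dict[int, FrozenSet[str]]


def parse_script_extensions(text: str) -> ScriptMap:
    """Map each code point to its set of Script_Extensions values."""
    result: ScriptMap = {}
    for raw_line in text.splitlines():
        # Drop trailing comments and skip blank or comment-only lines.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        code_field, value_field = (part.strip() for part in line.split(";", 1))
        # A line covers either a single code point or a first..last range.
        if ".." in code_field:
            first, last = code_field.split("..")
        else:
            first = last = code_field
        scripts = frozenset(value_field.split())
        for cp in range(int(first, 16), int(last, 16) + 1):
            result[cp] = scripts
    return result


def diff_versions(
    old: ScriptMap, new: ScriptMap
) -> Dict[int, Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Return code points whose value sets differ, in code point order."""
    changed = {}
    for cp in sorted(old.keys() | new.keys()):
        before = old.get(cp, frozenset())
        after = new.get(cp, frozenset())
        if before != after:
            changed[cp] = (before, after)
    return changed


# Illustrative sample data (assumed for this sketch, not real file contents).
GROUPED_BY_VALUE = """
# lines grouped by shared value
1CF7          ; Beng # example comment
1CD5..1CD6    ; Beng Deva
"""

ORDERED_BY_CODE_POINT = """
1CD5..1CD6    ; Beng Deva
1CF7          ; Beng
"""
```

Parsing both samples yields the same mapping, so `diff_versions` reports no changes between them. With real files, the same comparison would list only genuine property changes between versions.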

In relation to emoji, the set of new emoji for version 16.0 is unchanged. During the Beta review, the draft update for UTS #51, Unicode Emoji, will include some proposed revisions related to recommendations for display of emoji family combinations. These revisions have not yet been reviewed and approved by UTC, so will require careful review and will be subject to confirmation or change at the next UTC meeting, after the Beta review period is over.

UTC action item backlog

UTC has had a growing backlog of open action items, some over ten years old. For this meeting, the various UTC working groups triaged their action items that were five or more years old, and outcomes were discussed by the UTC. Some action items were completed; some were closed as no longer relevant. Many that required more research were closed as UTC action items and replaced by issues in the relevant working group’s GitHub repo. Note that tracking them in this other way doesn’t necessarily mean they will get higher priority. However, since the working groups are using GitHub issues to organize their regular work, this should bring more attention to these issues. UTC will repeat this process at UTC #181, six months from now.

As a side effect of this review of old action items, a document was submitted to UTC (L2/24-123) proposing that UTC transition from the way it has handled action items in the past to tracking issues in a public GitHub repo to allow contributions from a broader set of volunteers. That document identifies some problems and limitations of the existing processes, and suggests that a new process could provide improvements. UTC spent some time discussing this document. It was noted that the idea was valuable, though such a change in processes would not be a small change and would involve some not-so-obvious challenges. It would also be something that affects the Unicode Consortium as a whole, not just UTC. For that reason, this proposal will need to be considered as part of a broader discussion of Consortium processes, resources and infrastructure.

New investigation: automatic space handling at inter-script boundaries

East Asian text often combines different scripts, and a common typographic practice is to insert space between script runs. UTC briefly discussed a new document, L2/24-057, which proposes development of an algorithm for automatic spacing between script runs. The Properties and Algorithms Working Group has assembled experts to discuss this topic. Interested experts are invited to participate in discussion via issues (with "auto-spacing" label) in the public unicodetools repo in GitHub.


Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

SILICON Joins as Supporting Member of the Unicode Consortium

The Unicode Consortium is pleased to announce that SILICON has joined as a Supporting Member.

The Stanford Initiative on Language Inclusion & Conservation in Old & New Media (SILICON) is a humanities-led tech initiative at Stanford University aiming to promote and sustain Digitally Disadvantaged Languages and, more broadly, address digital inequalities. Bridging gaps between Engineering, the Humanities, Computer Science, and the Social Sciences, the initiative seeks to help build tomorrow’s digital tools: improved OCR algorithms and AI generative text models; more globally inclusive text corpora, interfaces, keyboards, and digital fonts.

SILICON is interested in accelerating the timeline for digitally disadvantaged languages to be fully usable by their communities, by facilitating ongoing conversation between people involved in Unicode’s encoding work, designers of the fonts and keyboards, script and language communities, and technical experts, linguists, and technologists. We will also be working towards usable OCR for newly-encoded languages, with an eye towards developing corpora for LLM training.

“In the 21st century, the intertwining fate of language death and digital exclusion underscores a critical challenge: the marginalization and potential extinction of diverse linguistic heritage. With over 98% of the world’s ~7000 languages categorized as ‘Digitally Disadvantaged Languages’ by the Unicode Consortium, the urgency to bridge this digital divide is unmistakable. SILICON is delighted to support the pivotal role played by Unicode, long at the forefront of advancing the cause of Digitally Disadvantaged Languages globally.” - Tom Mullaney, Professor of History at Stanford University and Co-Director of SILICON 

“We are excited to welcome SILICON as a Supporting member of the Unicode Consortium. By integrating SILICON’s interdisciplinary expertise, we look forward to working together to advance digital inclusiveness.” - Toral Cowieson, CEO of Unicode

Supporting members of the Consortium have a half vote as well as representation on up to two technical committees. A list of Consortium members can be found here.


Thursday, April 18, 2024

Unicode CLDR v45 released

The Unicode CLDR v45 is now available and has been integrated into version 75 of ICU. The CLDR v45 release page has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

CLDR 45 did not have a Survey Tool submission phase, and focused on tooling and just a few functional areas:

MessageFormat 2.0 Tech Preview

Software needs to construct messages that incorporate various pieces of information. The complexities of the world's languages make this challenging. The goal for MessageFormat 2.0 is to allow developers and translators to create natural-sounding, grammatically correct user interfaces that can appear in any language and support the needs of various cultures.

The new MessageFormat defines the data model, syntax, processing, and conformance requirements for the next generation of dynamic messages. It is intended for adoption by programming languages, software libraries, and software localization tooling. It enables the integration of internationalization APIs (such as date or number formats), and grammatical matching (such as plurals or genders). It is extensible, allowing software developers to create formatting or message selection logic that adds on to the core capabilities. Its data model provides the means of representing existing syntaxes, thus enabling gradual adoption by users of older formatting systems.

Keyboard 3.0 stable version

Keyboard support for digitally disadvantaged languages (DDLs) is often lacking or inconsistent between platforms. The updated LDML Keyboard 3.0 format specifies an interchange format for keyboard data. This will allow keyboard authors to create a single mapping file for their language, which implementations can use to provide that language’s keyboard mapping on their own platform. This format allows both physical and virtual (that is, on-screen or touch) keyboard layouts for a language to be defined in a single file.

Tooling changes

Many tooling changes are difficult to accommodate in a data-submission release, including performance work and UI improvements. The changes in v45 provide faster turn-around for linguists and higher data quality. They are targeted at the v46 submission period, starting in May 2024.


Wednesday, April 17, 2024

ICU 75 Released

Unicode® ICU 75 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 75 updates to CLDR 45 (beta blog) locale data with new locales and various additions and corrections. C++ code now requires C++17 (C code now requires C11) and is being made more robust.

The CLDR MessageFormat 2.0 specification is now in technology preview, together with a corresponding update of the ICU4J (Java) tech preview and a new ICU4C (C++) tech preview.

For details, please see https://icu.unicode.org/download/75.

Friday, April 5, 2024

Unicode CLDR v45 Beta available for specification review

The Unicode CLDR v45 Beta is now available for specification review and integration testing. The release is planned for April 17th, but any feedback on the specification needs to be submitted well in advance of that date. The specification is available at Draft LDML Modifications. The biggest change is the new Message Formats and Keyboards section; see also the Migration section.

The beta has already been integrated into the development version of ICU. We would especially appreciate feedback from ICU users and non-ICU consumers of CLDR data, and on Migration issues.

Feedback can be filed at CLDR Tickets.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

CLDR 45 did not have a Survey Tool submission phase, and focused on tooling and just a few functional areas:

MessageFormat 2.0 Tech Preview

Software needs to construct messages that incorporate various pieces of information. The complexities of the world's languages make this challenging. The goal for MessageFormat 2.0 is to allow developers and translators to create natural-sounding, grammatically correct user interfaces that can appear in any language and support the needs of various cultures.

The new MessageFormat defines the data model, syntax, processing, and conformance requirements for the next generation of dynamic messages. It is intended for adoption by programming languages, software libraries, and software localization tooling. It enables the integration of internationalization APIs (such as date or number formats), and grammatical matching (such as plurals or genders). It is extensible, allowing software developers to create formatting or message selection logic that adds on to the core capabilities. Its data model provides the means of representing existing syntaxes, thus enabling gradual adoption by users of older formatting systems.

Keyboard 3.0 stable version

Keyboard support for digitally disadvantaged languages (DDLs) is often lacking or inconsistent between platforms. The updated LDML Keyboard 3.0 format specifies an interchange format for keyboard data. This will allow keyboard authors to create a single mapping file for their language, which implementations can use to provide that language’s keyboard mapping on their own platform. This format allows both physical and virtual (that is, on-screen or touch) keyboard layouts for a language to be defined in a single file.

Tooling changes

Many tooling changes are difficult to accommodate in a data-submission release, including performance work and UI improvements. The changes in v45 provide faster turn-around for linguists and higher data quality. They are targeted at the v46 submission period, starting in May 2024.

For more information

See the draft CLDR v45 release page, which has information on accessing the data, reviewing charts of the changes, and — importantly — Migration issues.


Thursday, March 28, 2024

Wikimedia Foundation Joins as an Associate Member of the Unicode Consortium

The Unicode Consortium is pleased to announce that Wikimedia Foundation has joined as its latest organizational member.

The Wikimedia Foundation hosts the most multilingual top-10 website on the internet, Wikipedia, which is built collaboratively by people around the world – and in more than 300 languages.

“Our projects have been supported by the resources, technical projects, and forums supported by the Unicode Consortium. Through initiatives like the CLDR project, for example, Language and Internationalization engineers at the Wikimedia Foundation and volunteers on our projects have contributed to the open-source knowledge infrastructure that serves a shared mission with the Consortium of expanding language representation online. We’re looking forward to future collaborations with the Unicode Consortium as the Wikimedia movement continues to prioritize global access to knowledge in the fast-changing digital space.”
— Selena Deckelmann, Chief Product and Technology Officer

“We’re excited to welcome the Wikimedia Foundation as our latest organizational member, enhancing our shared mission to promote global language representation. We greatly appreciate the contributions from the Wikimedia team over the years and look forward to accelerating our collaboration.”
— Toral Cowieson, CEO

Associate members of the Consortium have observation status on one technical committee. The list of Consortium members can be found here.


Volunteer Spotlight — Roozbeh Pournader

“The rewards have been much greater than I could have ever imagined.”

Roozbeh Pournader has been a volunteer at Unicode since 1999. His first exposure to Unicode was in the late 1990s, when he and several colleagues were working on a project called FarsiWeb at Sharif University, trying to figure out how best to support Persian on the internet. They encountered several “home-made” solutions.

They also discovered The Unicode Standard. Quickly realizing Unicode was the right path, Roozbeh reached out and found a welcoming community of supportive and like-minded technologists. Roozbeh recalls engaging with Mark Davis, Ken Whistler, and Rick McGowan who “were there to help anybody who wanted to use or contribute to Unicode.” Roozbeh became an active contributor when recognizing areas within Unicode that he could help improve.

Roozbeh arrived in the United States in 2008, having represented an Iranian standards organization to Unicode before that. Since then, he has also been a technical representative for several US organizations to the Consortium, a Technical Director, and is currently a Vice Chair of the Script Encoding Working Group. He remembers how excited he was to be involved in person with Unicode when he first came to the US, sharing that he walked off the plane on a Wednesday and on Friday he was attending his first UTC meeting.

Asked what he likes most about working with Unicode, he admits that it is hard to say. He knows that what he has done and is doing has an enormous impact on communities around the world. Roozbeh came from an underserved community and is pleased that he can use his skills to support minority languages, historical languages, and writing systems for other underserved communities. He is especially proud of his work in the area of Arabic script and says every day is a joy to work with people who are so dedicated to the Unicode cause.

On a personal note, Roozbeh enjoys eating anything his wife prepares, and says that as scientists (his wife is a biologist), they are fascinated by scientific methods of cooking and experimenting in the kitchen. His favorite dish is Kabab Tabe’i, a mixture of ground meat, onions and tomatoes. When asked about any hobbies outside of work, Roozbeh says that Unicode is really not his work anymore, it is his life’s work and where he spends most of his free time.

His parting words were to encourage anyone who is interested in Unicode to get involved. There are big and small ways to contribute, and he recommends just reaching out as the Unicode community is so welcoming.

Roozbeh says he has made lifetime friends of the highest quality and for that he is forever grateful.


Tuesday, March 26, 2024

Cathy Wissink joins Unicode Board, other Board Updates

The Unicode Consortium is excited to welcome Cathy Wissink to the Board effective immediately. Cathy is a 30-year veteran of the global tech industry. Most of her career was spent at Microsoft, with her early tenure devoted to internationalization support for Windows. She then spent 15 years focused on global government and regulatory affairs. In her most recent role at Microsoft, Cathy managed a part of Microsoft’s standards portfolio supporting regulatory needs in forums like ISO/IEC JTC1, CEN/CENELEC, and NIST, and also led Microsoft’s product certification process for China.

Cathy led Microsoft’s participation in the Unicode Technical Committee from 2000 to 2005, and served as UTC vice-chair and INCITS/L2 chair from 2002 to 2005.

The Board also elected Tim Brandall and Salvo Giammarresi as Chair and Vice Chair of the Finance and Funding Committee. Tim was also named Treasurer of the Board.
