Many people who study Cantonese start from a love of the language. Working out the technical pieces is a secondary task that has not received consistent, focused effort. As a community of developers, we need to identify how to make existing data more readily available to researchers, users, teachers, and learners of Cantonese.

This effort on datasets includes:

  • Reviewing, cleaning, and (with permission) republishing existing datasets.
  • Converting existing datasets into interoperable and maintainable formats.
  • Creating new datasets.

Beyond merely capturing the datasets, we must also make that data beneficial to researchers, users, teachers, and learners of Cantonese. This requires improving existing tooling, identifying missing tools, and building them.

Datasets

Many existing datasets are published in poorly interoperable formats and could benefit from consolidation and maintenance.

Input Method Editors

Input Method Editors that work across all platforms should be created. These input method editors should use the same training data for autocompletion across each platform.

Target platforms:

  • iOS
  • macOS
  • Windows
  • Android
  • ChromeOS
  • Linux

Input method types:

  • Drawing
  • Stroke
  • Cangjie
  • Phonetic
  • Others?

To give multi-platform users the best experience, all training data should be centrally synced so that behavior is consistent across platforms.

Romanization and Syllabarization Tools

Jyutping has become the standard romanization for Cantonese and provides a syllabary for the language. There are, however, many other romanizations of Cantonese in use. Tools should be built for full conversion between:

  • Jyutping (Prefix numeric, Postfix numeric, Diacritics)
  • Yale (Prefix numeric, Postfix numeric, Diacritics)
  • IPA
  • SL Wong
  • Sidney Lau
  • Penkyamp
  • Cantonese Pinyin
  • Canton Romanization
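
As a minimal sketch of the simplest conversion in the list above, the following moves the tone digit of a numeric Jyutping syllable between the end and the start. It assumes that "postfix numeric" means forms like "jyut6" and "prefix numeric" means forms like "6jyut"; that reading, and the function names, are assumptions for illustration. Converting between different systems (for example Jyutping to Yale) needs full onset and final mapping tables, which this sketch does not attempt.

```python
import re

# Assumed notations: "jyut6" (postfix numeric) and "6jyut" (prefix numeric).
_POSTFIX = re.compile(r"^([a-z]+)([1-6])$")
_PREFIX = re.compile(r"^([1-6])([a-z]+)$")


def postfix_to_prefix(syllable: str) -> str:
    """Move the tone digit from the end of a syllable to the front."""
    match = _POSTFIX.match(syllable)
    if match is None:
        raise ValueError(f"not a postfix-numeric syllable: {syllable!r}")
    letters, tone = match.groups()
    return tone + letters


def prefix_to_postfix(syllable: str) -> str:
    """Move the tone digit from the front of a syllable to the end."""
    match = _PREFIX.match(syllable)
    if match is None:
        raise ValueError(f"not a prefix-numeric syllable: {syllable!r}")
    tone, letters = match.groups()
    return letters + tone


def convert_phrase(phrase: str, convert) -> str:
    """Apply a syllable converter to each space-separated syllable."""
    return " ".join(convert(syllable) for syllable in phrase.split())
```

For example, `convert_phrase("gwong2 dung1 waa2", postfix_to_prefix)` returns `"2gwong 1dung 2waa"`. Rejecting malformed syllables with an error, rather than passing them through, keeps bad data from silently spreading between datasets.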

Character Sets

Chinese-language text has historically been re-encoded many times. To avoid losing previously created documents to time, point-in-time to point-in-time character set conversion should be offered. This should be configurable to use either known-shipped mapping tables, including their possible errors, or an idealized standards-based methodology.

  • HZ
  • Big5
  • ISO 10646
  • HKSCS
  • Unicode
  • Others
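
A minimal sketch of the conversion step, using only codecs that ship with the Python standard library ("hz", "big5", "big5hkscs", and "cp950", Microsoft's shipped variant of Big5). The `transcode` helper is a hypothetical name. Choosing the source codec by name is one simple way to make the mapping table configurable; true point-in-time tables for specific historical products would need custom codecs that this sketch does not provide.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from a legacy character set and re-encode them.

    `source` names the mapping table to use, for example "big5" or
    "cp950". Decoding is strict, so bytes that the chosen table cannot
    map raise UnicodeDecodeError instead of being silently replaced.
    """
    text = data.decode(source)
    return text.encode(target)


# Round-trip a short string through Big5 as a demonstration.
legacy_bytes = "香港".encode("big5")
print(transcode(legacy_bytes, "big5").decode("utf-8"))
```

Strict decoding is a deliberate choice here: for archival conversion, a loud failure on an unmapped byte sequence is usually more useful than a document that quietly loses characters.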

Learning Tools

A language needs learning tools to help the next generation of users learn it. Cantonese has only a limited set of learning material available. This project should build:

  • Flash Card Templates
    • Preconstructed flash card sets for known exam targets (KS1, KS2).
  • Dynamic Question Generators for time, money, numbers, and other calculable patterns.
  • Character Writing Practice Tools, stroke-by-stroke
    • Songti
    • Kaiti
    • Heiti
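
The dynamic question generators listed above could start from something like the following sketch, which writes the integers 0 to 99 in Chinese numerals and builds a random question from them. The function names and the question wording are hypothetical, and colloquial forms such as 廿 for twenty are not covered.

```python
import random

DIGITS = "零一二三四五六七八九"


def to_chinese_numeral(n: int) -> str:
    """Write an integer from 0 to 99 in Chinese numerals."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0 to 99")
    tens, ones = divmod(n, 10)
    if tens == 0:
        return DIGITS[ones]
    # 10 to 19 are written 十, 十一, ...; 20 and up are written 二十, 二十一, ...
    prefix = "十" if tens == 1 else DIGITS[tens] + "十"
    return prefix + (DIGITS[ones] if ones else "")


def make_question(rng: random.Random) -> dict:
    """Build one number-writing question with its expected answer."""
    n = rng.randint(0, 99)
    return {
        "prompt": f"Write {n} in Chinese numerals.",
        "answer": to_chinese_numeral(n),
    }
```

Passing in a `random.Random` instance, rather than using the module-level functions, lets a teacher reproduce the same question set from a seed.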

Publishing Tools

There are very few books published in Cantonese. This project should aim to make it as easy as possible to publish a book either in Cantonese or for teaching Cantonese. These tools should allow for streamlined publishing as an ebook, PDF, print-on-demand title, or even a full print run. This should likely also include tools for TeX and for publishing research articles.

Captioning and Coding Tools

A large portion of the material that uses the Cantonese language comes from spoken sources. To make that material more accessible for machine processing, it needs to be transcribed from audio sources into written datasets. These datasets then need to be coded to provide the additional data required at post-processing time.

Machine Translation

Current machine translation of Cantonese is imprecise and should be improved.

Voice Recognition

Voice recognition for Cantonese requires massive amounts of data. An effort should be undertaken, possibly in partnership with Mozilla Common Voice, in order to capture as much data as possible.

Optical Character Recognition

Optical character recognition should support handwriting as well as Songti, Kaiti, and Heiti typefaces.

Word Segmentation

Identifying word boundaries in written Cantonese is a difficult task. This should be implemented in a way that is API-compatible with ICU (or tools that build on top of it) and uses more complete datasets and/or better models.

This can be used as an input to improved line breaking tools.
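
To illustrate the problem, here is a minimal sketch of forward maximum matching, one of the simplest dictionary-based segmentation techniques. It is not ICU's algorithm and is not API-compatible with ICU; the `segment` function and the toy lexicon are assumptions for illustration only.

```python
def segment(text: str, lexicon: set, max_len: int = 4) -> list:
    """Split text into words by greedy longest-match against a lexicon.

    At each position, try the longest candidate first and shrink until
    a lexicon entry matches. A single character is always accepted so
    that unknown text still makes progress.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon; a real one would come from a maintained dataset.
toy_lexicon = {"我哋", "香港", "香港人", "係"}
print(segment("我哋係香港人", toy_lexicon))
```

With this lexicon the greedy match prefers 香港人 over 香港, which shows both the appeal of the approach and its weakness: the result depends entirely on lexicon coverage, which is why more complete datasets or better models matter.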

Grammar and Spell Checking

Once word boundaries are identified, a checker can combine them with a list of confusables to detect mistyped characters based on sound, adjacency, and other signals. This should also work for romanizations of Cantonese.
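
A minimal sketch of the confusables idea: for a word that is not in the lexicon, try swapping each character for its listed confusables and keep the variants that are real words. The `suggest` function, the toy lexicon, and the toy confusables table are hypothetical; a real table would be built from sound, shape, and input-method data.

```python
from itertools import product


def suggest(word: str, lexicon: set, confusables: dict) -> list:
    """Return lexicon words reachable by swapping confusable characters."""
    # For each position, allow the original character or any confusable.
    options = [[ch, *sorted(confusables.get(ch, ()))] for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    return [c for c in candidates if c != word and c in lexicon]


# Toy data for illustration only.
toy_lexicon = {"關係"}
toy_confusables = {"系": {"係"}}
print(suggest("關系", toy_lexicon, toy_confusables))
```

The number of candidates grows multiplicatively with word length and table size, so this brute-force enumeration only suits short segmented words, which is one reason the check runs after word segmentation.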

Dictionaries

The existing dictionaries are all flat lists. They do not model the relationships between entries strongly enough to support additional research and learning tools built on top of them.

These dictionaries should support Sense:Cantonese lookup across a variety of languages, prioritizing Cantonese, Mandarin, and English.
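
One way to picture the difference from a flat list is a sense-centered model, where each sense links word forms across languages. The sketch below is an assumption about how such a model might look, not an existing schema; the class names, the language tags ("yue", "cmn", "en"), and the sample entry are illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lexeme:
    language: str  # language tag, e.g. "yue", "cmn", "en"
    form: str      # written form of the word


@dataclass
class Sense:
    sense_id: str
    lexemes: list = field(default_factory=list)


class SenseIndex:
    """Index senses by (language, form) so lookups can cross languages."""

    def __init__(self, senses):
        self._by_form = {}
        for sense in senses:
            for lexeme in sense.lexemes:
                key = (lexeme.language, lexeme.form)
                self._by_form.setdefault(key, []).append(sense)

    def translate(self, form: str, source: str, target: str) -> list:
        """Return target-language forms that share a sense with `form`."""
        results = []
        for sense in self._by_form.get((source, form), []):
            results.extend(
                lx.form for lx in sense.lexemes if lx.language == target
            )
        return results


eat = Sense("eat-1", [Lexeme("yue", "食"), Lexeme("cmn", "吃"), Lexeme("en", "eat")])
index = SenseIndex([eat])
print(index.translate("eat", "en", "yue"))
```

Because every lookup passes through a sense, the same structure answers English-to-Cantonese, Mandarin-to-Cantonese, and Cantonese-to-English queries without separate word lists.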

Sign Language

Provide tools for learning sign language as a user, and as a translator.

Character Compendium

Character information regarding history and construction should be consolidated to make learning simpler. Examples of common mnemonics for particular characters should be included.

Fonts

There are thousands of characters. Programmatic tools for generating characters exist, such as the Wenlin Character Description Language, and can be adopted to fill in gaps prior to further standardization of characters.

Further, identifying, creating, and distributing high-quality open source fonts should be a priority. High-quality open source Songti, Heiti, and Kaiti fonts should each be released.

A programmatic font may be extremely valuable to speed up adoption of new characters.

Locale Handling for Translation

There should be a standard fallback procedure to ensure that more Chinese-language materials are better localized for different populations, whether in the diaspora, Taiwan, Hong Kong, Macau, Singapore, or the Mainland.

This procedure should handle BCP 47 fallback ordering intelligently.
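
As a starting point, the sketch below implements the truncation step of the RFC 4647 "Lookup" scheme for BCP 47 tags, then tries a caller-supplied list of extra fallback tags. The `resolve` function and the idea of an explicit `extra_fallbacks` list (for example falling back from "yue-Hant-HK" to "zh-Hant-HK") are assumptions for illustration; which cross-language fallbacks are appropriate is exactly the policy question raised above.

```python
def truncation_chain(tag: str) -> list:
    """List a tag and its progressively truncated forms (RFC 4647 lookup)."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag (such as "x") may not end a tag.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def resolve(requested: str, available: set, extra_fallbacks=()):
    """Pick the first available tag, or None if nothing matches.

    Comparison is exact for brevity; real tags are case-insensitive
    and should be normalized first.
    """
    for tag in [requested, *extra_fallbacks]:
        for candidate in truncation_chain(tag):
            if candidate in available:
                return candidate
    return None


print(truncation_chain("yue-Hant-HK"))
print(resolve("yue-Hant-HK", {"zh-Hant", "en"}, extra_fallbacks=["zh-Hant-HK"]))
```

Plain truncation alone would never reach "zh-Hant" from "yue-Hant-HK", which is why an explicit, reviewable fallback list is sketched here rather than left to each application.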

Text-To-Speech and Voices

Given a passage in written Cantonese, turn it into spoken Cantonese. This should also support the translation from written formats to spoken formats.

The voices should be natural, with speakers from different locales and of different genders. Fully robotic voices should also be available for assistive technology users.

Content

Above all, the more content that exists in Cantonese, the better the situation is for all Cantonese users.