Text encoding débacles

Fletcher

50-Minute Talk

You might think that text encoding is a problem that UTF-8 solved. For many developers that is basically true, but PostgreSQL continues to support dozens of encodings, as well as multi-encoding configurations. That leaves some rough and even dangerous edges, with implications even if you only use UTF-8. I want to present prototypes that address them with a practical model, along with some other opportunities I have spotted along the way.

  • Overview of the PostgreSQL text encoding model, related OS concepts and motivations
  • The holes in that model, including shared catalogs and views, authentication, file systems and more
  • Which usage patterns let us get away with that, and which do not?
  • A proposed model to nail down the encoding of everything, while allowing for reasonable usage patterns
  • Overview of the closely related pg_wchar, its holes and possible improvements
  • Opportunities to go faster
  • What would it take to support NUL in text?
