How dbt Labs demonstrates its inclusive value by correcting Taiwan in its country list
ISO 3166-1 country list names considered harmful. Use GENC names
TL;DR: The list of country codes and names is one of the most fundamental datasets, and at the same time a highly political one. Taiwan is often incorrectly listed as "Taiwan, Province of China" because of ISO 3166-1. Kudos to dbt Labs for setting a stellar example in addressing this issue, and to the team for living their value of inclusiveness. If you work on international user-facing products, use the FIPS 10-4 / GENC names to be more geopolitically neutral.
As much as I talk about dbt with friends and customers, I never thought my first Substack post about dbt would be on this topic, one that has long haunted me and many Taiwanese. And this made me love dbt and the company even more!
dbt, Don't We All Love It
dbt, the open source data transformation tool, is a modern wonder in many ways:
An enabling technology that even created a whole new data role: the Analytics Engineer.
A "simple but not simplistic" approach that puts simple tech to great use. You might frown on templated SQL fragments as an anti-pattern, but dbt managed to create an ecosystem by organizing them into reusable modules, allowing modern data modeling to benefit from software development practices.
A successful open source business model with a thriving community, and commercial offerings that the community loves.
This post is not about dbt’s metrics layer, Python models, data contracts, or the data check tool we are building. It is about dbt Labs as a company and its inclusiveness on a controversial and frustrating topic.
Paying for dbt Cloud
This starts with the kind of support ticket that I often send whenever I see a country list that is simply based on ISO 3166-1. This time it was with dbt Cloud:
While we upgrade to paid team plan, the billing info country dropdown list Taiwan as "Taiwan, Provence of China", which is politically provoking.
I understand you are likely using the iso3166-1 names, but the general practice in us federal government and the industry is using the FIPS 10-4 (formerly GENC, endorsed by U.S. Board on Geographic Names) as a more politically neutral list of country codes.
Well, it has been a while since I sent this kind of message, and I got the GENC / FIPS 10-4 ancestry wrong. Nevertheless, within 30 minutes I got the first response:
Thank you for reaching out to dbt Labs Support, and I appreciate you bringing this to our attention!
I will forward your feedback to our internal teams to review, and hopefully get this changed to make it more neutral.
dbt Labs works hard to create a diverse and inclusive environment here, and that includes for our users too.
Meanwhile, Karen brought this up in #local-taipei with Amanda from dbt Labs DevRel. Two days later it was changed! This is one of the best experiences I have had with such requests:
I just wanted to update you after working with the team internally to get this changed. The team has recently merged a PR that should show the country only as Taiwan. Can you please confirm that you can see this change on your end through the Billing page?
ISO 3166, Their Friends, and China
A most political dataset
Naming things, or having the authority to name things, is incredibly powerful. ISO 3166-1 is probably the most widely used standard list of countries, and also probably the most political dataset. The controversy is not only that Taiwan is listed as a province of China, but also that Kosovo is not included.
One of the world’s most democratic places
To quote Noah Smith, whom I met a few weeks ago when he visited, on Taiwan:
Though independent for most of its history, it was colonized multiple times — by the Netherlands in the 1600s, China in 1683, and Japan in 1895. This forced it to deal with social frictions between the various waves of colonizers, immigrants, and settlers, and also gave it a yearning for self-determination and freedom from outside powers.
Though Taiwan was a dictatorship for decades after being invaded and settled by the defeated Nationalists of China in 1949, it began democratic reforms in the 1980s. In 1990, student movements protested for full democracy and direct elections, and they won. Today it’s one of the world’s most democratic places, with a Polity score of 10 (compared to the U.S.’ 8) and a Freedom House score of 93 (compared to the U.S.’ 86).
Pressure from China
Commercial entities under immense pressure from China, such as airlines, list Taiwan as "Taiwan, China" or else have to apologize to China. Most international events, like the Olympics, use "Chinese Taipei" under China's influence, including the International Olympiad in Informatics I attended 25 years ago. I still remember the sadness.
The US Government adopted a somewhat neutral, strategically ambiguous policy in its Geopolitical Entities, Names, and Codes (GENC) standard, well before the current geopolitical tension of 2022:
The US Government cannot use ISO 3166 directly. US Public Law 80-242 (1947) requires the US Government to use geographic names that have been approved by the BGN. ISO 3166 contains some country and subdivision names that vary from those approved by the BGN. The geopolitical entities included in ISO 3166 are those that are recognized by the United Nations (UN). The GENC Ed1.0 is the US Government implementation of ISO 3166-1 that conforms to BGN and US Government recognition policy.
What options do we have?
So what other options do we data people have if we want the data properly normalized and “right”? (And it does change, by the way: Turkey will change its official name to “Türkiye” in December.)
ICANN’s list of country code top-level domains (ccTLDs) can be a good source, as the list uses codes derived from ISO 3166-1 but with a more neutral “entity” description. However, I can’t find an officially published dataset for this.
The GENC list also has slightly different country codes from ISO 3166-1 for some countries, such as Sri Lanka. However, its descriptions will likely be more neutral and timely [1].
How dbt and Other Open Source Projects Deal With the Issue
From the brief conversation with support and DevRel staff at dbt Labs, I believe they are empowered to make decisions based on their value of inclusiveness, and I am deeply impressed.
If you are building an international user-facing product, use ISO 3166-1 with a grain of GENC or ICANN salt: for example, refer to the common country names in FIPS 10-4 / GENC. Take it as seriously as dbt Labs does.
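One way to apply this in practice is to keep the ISO 3166-1 codes as stable keys and layer a small table of display-name overrides on top. The Python sketch below is only an illustration under stated assumptions: the three-entry sample list, the `DISPLAY_NAME_OVERRIDES` table, and the `display_countries` helper are all hypothetical names made up for this post, not an official GENC dataset and not how dbt Labs implemented its fix.

```python
# Illustrative sample only: a few (alpha-2 code -> ISO 3166-1 short name) pairs.
# A real product would load the full list from its own data source.
ISO_3166_1_SAMPLE = {
    "JP": "Japan",
    "TW": "Taiwan, Province of China",
    "DE": "Germany",
}

# Hypothetical, hand-maintained overrides keyed by alpha-2 code.
# Only the display name changes; the code stays the join key.
DISPLAY_NAME_OVERRIDES = {
    "TW": "Taiwan",
}


def display_countries(iso_names, overrides):
    """Return (code, display name) pairs sorted by display name.

    An override wins whenever one exists for a code; otherwise the
    ISO 3166-1 name is used unchanged.
    """
    merged = {code: overrides.get(code, name) for code, name in iso_names.items()}
    return sorted(merged.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for code, name in display_countries(ISO_3166_1_SAMPLE, DISPLAY_NAME_OVERRIDES):
        print(f"{code}  {name}")
```

Keeping the overrides in a separate table means the upstream list can be refreshed when names change, while the handful of deliberate corrections stays visible and reviewable in one place.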
If you run into projects or services that seem to base their country lists solely on ISO 3166-1, share your concerns with the provider, assume best intentions, and tell them how others are dealing with it.
Here are a few other examples via Irvin:
Mozilla: Lists of Countries and Regions - MozillaWiki
FreeBSD: Internationalization Policy | The FreeBSD Project
Thanks to Karen Hsieh, Ipa Chiu, and Dave Flynn for reading and providing feedback on the draft.
Have you dealt with country code data and changes? Leave a comment to share your experience.
[1] At the time of writing, I can’t find the official GENC list that used to be on nsgreg.nga.mil.