Universal Acceptance (UA) is a foundational requirement for a truly multilingual Internet, one in which users around the world can navigate entirely in local languages. It is also the key to unlocking the potential of new generic top-level domains (gTLDs) to foster competition, consumer choice and innovation in the domain name industry. To promote UA, the Universal Acceptance Steering Group (UASG), a community-led initiative, was founded in February 2015 by ICANN.
• It is tasked with undertaking activities to promote the Universal Acceptance of all valid domain names and email addresses.
• Its members come from more than 120 companies (including Apple, GoDaddy, Google, Microsoft, and Verisign), governments, and community groups.
From a common man's perspective, the multilingual Internet may be defined as:
A set of tools and services with which one can easily create, communicate, transact, process and retrieve information in the digital medium without language barriers.
The Internet has become all-pervasive and part and parcel of our lives. We have all witnessed the power of the Internet, particularly during the pandemic, when it helped us stay connected and carry on business as usual. As of January 2021, there were 4.66 billion active Internet users globally, and 90-95% of Internet consumption came from social networking alone.
The convergence of AI (Artificial Intelligence) and IoT (Internet of Things) has redefined the way industries, businesses and economies function. Speech-to-speech technologies, facial recognition, virtual assistants, machine translation systems, natural language processing, natural language generation and many more are now a growing part of our lives and help dissolve language barriers.
Some 50 billion devices were expected to be connected in 2022. Within 50 years, we are predicted to have the technology to embed Internet transceivers in the human brain, and by 2069 the brain-machine interface is expected to be fully developed, with the Internet ecosystem acting as a catalyst for human advancement. Residential Internet speeds will touch 10 gigabits per second, 10 times faster than today’s networks.
Multilingualism is crucial to bringing the next one billion users onto the network. There are 7,000 languages and dialects used across the globe. In India we have 22 scheduled languages, with one-to-many and many-to-many relationships between scripts and languages.
As an example, the Devanagari script alone covers 10 scheduled languages, namely Boro (Bodo), Dogri, Konkani, Hindi, Maithili, Marathi, Nepali, Santali, Sanskrit and Sindhi, while Sindhi is written in Devanagari as well as in the Perso-Arabic script.
This diversity is well captured by the Hindi phrase:
कोस-कोस पर पानी बदले, चार कोस पर वाणी। (The water changes every kos, and the speech every four kos.)
The next billion Internet users will likely come from non-English speaking countries. Providing access for these users will require more than supporting internationalized or multilingual content: localized domain names and email addresses are also required.
Consumption as well as creation of multilingual content is also on the rise, which is a boon to advances in human-inspired systems. Advances in Machine Learning have led to remarkable progress in Natural Language Processing (NLP), the field of Artificial Intelligence that gives computers the ability to understand human language.
For most of the Internet's history, domain names were available only in Latin characters, and having a domain name in one’s own language was a distant dream. Today, with the initiative of ICANN, having domain names in any script or language of the world has become a reality.
The Internet landscape has changed dramatically over the last decade with the expansion and evolution of available Top-Level Domains (TLDs), generic Top-Level Domains (gTLDs), Internationalized Domain Names (IDNs) and Email Address Internationalization (EAI).
Since 2010, the industry has seen the introduction of IDNs, which are based on different languages and scripts. More than 1,200 new generic Top-Level Domains (new gTLDs) were registered, and Email Address Internationalization (EAI) also started appearing on the scene. The new gTLDs include short as well as long Top-Level Domain names, which removed the restriction of 3 characters at the TLD level.
Though the Internet and the Domain Name System (DNS) have transformed, many websites and applications have not kept up with the changes. Many systems still cannot process all domain names or email addresses, in particular Internationalized Domain Names and internationalized email addresses, and their owners have not realized that the growth in Internet users depends on this.
The Universal Acceptance (UA) initiative of the Universal Acceptance Steering Group (UASG) of ICANN addresses this issue, and solutions are already available from the industry for different technological platforms. However, everyone in the chain needs to be UA-ready to achieve a truly multilingual and inclusive Internet.
To support the new Top-Level Domains and email addresses, applications and systems must be capable of five fundamental actions: Accept, Validate, Store, Process and Display. Software and online services support Universal Acceptance when they offer these five actions for all domain names and email addresses.
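The five actions can be illustrated with a minimal Python sketch for an internationalized email address. The function names and the sample address are hypothetical and chosen only for illustration. The sketch relies on Python's built-in "idna" codec, which implements the older IDNA2003 rules (RFC 3490); a production system would more likely use a library that follows IDNA2008.

```python
import unicodedata


def accept(address: str) -> str:
    """Accept: take Unicode input as-is, only trimming and normalizing to NFC."""
    return unicodedata.normalize("NFC", address.strip())


def validate(address: str) -> tuple[str, str]:
    """Validate: minimal structural check that does not reject non-ASCII text."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or "." not in domain:
        raise ValueError(f"not a usable email address: {address!r}")
    return local, domain


def process_domain(domain: str) -> str:
    """Process: ASCII (A-label) form of the domain, e.g. for a DNS lookup."""
    return domain.encode("idna").decode("ascii")


def display_domain(ascii_domain: str) -> str:
    """Display: turn the ASCII form back into Unicode for the user."""
    return ascii_domain.encode("ascii").decode("idna")


# Store: keep the NFC Unicode string in a UTF-8 capable field.
stored = accept("  उपयोगकर्ता@उदाहरण.भारत ")
local, domain = validate(stored)
wire = process_domain(domain)
print(wire)                  # ASCII-only form used on the wire
print(display_domain(wire))  # Unicode form shown back to the user
```

The point of the sketch is that validation stays structural: a pattern that only admits ASCII letters and two- or three-letter TLDs would reject exactly the addresses UA is meant to support.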
A new horizon has opened up with the possibility of having Internationalized Domain Names in one’s mother tongue and script. ICANN has opened up gTLDs beyond the original eight, viz. .com, .org, .net, .int, .edu, .gov, .mil and .arpa, which were created in the 1980s. Today one can have a gTLD with the name of an organization, a city such as .delhi, or an institution such as .iitmumbai. These changes have opened up new vistas and exciting possibilities and, at the same time, technological challenges as well as legal and security issues.
The need
Though 65 percent of the world’s population is connected to the Internet, 92 percent of web pages are published in only 12 languages, and 60 percent of Internet publications are in English alone. There are 7,000 languages and dialects used across the globe, and the next billion Internet users will likely come from non-English speaking countries. Hence there is a need for a technological shift to bring these next billion-plus users online.
Many of the next billion Internet users are not online because the systems that enable their access do not support their language. India is a multilingual country in which 92 percent of the population is non-English speaking; for want of proper support and availability of tools and technologies, 50 percent of India’s population is not yet online. Providing Internet access to these users will require technological solutions beyond merely internationalized or multilingual content. Localized domain names and email addresses need to be part of these solutions.
Internet users in India:
• A total of 481 million Internet users, of which 295 million are urban and 186 million rural
• Of 892 million potential new users, 160 million are urban
• Potential new users in rural India number 732 million
• If language support is enabled, 205 million will join
Popular web platforms and applications are increasing support for multilingualism:
• Facebook supports more than 110 languages (compared with 100 languages last year) and is actively increasing the number of languages it supports
• Google Translate is available for more than 100 languages
• Twitter supports 34 languages
• The world’s most popular apps are also increasing the number of supported languages: WhatsApp is available in up to 60 languages, Instagram in 35 languages
As a result of our analysis of the language of content associated with IDNs, we can state that:
• IDNs help to enhance linguistic diversity in cyberspace
• The IDN market is more balanced in favour of emerging economies
• IDNs are accurate predictors of the language of web content.
Benefits
Top-Level and Internationalized Domains have evolved and matured as far as the technology is concerned. For increasing business reach and creating greater opportunities, UA readiness of applications and services is crucial. People are generally more comfortable trusting and communicating in their local language. A local language identity (i.e., an email address) is easier for non-English speaking users to use when participating in government, social, banking and other online applications. UA allows businesses to expand their customer base by offering products, technologies and services to various countries in their own languages. Businesses can now communicate, share information and provide products, technologies and services in the customer’s language, creating trust and building huge business potential while bringing the next billion-plus users online. Government services can also communicate with users in their local language, creating inclusiveness and better adoption.
National Scenario
As of January 2022, there are 1,488 active TLDs, including 153 IDN TLDs (mostly ccTLDs). These include India’s 15 IDN ccTLDs, covering 22 Indian languages represented using 11 scripts (10 Unicode code charts).
NIXI has already started offering Indian language domain names in all 22 scheduled languages. The .भारत (.bharat) IDN ccTLD (using the Devanagari script) covers 8 languages: Bodo (Boro), Dogri, Hindi, Konkani, Maithili, Marathi, Nepali, and Sindhi-Devanagari, while the .ভারত IDN ccTLD covers 2 languages, Bengali and Manipuri. The set also includes ccTLDs in RTL scripts, viz. Urdu, Sindhi and Kashmiri. These domain names are offered by accredited registrars, and the registrant can register an Indian language domain name in the language of his/her choice. Offerings of email in one’s own language are also on the rise. The following table shows the IDN ccTLDs in the scripts mentioned and the languages supported. Annexure I includes the “List of scheduled Indian languages and major scripts used”.
Application support - A study in 2017, carried out by Donuts and ICANN staff, looked at 749 of the Alexa top-1,000 websites and found that only 7% of the sites allowed users to use Internationalized Email Addresses in fields that require an email address to be filled in.
Security considerations in Internationalised Domain Names
As the Internet has become a critical resource facing constant security attacks and threats, the DNS has also been attacked and threatened. However, new protocols, developments and operational best practices have increased the resilience, stability and security of the DNS protocol and the global DNS infrastructure.
Security considerations in Domain Names
Homograph / Homoglyphs
Visual illusion already existed in ASCII domain names and was not introduced by IDNs. It is created by using confusingly similar characters.
• Among ASCII characters, 1 (digit) and l (letter l) look similar, as do 0 (digit) and O (letter O). These character pairs can be used for visual tricks.
• vishwakosh.com and vishvvakosh.com (the latter uses two consecutive “v” characters)
• www.ICICI.com and www.lClCl.com (the latter uses a lowercase L instead of “I”); such cases are also not addressed
• “rnicrosoft.com” looks much like “microsoft.com”
Homograph attacks, which are widely known, abuse homoglyphs to create lookalike URLs. Instead of going to a legitimate site, you may be directed to a malicious site that looks identical to the real one. While the number of combinations of similar-looking characters increases when Internationalised Domain Names are used, the risk can be mitigated with the IDNA 2003/2008 protocols.
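One common way to catch ASCII lookalikes of the kind listed above is to reduce each name to a "skeleton" in which confusable characters collapse to one representative, and then compare skeletons. The sketch below assumes a tiny hand-written table built only from the examples in this section; it is not the Unicode confusables data, and the function names are hypothetical.

```python
# Illustrative, hand-written tables covering only the examples in the text.
CHAR_MAP = {"I": "l", "1": "l", "0": "o", "O": "o"}
SEQUENCE_MAP = {"rn": "m", "vv": "w"}


def skeleton(name: str) -> str:
    """Collapse confusable characters and sequences to one representative."""
    out = "".join(CHAR_MAP.get(ch, ch) for ch in name).lower()
    for sequence, replacement in SEQUENCE_MAP.items():
        out = out.replace(sequence, replacement)
    return out


def looks_alike(a: str, b: str) -> bool:
    """True when two different names share the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)


print(looks_alike("rnicrosoft.com", "microsoft.com"))
print(looks_alike("vishvvakosh.com", "vishwakosh.com"))
print(looks_alike("www.lClCl.com", "www.ICICI.com"))
```

A registry or brand owner could run such a comparison against a list of protected names before accepting a registration; a skeleton match only flags a candidate for review and does not prove malicious intent.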
Corruption / misspellings
Additionally, a spoofing attack can be made by corrupting a name. Adding extra labels after a well-known brand name, or including the brand name in the path of a URL labelled as secure, can confuse users, particularly those from rural areas, regardless of whether an IDN is used.
However, it is not possible to address corrupted or misspelled domain names such as those below:
pay-pal.com          nixi-support.in    color.com     localisation.in    हिंदी    इण्डिया
paypal-online.com    ni-xi.in           colour.com    localization.in    हिंदी    इण्डिया
paypal24.com         nixi123.in
Domain names are not necessarily “meaningful”, and it is not possible to have 100% accurate rules or databases to handle the different linguistic variations in any given language. The IDNA 2003 / 2008 protocols do not treat such spellings as variants, just as other domains have historically always treated alternate spellings such as www.color.com and www.colour.com as separate entities.
Single script confusable:
Spoofing characters entirely within one script or using characters common across scripts (such as numbers).
Example
Confusable using Latin character set
• lOl.in (use of small L and capital O)
• 101.in (use of numerals)
• dze.in (uses basic Latin character set)
• ʣe.in (uses the Unicode character “ʣ” (U+02A3), Latin Small Letter Dz Digraph)
Mixed script confusable
Spoofing characters within more than one script and not a single script confusable.
Mix of Latin and Cyrillic
• paypal.in (use of Latin character set)
• pаypаl.in (use of а - U+0430 - CYRILLIC SMALL LETTER A)
Mix of Latin and Greek
• top.in (use of Latin character set)
• tοp.in (use of ο - U+03BF - GREEK SMALL LETTER OMICRON)
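A mixed-script label such as the ones above can be flagged with a rough check on Unicode character names. The sketch below is a heuristic under stated assumptions: it takes the first word of each letter's Unicode name as its script, which works for Latin, Cyrillic and Greek but is not the Unicode Script property. The function names are hypothetical.

```python
import unicodedata


def scripts_in(name: str) -> set[str]:
    """Approximate the set of scripts used by the letters of a name."""
    scripts = set()
    for ch in name:
        if not ch.isalpha():
            continue  # skip dots, hyphens, digits and combining marks
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(name: str) -> bool:
    return len(scripts_in(name)) > 1


latin_only = "paypal.in"
with_cyrillic_a = "p\u0430yp\u0430l.in"   # U+0430 CYRILLIC SMALL LETTER A
with_greek_omicron = "t\u03bfp.in"        # U+03BF GREEK SMALL LETTER OMICRON

for candidate in (latin_only, with_cyrillic_a, with_greek_omicron):
    print(sorted(scripts_in(candidate)), is_mixed_script(candidate))
```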
Whole script confusable
Mixed script confusable where each of the strings is entirely within one script.
Example:
• caxap.in (use of Latin character set)
• сахар.in (use of сахар, U+0441 U+0430 U+0445 U+0430 U+0440, in Cyrillic)
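A whole-script confusable passes a mixed-script check, because each label stays within one script. Listing the code points makes the difference visible. This is a small illustrative sketch; the helper name is hypothetical.

```python
import unicodedata

latin = "caxap"
cyrillic = "\u0441\u0430\u0445\u0430\u0440"  # renders much like "caxap"


def code_points(label: str) -> list[str]:
    """Describe each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in label]


for label in (latin, cyrillic):
    print(code_points(label))

# The strings differ even though they may be rendered almost identically.
print(latin == cyrillic)
```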
Bidirectional Spoofing
Example:
• com .سلام. سلام//:http
• http:// دائم. a. سلام. com
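Names that combine right-to-left and left-to-right labels, as in the examples above, can be displayed in an order that differs from the order in which they are stored. The sketch below classifies each label by the bidirectional class of its characters. It is a simplification and does not implement the full Bidi rule of RFC 5893; the function names are hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # right-to-left and Arabic-letter classes


def label_direction(label: str) -> str:
    """Classify one label as 'rtl', 'ltr', 'mixed' or 'neutral'."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    has_rtl = bool(classes & RTL_CLASSES)
    has_ltr = "L" in classes
    if has_rtl and has_ltr:
        return "mixed"
    if has_rtl:
        return "rtl"
    return "ltr" if has_ltr else "neutral"


def directions(host: str) -> list[tuple[str, str]]:
    """Direction of every label in a host name, in stored (logical) order."""
    return [(label, label_direction(label)) for label in host.split(".") if label]


host = "\u0633\u0644\u0627\u0645.com"  # Arabic label followed by .com
print(directions(host))
```

A user interface could use such a classification to decide when to show the name label by label, or alongside its ASCII form, so that the reader can tell which label is the TLD.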
Syntax Spoofing
Examples that direct us to bad.com:
• http://example.com⁄x.bad.com (beware of U+2044 Fraction Slash)
• http://example.com?x.bad.com (beware of missing fonts as question marks)
• http://example.com—long-and-obscure-list-of-characters.bad.com (this one is already seen in the wild)
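The examples above work because a character that looks like a URL delimiter is in fact part of the host name. A simple defensive check is to flag any host character that is not a letter, combining mark, digit, hyphen or dot. The sketch below is illustrative; the function names are hypothetical, and the "last two labels" guess ignores multi-label public suffixes.

```python
import unicodedata


def suspicious_host_chars(host: str) -> list[str]:
    """List characters that should not normally appear in a host name."""
    flagged = []
    for ch in host:
        if ch in ".-":
            continue
        category = unicodedata.category(ch)
        if category[0] in "LM" or category == "Nd":
            continue  # letters, combining marks and decimal digits
        flagged.append(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}")
    return flagged


def registrable_guess(host: str) -> str:
    """Naive guess at the domain actually contacted: the last two labels."""
    return ".".join(host.split(".")[-2:])


host = "example.com\u2044x.bad.com"  # U+2044 looks like "/" but is not one
print(suspicious_host_chars(host))
print(registrable_guess(host))
```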
Mitigating the Security threats
ASCII domain names have their own set of security threats, imposed by the underlying layers of the technology itself, and these threats apply to IDNs as well. Though IDNs serve multilingualism and an inclusive Internet, they bring an additional set of security threats, mainly from the linguistic characteristics of various languages, as mentioned above.
The majority of the issues related to IDNs arise from the application perspective (including security) and from protocols. IDNs have their own unique security concerns, imposed by linguistic characteristics of various language sets such as diacritics, variants and digit mixing. The IDNA protocol addresses these issues with the broad goals of being:
– Unicode-version agnostic
– Easier to understand
– More predictable when languages and scripts are applied and used.
– More adaptable to regional requirements
The protocols are devised mainly to include
1. Latest / current version(s) of Unicode
2. Permissible and valid character repositories
3. Transforming (mapping) a Unicode string to remove case and other variant differences
4. Checking the resulting string for validity, according to certain rules
5. Handling deviations
6. Considerations for Right to Left scripts
7. Transforming Unicode characters into Punycode for working with DNS
8. No script mixing – to minimize the confusability of character across various scripts
9. Language tables for permissible characters with a code block
10. Minimizing impact of transition from IDNA 2003 to IDNA 2008
11. What is allowed on which layer of IDN registrations
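Item 7 in the list above, the transformation into Punycode, can be sketched with Python's built-in "punycode" codec. The sketch covers only that one step: the mapping and validity checks of items 3 and 4 are deliberately left out, and the helper names are hypothetical.

```python
ACE_PREFIX = "xn--"


def to_a_label(u_label: str) -> str:
    """Convert one Unicode label (U-label) to its ASCII form (A-label)."""
    if u_label.isascii():
        return u_label.lower()
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Convert an A-label back to Unicode; other labels pass through."""
    if a_label.lower().startswith(ACE_PREFIX):
        return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return a_label


print(to_a_label("bücher"))   # xn--bcher-kva
print(to_a_label("भारत"))
print(to_u_label(to_a_label("भारत")))
```

The A-label is what is written into zone files and sent in DNS queries, while the U-label is what users should see.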
The IDNA2008 protocol has been developed to resolve many, if not all, of these issues, including bidirectional spoofing.
Two IDNA Standards
First version: named IDNA2003 (RFC 3490)
• Uses algorithms named Stringprep (RFC 3454) and Nameprep (RFC 3491).
• Encoding in ASCII uses Punycode (RFC 3492).
• Identifies an IDN by prefixing xn-- to the Punycode encoding of the domain.
• Was defined against Unicode 3.2 (March 2002).
• Provisions to use new characters (i.e., added after Unicode 3.2) as is.
Second (and latest) version: named IDNA2008.
• RFC 5890 IDNA: Definitions and Document Framework J. Klensin
• RFC 5891 IDNA: Protocol J. Klensin
• RFC 5892 The Unicode Code Points and IDNA P. Faltstrom
• RFC 5893 Right-to-Left Scripts for IDNA H. Alvestrand, C. Karp
• RFC 5894 IDNA: Background, Explanation, and Rationale J. Klensin
IDNA2008 no longer uses Stringprep and Nameprep; however, encoding in ASCII still uses Punycode and the same prefix (xn--). IDNA2008 is much more agile in supporting new characters added by Unicode over time.
• IDNA2008 is more restrictive than IDNA2003: valid domains under IDNA2003 may not be valid under IDNA2008.
• Therefore, it is highly recommended to use the IDNA2008 standard.
• Recommendation: make sure the libraries you are using are based on IDNA2008.
To “facilitate” the transition from IDNA2003 to IDNA2008, Unicode defined a transitional feature in the UTS 46 specification. Under UTS 46, ſſ.example is mapped to ss.example. The IETF does not recommend the use of UTS 46. ICANN supports only IDNA2008; therefore an IDNA2003 or UTS 46 transitional domain is not valid.
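The effect of such mappings can be observed with Python's built-in "idna" codec, which follows the older IDNA2003 rules (RFC 3490) and is therefore a convenient way to see what a non-IDNA2008 library does. The sketch is illustrative only; it does not implement UTS 46, and the helper name is hypothetical.

```python
def idna2003_ascii(name: str) -> str:
    """ASCII form of a name under the IDNA2003 rules of the stdlib codec."""
    return name.encode("idna").decode("ascii")


# IDNA2003 silently maps these characters before encoding.
print(idna2003_ascii("\u017f\u017f.example"))  # long s twice -> ss.example
print(idna2003_ascii("fa\u00df.example"))      # sharp s -> fass.example
```

Because the mapped name is plain ASCII, the user ends up at a different domain from the one typed. This is one reason to check which IDNA version a library implements before relying on it.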
Label Generation Rulesets (LGRs)
Label Generation Rulesets (LGRs) specify metadata, code point repertoire, variant rules and Whole Label Evaluation (WLE) rules to generate labels
Root Zone Label Generation Rules (LGR) Procedure
Generation Panels
• Generate proposals for script specific LGRs, based on community expertise and linguistic, security and stability requirements
Integration Panel
• Integrates them into common Root Zone LGR while minimizing the risk to Root Zone as a shared resource
Label Generation Rules (LGR)
• Which labels are permissible
• Which variant labels exist
• Are there any more constraints?
Label Generation Rules
Root Zone Label Generation Rules (RZ-LGR) provide a conservative mechanism to determine valid IDN TLDs and their variant labels, for stable and secure operation of the DNS Root Zone. They consist of a code point repertoire, variants and Whole Label Evaluation rules. RZ-LGR-4 currently integrates 18 scripts, including Arabic, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil and Telugu.
Reference Label Generation Rules for the Devanagari Script:
This document specifies a reference set of Label Generation Rules (LGR) for the Devanagari script for the second level. The starting point for the development of this LGR can be found in the related Root Zone LGR
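The way an LGR combines a code point repertoire with Whole Label Evaluation rules can be sketched with a toy example. Everything in the sketch is an assumption made for illustration: the repertoire (any Devanagari letter or combining mark) and the single rule (a label may not begin with a combining mark) are not taken from the published Root Zone or reference LGRs, and variant handling is omitted.

```python
import unicodedata


def in_repertoire(ch: str) -> bool:
    """Toy repertoire: Devanagari letters and combining marks only."""
    name = unicodedata.name(ch, "")
    return name.startswith("DEVANAGARI") and unicodedata.category(ch)[0] in "LM"


def evaluate(label: str) -> tuple[bool, str]:
    """Return (is_valid, reason) for a candidate label under the toy rules."""
    if not label:
        return False, "empty label"
    for ch in label:
        if not in_repertoire(ch):
            return False, f"U+{ord(ch):04X} is outside the repertoire"
    # Toy whole-label rule: a label must not begin with a combining mark.
    if unicodedata.category(label[0]).startswith("M"):
        return False, "label begins with a combining mark"
    return True, "ok"


print(evaluate("भारत"))
print(evaluate("\u093e\u092d\u0930\u0924"))  # starts with a vowel sign
print(evaluate("bharat"))
```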
In this context, many software application issues require special attention. Software applications that require attention include:
• Browser applications
– Legacy web browsers do not support IDNs.
• E-mail client and server applications
– Not all email servers support EAI, and very few can offer mailboxes in local languages.
• Suites of office productivity tools
– Support for automatic linking (linkification) of IDN hyperlinks.
• Web-based e-mail services, social networking services, blogging and online banking
– Not all of the above-mentioned services support IDNs.
• Look-up tools and command prompts
– Current look-up tools accept ASCII characters only. Various command prompts (such as cmd) also accept ASCII characters only; if the language is changed, question marks ‘?’ appear.
• DNS registration
– Most DNS-resolving software, if not all, can resolve ASCII characters only. Thus, when deploying or updating domain name entries in zone files, they must be entered in their A-label equivalent rather than their U-label.
• Search engine behaviour and optimisation
– Search engines play an important role in marketing, properly representing and indexing IDNs.
– Website owners hesitate to adopt IDNs for fear that their IDN website will rank low in search results.
• Software development kits (SDKs) and mobile SDKs
– Mobile applications have become very popular. There are thousands of applications, many of which use domain names in the background to fetch or manipulate data.
• Web hosting solution providers
– Examples include hosting automation applications (cPanel, Plesk). End users expect provisioning an IDN hosting package to be as simple as it is for ASCII domain names.
• SSL/digital certificate providers
– An IDN should be easy to sign and good enough for companies to use in e-commerce.
By Mahesh D. Kulkarn
Founder CTO EVARIS SYSTEMS LLP and former, Sr. Director (Corporate R&D), CDAC and HoD GIST.
feedbackvnd@cybermedia.co.in