Tarento joins EkStep to build a pillar of the National Language Translation Mission via the ULCA platform

About the Customer

EkStep Foundation ("EkStep") is a not-for-profit foundation that aims to extend learning opportunities to millions of Indian children through a collaborative, universal platform that facilitates the creation and consumption of educational content. EkStep was founded by Nandan Nilekani, Rohini Nilekani and Shankar Maruwada in collaboration with the leadership team that was instrumental in developing the Aadhaar project.

Overview

India needed a platform focused on building AI-based language technology solutions and on creating multilingual datasets to enable digital services in Indian languages. To bootstrap this effort, known as the National Language Translation Mission (NLTM) or Bhashini initiative of MeitY (Ministry of Electronics and Information Technology), EkStep partnered with Tarento to build one of its critical components, ULCA (Universal Language Contribution API). Tarento built this open, scalable data platform to support the largest collection of Indian language datasets and models.

Tarento as the partner of choice

Tarento, which had already partnered with EkStep in developing the Anuvaad platform (a legal document translation system for various judicial government bodies), was a natural choice for the NLTM initiative as well. EkStep had the vision to build the largest corpus of Indic language data, along with the platform for hosting these datasets and ML models. EkStep, being a non-profit organisation, typically partners with technology entities to fulfill its vision.

Challenges

The major technical challenge was collecting datasets in all 22 Indian languages (some of which are low-resource languages) across various domains, which could be achieved only by writing custom crawlers for various sources. The platform also needed universal API definitions to interact with other hosted systems for dataset and model submission, inference, search, download and so on.

On the non-technical side, an even bigger challenge was encouraging other teams, companies and institutions to prepare their models and datasets in a compliant format and submit them to this new platform.

Solution

Building the platform:

The Tarento team began by designing an architecture that is scalable, reliable and platform agnostic. Several proofs of concept (POCs) were carried out to arrive at the best selection of open-source technologies. The APIs were created after discussions with various stakeholders.

Contribution from multiple talents and teams:

The success of the research teams in the Bhashini initiative is measured by how many of their models and datasets they can make ULCA compliant. To this end, the Tarento team coordinated with various teams, including but not limited to several IITs, IIITs, IISc, CDAC and AI4Bharat. This ensured that all the Indic datasets and models were available in a single place with proper attribution and sanity checks.

Building datasets and models:

As early contributors to ULCA, the internal team curated various datasets and models and ported them to the platform within a short period of time. The Tarento team was also instrumental in creating a few benchmark datasets used in evaluating the models.

Technologies Involved:

Apache Kafka, MongoDB, Apache Druid, Java, Python, Redis, React JS, OpenAPI, Jenkins, Groovy, CSS, HTML, Shell, Azure, AWS, Zuul and CDN

Outcome

Numerous datasets, models and metrics are now readily accessible to users. Here is the full account:

Datasets:

Parallel Corpus: ~215 million pairs across 12 Indic languages, 16 domains

OCR Corpus: ~2.5 million images across 12 Indic languages, 4 domains

ASR Corpus: ~9800 hours across 14 Indic languages, 5 domains

TTS Corpus: ~510 hours (studio quality audio) across 14 Indic languages, 1 domain

ASR Unlabeled Corpus: ~14,000 hours across 23 Indic languages, 5 domains

Transliteration Corpus: ~6 million across 19 Indic languages

Models:

Translation: 155 models across Indic combinations

ASR: 38 models across 19 Indic languages

TTS: 20 models across 12 Indic languages

OCR: 7 models across 7 Indic languages

Transliteration: 21 models across 21 Indic languages

Benchmark Datasets:

Translation: 56 datasets across 15 Indic languages, 3 domains, 5 contributors

ASR: 12 datasets across 7 Indic languages, 1 domain, 3 contributors

Transliteration: 67 datasets across 20 Indic languages, 1 domain, 2 contributors

OCR: (WIP) 23 datasets across 23 Indic languages, 1 domain, 1 contributor

Metrics:

Translation: 5 (bleu, meteor, ribes, gleu, bert)

ASR: 2 (wer, cer)

OCR: 2 (wer, cer)

Transliteration: 3 (cer, top-1 accuracy, top-5 accuracy)
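
To make the error-rate and accuracy metrics above concrete, here is a minimal sketch of how WER, CER and top-k accuracy are commonly defined. It is not ULCA's own implementation: the function names are hypothetical, tokenisation is a plain whitespace split, and characters are counted as Unicode code points rather than grapheme clusters, which is a simplification for Indic scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def top_k_accuracy(references, candidate_lists, k):
    """Share of items whose reference appears among the first k candidates."""
    if not references:
        raise ValueError("references must not be empty")
    hits = sum(1 for ref, cands in zip(references, candidate_lists)
               if ref in cands[:k])
    return hits / len(references)
```

For example, a hypothesis that gets one word wrong out of a five-word reference has a WER of 1/5, and a transliteration model whose correct output is ranked second counts toward top-5 accuracy but not top-1 accuracy.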

And the list keeps growing!

Impact

Premier Repository of Datasets & Models:

By the time Bhashini was launched, ULCA had achieved its mission of hosting the largest collection of Indic datasets and models across various tasks.

Easy Integration with various other systems:

The design of ULCA allows related ecosystems to interact with it more smoothly. For example, BhashaDaan, a crowdsourced dataset contribution system, can now be integrated to push its output to ULCA.

Catalyzing Entrepreneurship:

The Bhashini platform will make Artificial Intelligence (AI) and Natural Language Processing (NLP) resources available in the public domain to MSMEs, start-ups and individual innovators.

Multilingualism presents a major opportunity for start-ups to develop innovative solutions and products that cater to all Indian citizens, irrespective of the language they speak.

Breaking the Language Barrier:

The Digital India Bhashini mission aims to empower Indian citizens by connecting them to the country's digital initiatives in their own language, thereby leading to digital inclusion.

Promoting Digital Government:

The Bhashini platform is interoperable and will catalyze the entire digital ecosystem. It is a giant step toward realizing the goal of Digital Government.

Creating Ecosystem for Products in Local Languages:

Mission Digital India Bhashini will create and nurture an ecosystem involving Central/State government agencies and start-ups, working together to develop and deploy innovative products and services in Indian languages.

Increased Digital Content in Indian Languages:

Mission Digital India Bhashini also aims to substantially increase Indian-language content on the Internet in domains of public interest, particularly governance and policy, and science and technology.

This will encourage citizens to use the Internet in their own language.

Summary

As part of the Bhashini initiative, more than 10 teams have contributed to ULCA with either their datasets or their models. The next and final success criterion will be the use of the ULCA platform by startups and other entities to make useful content available in all official Indian languages.

With the ULCA/Bhashini platform, the foundation for NLP in Indic languages is set. Any research community can come to the Bhashini platform to get the datasets required to train its models and to compete with other models via benchmarking and leaderboards. This can be across any industry and domain (educational, judicial, medical, entertainment, etc.).
