
With rapid technological progress and steadily decreasing costs, an unprecedented volume of personal and population-scale human genomic data is being produced across the globe. Yet, there is a pressing need for a meticulously curated and richly annotated repository of human genome variants to advance large-scale biomedical research, clinical innovation, and broader socio-economic gains.

A major concern is that most current genomic datasets are derived from individuals of European ancestry, leaving other ethnicities, particularly minority and diverse communities, significantly underrepresented. This imbalance has deepened existing global health inequities, leading to reduced diagnostic accuracy, compromised effectiveness of precision medicine, and uneven access to personalized healthcare solutions.

As a result, these disparities further exacerbate socio-economic inequality, with serious ramifications for healthcare systems in India and other low- and middle-income nations. Bridging this gap is now an urgent imperative for medical professionals and health policy leaders working in these regions.

India, home to over 1.4 billion people and accounting for around 18% of the global population, is the most populous country in the world. The nation is known for its vast diversity, comprising over 4,600 anthropologically distinct groups. These populations are classified by caste, tribe, and religion, and they differ in cultural practices, geography, climate, physical traits, marital customs, languages, and genetic makeup. Historically, India served as a key route for early human migration out of Africa via its southern coasts. Over time, multiple waves of migration and invasion have further enriched the genetic diversity of the Indian subcontinent.

Despite its vast genetic diversity, India remains markedly underrepresented in global genomic research. The population is characterized by pronounced stratification into numerous endogamous communities and elevated levels of consanguinity, which contribute to a higher incidence of recessive genetic disorders. Yet, the absence of large-scale, India-focused genome sequencing initiatives has led to a significant gap in the documentation of subpopulation-specific genetic variants within international medical literature and research efforts.

Over the last decade, advances in next-generation sequencing (NGS) technologies and their increasing affordability have transformed the understanding of genetic variation in populations worldwide. Global initiatives such as the 1000 Genomes Project, ExAC (Exome Aggregation Consortium), ESP6500 (the NHLBI Exome Sequencing Project release of roughly 6,500 exomes), and gnomAD (Genome Aggregation Database) have contributed to the development of extensive reference and patient genome datasets across continents. While these datasets include genomes from individuals of Indian origin, the sample size is insufficient to capture the vast genetic diversity and heterogeneity of the Indian population.

In addition to these global efforts, a few studies specific to the Asian and Indian populations have been conducted to explore the genetic landscape in this region. For instance, the Indian Genome Variation (IGV) Consortium examined over 1800 individuals from 55 subpopulations across 900 genes. This study highlighted the genetic heterogeneity of the Indian population and identified unique founder mutations within the subcontinent, improving the understanding of genotype-phenotype relationships. Similarly, the GenomeAsia100K project addressed questions related to Asian populations, including 598 Indian samples primarily representing tribal groups and specific castes, predominantly from southern India.

Given India’s population size, these datasets represent only a small portion of the country’s genetic diversity. To better understand this diversity, comprehensive genome sequencing efforts must focus on the cultural, ethnic, and geographic variation within India. Such population-specific studies can facilitate the identification of genetic variants and polymorphisms linked to diseases, enhance the effectiveness of precision medicine, create robust population-specific reference genome datasets, and improve clinical predictions.

The Indian clinical genomics database IndiGen has garnered significant attention for its efforts in cataloguing genome sequencing data, particularly for rare genetic disorders. Other initiatives spearheaded by the Department of Biotechnology, such as the 10,000 Genome Sequencing Project (GenomeIndia), the National Inherited Diseases Administration (NIDAN) Kendras, and the Unique Methods of Management and treatment of Inherited Disorders (UMMID), are actively working to document population-level genetic variations. The GenomeIndia project has four aims: to generate a reference database of genetic variations specific to the Indian population by sequencing the genomes of 10,000 individuals from 99 diverse ethnic groups; to provide open-access genome data for academic and research purposes through the Indian Biological Data Centre (IBDC) platform; to develop genome-wide and disease-specific genetic chips for affordable diagnostics and research; and, ultimately, to lay the foundation for genome-based precision medicine in India.

In the first phase of the GenomeIndia project, joint variant calling was conducted on 5,750 samples representing 69 distinct population groups across India. The results highlight the genetic complexity of the Indian population and the necessity of a large-scale initiative like GenomeIndia.

Across these 5,750 samples, over 135 million genetic variants were identified. Most were single nucleotide variants (SNVs) and small insertions or deletions (INDELs), with a smaller subset consisting of multi-allelic variants. While a majority of these (~65%) were classified as ultra-rare, more than 6.9 million common variants (approximately 11% of the total) were detected. These variants, widely shared across Indian subpopulations, offer significant promise for genome-wide association studies aimed at identifying genetic determinants of prevalent traits. They are also instrumental in refining gene chips customized to the genetic landscape of the Indian population, as many of them are either rare or absent in global variant databases.
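To make the frequency categories above concrete, here is a minimal Python sketch of how variants might be binned by alternate-allele frequency. The article does not state the cut-offs GenomeIndia used, so the thresholds and the names `COMMON_MIN_AF`, `ULTRA_RARE_MAX_AF`, `frequency_class`, and `summarize` are assumptions chosen purely for illustration.

```python
from collections import Counter

# Hypothetical cut-offs (assumptions): the article does not give the
# thresholds used to call a variant "common" or "ultra-rare".
COMMON_MIN_AF = 0.05
ULTRA_RARE_MAX_AF = 0.001


def frequency_class(alt_allele_count: int, total_alleles: int) -> str:
    """Label one variant by its alternate-allele frequency."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    af = alt_allele_count / total_alleles
    if af >= COMMON_MIN_AF:
        return "common"
    if af < ULTRA_RARE_MAX_AF:
        return "ultra-rare"
    return "rare"


def summarize(variants):
    """Return the share of variants in each class.

    `variants` is an iterable of (alt_allele_count, total_alleles) pairs.
    """
    labels = ("ultra-rare", "rare", "common")
    counts = Counter(frequency_class(ac, an) for ac, an in variants)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in labels}
    return {label: counts[label] / total for label in labels}


if __name__ == "__main__":
    # Toy data only: 5,750 diploid samples give 11,500 alleles per site.
    toy = [(1, 11500), (2, 11500), (50, 11500), (1200, 11500)]
    print(summarize(toy))
```

In practice the allele counts would come from the joint-called variant file rather than a hand-written list, and the choice of thresholds would change the reported shares.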

A substantial proportion of the discovered variants has been functionally annotated. Although the majority of SNVs and INDELs lie within non-coding regions, more than 1.4 million variants were identified as potentially functional. These include missense mutations, frameshift variants, splice site alterations, and changes within untranslated regions, each of which may affect gene expression and phenotypic outcomes. Notably, novel missense variants in the LDLR (Low-Density Lipoprotein Receptor) gene, associated with familial hypercholesterolemia, were identified, underscoring their clinical significance within the Indian context.

The genetic data also provides a unique perspective on India’s population history and linguistic diversity, capturing groups from the four major linguistic families: Indo-European, Dravidian, Austro-Asiatic, and Tibeto-Burman. This unprecedented genomic diversity contributes to a deeper understanding of the population dynamics and history within the Indian subcontinent.

Private genetic and genomic laboratories in India play a crucial role in this effort. These laboratories typically offer a range of services, including molecular and cytogenetic analysis, single-gene mutation testing, multi-gene panel testing, targeted or whole-exome sequencing, and whole-genome sequencing. Some laboratories also specialize in areas such as fetal medicine, prenatal diagnostics, rare genetic disorders, and familial cancer. While many of these laboratories have state-of-the-art equipment and technical facilities, only a limited number have skilled bioinformatics support for managing and interpreting sequencing data.

A critical challenge faced by these laboratories is the lack of standardized protocols for interpreting variants of unknown significance (VUS). Such protocols are essential for determining whether a variant is likely to be pathogenic. Moreover, certain pre-existing “founder mutations” prevalent in India’s diverse populations are often misclassified as VUS, depriving affected individuals of appropriate clinical interventions.
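One way a shared, population-aware database could help with the founder-mutation problem is by flagging VUS that are unusually frequent in a specific subpopulation so that they receive expert review. The Python sketch below illustrates that idea only. The record fields, the thresholds, and the function name `founder_candidates` are hypothetical, and the heuristic is a triage aid under those assumptions, not a method for deciding pathogenicity.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class VusRecord:
    """Hypothetical record for a variant currently labelled VUS."""
    variant_id: str
    global_af: float   # allele frequency in a global reference database
    subpop_af: float   # allele frequency in one Indian subpopulation


def founder_candidates(
    records: Iterable[VusRecord],
    min_subpop_af: float = 0.01,    # assumed threshold, for illustration
    min_enrichment: float = 10.0,   # assumed threshold, for illustration
) -> List[str]:
    """Return IDs of VUS enriched in the subpopulation versus global data."""
    flagged = []
    for record in records:
        if record.subpop_af < min_subpop_af:
            continue  # too rare locally to suggest a founder effect
        if record.global_af == 0:
            enrichment = float("inf")  # absent from the global reference
        else:
            enrichment = record.subpop_af / record.global_af
        if enrichment >= min_enrichment:
            flagged.append(record.variant_id)
    return flagged


if __name__ == "__main__":
    # Toy records with made-up identifiers and frequencies.
    toy = [
        VusRecord("var_A", global_af=0.0, subpop_af=0.02),
        VusRecord("var_B", global_af=0.01, subpop_af=0.012),
        VusRecord("var_C", global_af=0.0001, subpop_af=0.005),
        VusRecord("var_D", global_af=0.001, subpop_af=0.03),
    ]
    print(founder_candidates(toy))
```

A flagged variant would still need segregation data, functional evidence, and clinical correlation before any reclassification.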

To address some of these challenges, data sharing should be made mandatory for all commercial and non-commercial laboratories. Standardized annotation, consistent with international nomenclature and classification, is an important step towards improving data consistency and reliability. The shared genomic database should be accessible to all, regardless of geographical location, to foster inclusivity.

Many profit-driven genome laboratories in India and other low- and middle-income countries are affiliated with multinational organizations or overseas laboratories. Their reference datasets are often based on individuals of European descent or on small, non-representative minority groups, which can lead to flawed interpretations of genomic variants. The problem is particularly critical for indigenous populations, who risk being marginalized in genomic healthcare.

To address this, international collaboration, led by organizations such as the World Health Organization (WHO) and the Global Genome Consortium, is essential. These efforts could provide technical guidance, encourage data sharing, and help establish centralized genome databases. Such initiatives would not only advance precision medicine but also yield significant socio-economic benefits by making genomic healthcare more equitable and accessible.

Mandatory data sharing through centralized data centres, robust regulation, and international cooperation are all imperative to integrate genomic medicine effectively into healthcare systems in India while addressing global health disparities.

Author’s Biography

Sombuddha Roy Bhowmick is a driven professional with a strong foundation in data analysis and bioinformatics. He has experience developing innovative solutions across diverse fields, including genomics, machine learning, and clinical data management. 
