In recent years, artificial intelligence (AI) has been transforming many industries, and healthcare is no exception. The drug discovery process, while crucial for developing life-saving treatments, is notoriously time-consuming and expensive. As a result, the industry has spent decades exploring techniques to make drug discovery more efficient and effective. Now, AI-enabled tools are showing great potential to accelerate the process, leading to faster, less costly, and potentially more effective treatments for a wide range of diseases.

However, the development and application of AI in drug discovery are hindered by a critical challenge: limited access to high-quality datasets. AI's effectiveness hinges on the data it is trained on, and because such data is often scarce, building robust models for drug discovery remains difficult.

Imagine trying to teach a kid about animals when your only pictures are of a cat, a dog, and a goldfish. AI-driven drug discovery is a bit like that: we need a zoo of diverse, high-quality datasets to train our algorithms effectively. The problem is that important human and medical data is usually locked away within different health systems, medical practices, and research organizations, where it is difficult for outside researchers applying AI to drug discovery to reach.

Equally, conventional drug discovery processes generate vast amounts of data, including molecular structures, biological activity data, and clinical trial results. However, this data is often siloed within pharmaceutical companies and academic institutions, making it difficult for researchers to access and use. This lack of data sharing is a significant barrier to developing robust AI models.

AI algorithms rely heavily on diverse and comprehensive datasets for training and validation. In drug discovery, obtaining access to high-quality datasets that span various biological contexts, disease types, molecular profiles and geographic locations is crucial for training accurate AI models.

The inadequate availability of such relevant datasets can hinder the development and refinement of robust models that can effectively predict potential drug candidates.

To address this challenge, developers applying AI to drug discovery are implementing several strategies. However, it is important for stakeholders across all facets of healthcare to take an interest and contribute to solving it.

Because these models shape therapies that may be used by people around the world, every helpful strategy should be employed to expand access to high-quality datasets and pave the way for more efficient and effective AI-driven drug discovery. Strategies worth considering and expanding include:

Collaborative Partnerships

One effective way to access high-quality datasets is through collaborations and partnerships among academic institutions, pharmaceutical companies, and research organizations. Players in AI for drug discovery need to foster more of these collaborations to improve access to diverse, high-quality data.

By forging alliances, we can create data collectives that pool resources, knowledge, and precious datasets. Researchers gain access to existing datasets that have been curated and validated by experts in the field.

Collaborative initiatives can break down data silos and provide access to a more diverse and expansive set of contextual data for AI analyses.

Open Data Initiatives

One of the major ways to address limited access to datasets is for more stakeholders in healthcare and the scientific community to promote data sharing initiatives. Open data platforms and repositories can be established where researchers can contribute and access high-quality datasets.

Encouraging open data sharing practices among pharmaceutical companies, academic institutions, and research organizations is crucial. This can be achieved through initiatives such as public data repositories, data sharing agreements, and incentives for data sharing.

These platforms can encourage data sharing by providing proper attribution and recognition to the data contributors. Additionally, data sharing agreements and protocols can be developed to ensure data privacy and security.

We need to advocate for and contribute to open data initiatives in the field of drug discovery. Data sharing through open platforms and repositories promotes transparency and facilitates broader access to datasets.

Platforms like the Collaborative Drug Discovery (CDD) Vault and the Cheminformatics OLAP data warehouse provide frameworks for sharing pre-competitive data.

This would be akin to a world where drug discovery data flows freely, like a digital stream of knowledge from which everyone can draw for their own applications.

Incentivize Data Sharing

Everybody loves a good incentive. We need to create incentives for pharmaceutical companies and research institutions to share data by dangling a carrot or two.

Recognition in papers, collaborative opportunities, and financial remuneration are all ways to reward organizations and individuals that contribute valuable datasets, and so encourage data sharing for the advancement of AI-driven drug discovery.

Standardization of Data Formats and Metadata

Ever tried to read a book with jumbled-up pages? That’s how AI feels with non-standardized data formats.

We can straighten things out by implementing standardized formats and metadata for drug discovery datasets. Data standardization and harmonization efforts are essential to ensure the interoperability and reusability of data across different sources.

This involves developing standardized formats for data representation, nomenclature, and metadata.

Standardized data facilitates sharing, integration, and analysis, and makes it easier for researchers to combine data from diverse sources to build more comprehensive AI models. This not only lets us mix and match datasets with less work but also makes the data easier to read.

Consistent data structures enhance interoperability and streamline data preprocessing, which allows for more efficient use in AI models.
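
As a minimal sketch of what harmonization can look like in practice, the Python snippet below maps records from two hypothetical sources onto one shared schema. The field names (compound_id, ic50_nm), the source layouts, and the values are all made up for illustration; a real project would follow whatever standard the collaborating groups agree on.

```python
# Conversion factors to nanomolar (nM), the unit assumed for the shared schema.
TO_NANOMOLAR = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def standardize(record, field_map):
    """Rename source-specific fields and convert potency values to nM."""
    renamed = {field_map.get(key, key): value for key, value in record.items()}
    unit = renamed.pop("unit", "nM")
    return {
        "compound_id": str(renamed["compound_id"]).strip().upper(),
        "ic50_nm": float(renamed["ic50"]) * TO_NANOMOLAR[unit],
    }


# Two made-up sources that describe the same kind of measurement differently.
lab_a = {"CompoundID": " cpd-001 ", "IC50": "2.5", "Unit": "uM"}
lab_b = {"id": "CPD-002", "ic50_value": 40, "units": "nM"}

# Hypothetical per-source mappings onto the shared field names.
MAP_A = {"CompoundID": "compound_id", "IC50": "ic50", "Unit": "unit"}
MAP_B = {"id": "compound_id", "ic50_value": "ic50", "units": "unit"}

harmonized = [standardize(lab_a, MAP_A), standardize(lab_b, MAP_B)]
```

Once both records share the same field names and units, they can be combined or compared without source-specific handling downstream.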

Foster Public-Private Partnerships

Collaboration between public and private sectors can accelerate the development and sharing of high-quality datasets. Public-private partnerships can facilitate data sharing agreements, joint research projects, and the development of open data platforms.

These collaborations can bridge the gap between academic research and industry applications.

Investing in Data Generation, Annotation and Curation

In cases where high-quality datasets are scarce, developers can be supported in generating their own datasets through experiments and clinical trials. However, this can be costly for many researchers.

This approach requires careful planning and adherence to ethical guidelines. Once the data is generated, it is crucial to annotate and label the data accurately to enable AI algorithms to learn from it effectively. Annotation can be done manually or through the use of AI-assisted annotation tools.

Data annotation and curation play a critical role in ensuring the quality and usability of data for AI applications. This involves labeling, classifying, and enriching data to make it machine-readable and interpretable.

Investing in data annotation and curation efforts can significantly improve the quality of data available for AI-driven drug discovery.

Address Privacy and Ethical Concerns

When sharing and utilizing data for AI-driven drug discovery, it is crucial to address privacy and ethical concerns. Data sharing agreements should clearly outline data ownership, access rights, and data usage restrictions.

Robust data governance frameworks should be implemented to protect sensitive patient information and ensure that data is used responsibly and ethically.

By implementing these strategies, we encourage data sharing and expand access to high-quality datasets.

Data Augmentation and Synthetic Data Generation Techniques

Data augmentation techniques can be employed to expand the size and diversity of existing datasets. These techniques involve applying transformations or modifications to the original data to create new samples.

For example, in image-based drug discovery, data augmentation can involve flipping, rotating, or scaling the images. This approach helps in training AI models on larger and more varied datasets, improving their performance and generalization capabilities.
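
The flips and rotations mentioned above can be sketched in a few lines of plain Python. This is only an illustration on a tiny made-up grid of pixel values; scaling is left out, and the function names are assumptions rather than part of any particular library. Whether a flipped or rotated image still deserves the same label depends on the assay, so that should be checked before using such variants for training.

```python
def flip_horizontal(image):
    """Mirror each row of a 2D grid of pixel values."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate a 2D grid of pixel values a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original image plus one flipped and three rotated variants."""
    variants = [image, flip_horizontal(image)]
    rotated = image
    for _ in range(3):
        rotated = rotate_90_clockwise(rotated)
        variants.append(rotated)
    return variants


# A made-up 2x2 "image"; real inputs would be much larger.
image = [[1, 2], [3, 4]]
augmented = augment(image)
```

Here one image becomes five training samples.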

Synthetic data generation techniques can augment existing datasets, help overcome data scarcity, and produce large datasets at lower cost.

These techniques use algorithms to generate realistic data that mimics the properties of real-world data while respecting privacy and ethical considerations.

Synthetic data can be particularly useful in scenarios where collecting real-world data is expensive, time-consuming, or ethically challenging. This approach can supplement existing datasets and provide a larger pool of diverse examples for AI model training.
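
One very simple way to picture synthetic data generation is to fit a distribution to each feature of a small real dataset and then sample new rows from it. The sketch below does this with independent Gaussians using only the Python standard library. The numbers are invented, the function names are hypothetical, and the independence assumption ignores correlations between features, so practical generators are considerably more sophisticated. Sampling from fitted statistics also does not by itself guarantee privacy.

```python
import random
import statistics


def fit_gaussians(rows):
    """Estimate a (mean, standard deviation) pair for each column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, treating each feature as an independent Gaussian."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]


# Made-up rows with two numeric features per compound.
real_rows = [
    [320.4, 2.1],
    [298.3, 1.7],
    [410.5, 3.4],
    [275.2, 0.9],
]

params = fit_gaussians(real_rows)
synthetic_rows = sample_synthetic(params, n=100)
```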

Tap into Public Databases

While proprietary and private datasets can be a source of important data that is sometimes hard to find anywhere else, publicly available databases and repositories for drug discovery data can equally be a lifesaver.

Platforms like ChemBank, PubChem, ChEMBL, ZINC and DrugBank offer extensive datasets that can serve as valuable resources for AI model development and have made significant contributions to open public data access. Integrating these public datasets with proprietary information can enhance the size and quality of accessible datasets.
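
To make the idea of integrating public and proprietary data concrete, here is a small sketch that merges two sets of records on a shared compound key. The records, the compound_key field, and the values are placeholders invented for this example; no real database API is called. In practice, the key would typically be a standard identifier such as an InChIKey, and the public rows would come from a downloaded export.

```python
def merge_by_key(public_rows, private_rows, key="compound_key"):
    """Combine records from two sources, tracking where each compound came from."""
    merged = {row[key]: dict(row, source="public") for row in public_rows}
    for row in private_rows:
        if row[key] in merged:
            # Same compound in both sources: add the private fields to the public ones.
            merged[row[key]].update(row)
            merged[row[key]]["source"] = "both"
        else:
            merged[row[key]] = dict(row, source="private")
    return list(merged.values())


# Placeholder records; the keys and values are not real data.
public_rows = [
    {"compound_key": "KEY-A", "logp": 1.2},
    {"compound_key": "KEY-B", "logp": 3.4},
]
private_rows = [
    {"compound_key": "KEY-B", "assay_result": 0.8},
    {"compound_key": "KEY-C", "assay_result": 0.1},
]

combined = merge_by_key(public_rows, private_rows)
```

The result has one row per compound, with overlapping compounds carrying fields from both sources.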

Utilize Transfer Learning and Pretrained Models

Transfer learning is a technique that allows researchers to leverage knowledge from models already trained on large datasets in other, related domains. It is like picking up a new skill by building on one you have already mastered.

Models pretrained on similar datasets can be fine-tuned for drug discovery applications, even when direct access to large, proprietary datasets is limited. By fine-tuning these pretrained models on smaller drug discovery datasets, researchers can work around limited data availability.

Transfer learning enables the transfer of learned features and patterns from one domain to another, improving the performance of AI models even with limited data. This approach maximizes the use of available data.
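
The toy sketch below illustrates the core mechanic under heavy simplification: a "pretrained" layer is kept frozen, and only a small new output layer is trained on a handful of examples. The weights and the four training samples are invented stand-ins, and all names are hypothetical. Real projects would use a deep learning framework and an actual pretrained model.

```python
import math

# Stand-in for weights learned on a large source dataset (made-up numbers).
PRETRAINED_WEIGHTS = [[0.9, -0.4], [-0.3, 0.8]]


def extract_features(x):
    """Frozen layer: these weights are never updated during fine-tuning."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in PRETRAINED_WEIGHTS]


def predict(head, x):
    """Apply the trainable logistic head on top of the frozen features."""
    features = extract_features(x)
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(samples, epochs=500, lr=0.5):
    """Train only the new head with gradient descent on the small target dataset."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    for _ in range(epochs):
        for x, y in samples:
            error = predict(head, x) - y  # gradient of the log-loss w.r.t. z
            features = extract_features(x)
            head["weights"] = [w - lr * error * f
                               for w, f in zip(head["weights"], features)]
            head["bias"] -= lr * error
    return head


# Tiny made-up target dataset: (input features, binary label).
small_dataset = [([1.0, 0.0], 1), ([0.8, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
head = fine_tune(small_dataset)
```

Because only the head's three parameters are learned, a few labelled examples are enough here; the frozen layer supplies the representation.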

Participation in Collaborative Challenges

Researchers from diverse disciplines can engage in collaborative challenges and competitions that focus on AI-driven drug discovery. Initiatives like the DREAM Challenges and Kaggle competitions often provide access to benchmark datasets and facilitate knowledge exchange between workers from different backgrounds within the research community.

Final Thoughts: Charting a Course for Progress

In drug discovery, AI has emerged as a powerful tool, with the promise of further accelerating the identification of novel therapeutic candidates. However, one critical bottleneck that often impedes progress is the limited access to diverse and high-quality datasets.

AI-driven drug discovery needs more diverse and comprehensive datasets that mirror real-world biological and chemical complexities. Addressing the challenge of limited access to such datasets requires a collective and strategic effort.

The central point of this discussion is that more stakeholders across the healthcare continuum need to recognize that improving access to high-quality datasets is a shared responsibility.

In addition to other issues that could have undesirable results in healthcare, limited access to diverse datasets can lead to bias in AI models and hinder their generalization capabilities. It is important to address bias by actively seeking diverse datasets that represent different populations, diseases, and experimental conditions.

By fostering collaboration, promoting open data initiatives, and exploring innovative approaches, the research community can, with some creativity, unlock the potential of AI to further revolutionize drug discovery.

Together, we can break through these data barriers and accelerate the pace of innovation in AI-powered drug discovery. Overall, this will bring us closer to transformative breakthroughs in healthcare and ultimately lead to faster, more effective treatments for patients worldwide.

