The best of both worlds: building R / Python pipelines for biomedical LLM semantic search apps
Are hallucinations a fundamental limitation of large language models (LLMs)? At the CELEHS laboratory, we are curious about the answer, but in the meantime we often use LLM-based embeddings instead of generative artificial intelligence (AI) to help us in our analyses. In clinical and biomedical settings, a single hallucination may be enough to trigger a catastrophic failure, which is why we are particularly interested in LLM-based embeddings such as BGE and BERT: they enable us to compute cosine similarities between a query and elements of our databases, ensuring our results do not suffer from hallucinations. One example of how we use them is to suggest variables of interest for a specific study: for instance, if a clinician studies bipolar disorder, we would suggest looking at the use of lithium, a commonly prescribed medication.
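To make the idea concrete, here is a minimal sketch (not our production code) of embedding-based suggestion by cosine similarity; it assumes the sentence-transformers package and the BAAI/bge-base-en-v1.5 checkpoint, and the concept list is purely illustrative.

```python
# Minimal sketch: rank candidate concepts against a clinician's query by
# cosine similarity of their embeddings. Assumes the sentence-transformers
# package and the BAAI/bge-base-en-v1.5 checkpoint; concepts are illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

query = "bipolar disorder"
concepts = ["lithium", "valproate", "insulin", "amoxicillin"]

# With normalized embeddings, the dot product equals the cosine similarity.
query_emb = model.encode([query], normalize_embeddings=True)
concept_emb = model.encode(concepts, normalize_embeddings=True)
scores = (concept_emb @ query_emb.T).ravel()

for concept, score in sorted(zip(concepts, scores), key=lambda x: -x[1]):
    print(f"{concept}\t{score:.3f}")
```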
As the number of embedding models increases, we need methods to compare their clinical usefulness. In our case, we found that BGE embeddings performed very satisfactorily but showed some limitations, e.g. relating schizophrenia to leukemia and bacteria. Another promising model for clinical use is SAPBERT, which, however, focuses more on identifying synonyms, while BGE focuses more on "relatedness" (such as the relationship between a disease and a medication). The question that thus arises is: how can we efficiently compare BGE and SAPBERT? We have clinician-curated databases that compile "known pairs", i.e. two concepts that are related. These known pairs can be synonyms, parent-child relationships (e.g. bipolar disorder "is a" mental health disorder), or relatedness (e.g. lithium "may treat" bipolar disorder). They can serve as "true positives" with which we can compute area under the receiver operating characteristic curve (AUC) and positive predictive value (PPV) performance scores. However, depending on the clinical study, we might in one case prefer to find synonyms and in another related terms.
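As an illustration of the evaluation side, here is a minimal base-R sketch of how known pairs can act as true positives: cosine similarities of known and random pairs are compared through AUC, and PPV is computed among the top-ranked pairs. The scores and labels below are made up for illustration.

```r
# Minimal sketch (illustrative data): evaluate one embedding model against
# clinician-curated known pairs. `score` is the cosine similarity of a concept
# pair; `label` is 1 for a known pair and 0 for a random, presumably unrelated pair.
pairs <- data.frame(
  score = c(0.91, 0.84, 0.78, 0.70, 0.65, 0.55, 0.30, 0.20),
  label = c(1,    1,    1,    0,    1,    0,    0,    0)
)

# AUC via the rank (Wilcoxon) formulation: the probability that a known pair
# is scored higher than a random pair.
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# PPV among the k highest-scoring pairs.
ppv_at_k <- function(score, label, k) mean(label[order(-score)][seq_len(k)])

cat("AUC:", auc(pairs$score, pairs$label), "\n")
cat("PPV@4:", ppv_at_k(pairs$score, pairs$label, 4), "\n")
```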
For these reasons, we have built R / Python pipelines for biomedical LLM semantic search apps, leveraging Python's GPU indexing capabilities while using R for data management and evaluation. While some R packages exist to leverage GPU capabilities, PyTorch is by far the more widely used tool for these purposes, which brings with it greater stability and more optimizations. On the other hand, R is well known to be very efficient for data management, evaluation, and visualization, thanks in part to its large network of collaborating researchers. How should one go about building reproducible and trustworthy pipelines incorporating the best of both worlds? My answer is well-designed pipelines with Docker and Makefile.
In this talk I will present my design approaches for building such pipelines. While R and Python both have capabilities to natively interact with each other, building Docker images that can run both an R/Shiny server and PyTorch with CUDA can prove challenging. I therefore chose to isolate the two environments, each relying on an independent Docker image, with an Elasticsearch database as the intermediary. The evaluation database (i.e. the known pairs acting as true positives) is first built and written using R; Python then reads the output and indexes it in the Elasticsearch database; finally, R is spun up again to read the indexed embeddings from Elasticsearch and evaluate the performance. Docker Compose lets the R and Python environments communicate seamlessly with the Elasticsearch container, and a Makefile sequences the components. With this approach, we can easily update any component when needed, as well as index the database on a GPU-capable machine before exporting the indexed database to a non-GPU online server such as Amazon Web Services.
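As a rough illustration of this layout (service names, image tag, and the GPU reservation are assumptions for the sketch, not our exact configuration), the Docker Compose file could look like the following, with the Makefile then calling `docker compose run` on each service in order:

```yaml
# Illustrative docker-compose.yml sketch: two isolated environments plus
# Elasticsearch as the shared intermediary. Names and versions are assumptions.
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  r-env:                 # R: builds the known-pairs database, evaluates, serves Shiny
    build: ./r
    depends_on:
      - elasticsearch

  py-gpu:                # Python: PyTorch embedding and Elasticsearch indexing
    build: ./python
    depends_on:
      - elasticsearch
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```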
As the number of embedding models increases, we will probably soon need to distinguish the best models for specific tasks, such as finding synonyms or related pairs, and these case-by-case comparative studies can be efficiently implemented with R's capabilities, while Python performs the GPU-related computations without our needing to reimplement all the optimizations in R. To me, the future of open-source AI and applied science will leverage a broad spectrum of capabilities, and a core question will be how to make these different environments communicate seamlessly while keeping them robust, isolated, and reproducible. In this talk I would like to showcase my own design approaches to such challenges.