
The Hidden Costs of AI Development in Pathology and How Concentriq Embeddings Helps Life Sciences Organizations Mitigate Them

By Proscia AI R&D Team | October 1, 2024

Executive Summary

Building computational pathology AI models that are impactful for therapeutic drug discovery and development requires life sciences organizations to navigate complex infrastructure, machine learning operations challenges, and extensive data processing—often diverting focus from core research activities. Foundation models—which provide an ideal starting point—offer an opportunity to overcome many of these challenges to accelerate AI development, but operationalizing their use to impact R&D has remained difficult.

Proscia’s Concentriq® platform and its Concentriq Embeddings solution provide life sciences organizations with a streamlined way to generate high-quality whole slide image (WSI) embeddings using leading vision and vision-language foundation models. This allows data scientists to focus on the more scientific and innovative aspects of their AI model development programs, rather than being bogged down by technical tooling and infrastructure hurdles. This article explores the key challenges of pathology AI development in therapeutic R&D, the associated costs, and how Concentriq Embeddings alleviates these issues to unlock real savings.

Challenges of Pathology AI Development for Therapeutic R&D

AI and deep learning are revolutionizing precision medicine, particularly in computational pathology. Life sciences organizations are building advanced data science teams to develop AI models that improve R&D capabilities, from early-stage research through clinical trials. These innovations enable the discovery of novel biomarkers, the identification of patients likely to respond to treatment for optimized clinical trial design, and the assessment of how treatments affect particular tissues, aiding endpoint evaluation in both preclinical and clinical trials, among other applications.

However, operational barriers continue to impede progress in computational pathology AI development, leaving significant opportunities to drive AI-enabled discovery and development untapped. These challenges include:

  1. Massive Memory Footprint: WSIs are scanned at high resolution and are therefore highly memory intensive. Managing these data at the scale required to generate AI models is challenging, leading to slowdowns in both development and inference. Loading more than one slide into graphics processing unit random-access memory (GPU-RAM) at once on commercial hardware is often impossible, and tracking intermediate data products is cumbersome; a single slide, for example, can yield hundreds of thousands of image patches, each of which must be tracked.
  2. WSI Format Inconsistencies: Different file formats and scanner metadata schemes require significant effort to standardize for AI development, making it hard for organizations to work with data scanned on instruments from multiple vendors. Even opening WSI file formats often requires either open-source libraries or proprietary software development kits (SDKs), which are often inconsistently maintained.
  3. Infrastructure Demands: Building, managing, and maintaining the necessary infrastructure for computational pathology AI development, whether on-premises hardware or in the cloud, requires ongoing investment and specialized expertise. A deficit in these resources often slows down research and development, leaving scientists to plug the gap as best they can and resulting in substantially diminished output. Furthermore, transferring WSIs across network connections is time-intensive, and mitigating these challenges requires the time of highly skilled engineers.
  4. Model Pipeline Management: The rapidly evolving landscape of foundation models demands frequent updates to build, manage, and continuously maintain AI pipelines, further diverting resources from innovative development work.
  5. Enterprise Pathology Platform: Storing large WSI datasets, including both the images and the final outputs of analysis pipelines, for use with foundation models requires building, managing, and maintaining an enterprise pathology platform to ensure proper access controls, meet compliance requirements, and allow scientists to analyze images and model products with advanced tools and applications. This requires a substantial investment even when integrating with an existing platform.

Now, we explore these challenges in greater detail before addressing their cost implications.

Massive Memory Footprint

Recent AI advancements underscore the importance of large datasets for developing effective AI models. Today’s pathology datasets often contain hundreds of thousands to millions of WSIs. Unfortunately, local computing resources built for earlier, smaller datasets can’t keep up with the throughput required for these larger, modern datasets.

WSIs, scanned at 20X-40X magnification, contain billions of pixels, requiring gigabytes of RAM to view a single slide. Loading multiple slides into GPU-RAM, or even central processing unit (CPU) RAM, becomes impossible without breaking slides into smaller tiled crops. While deep learning model development benefits from mixing visual signals across slides in a single batched operation, the substantial memory requirements restrict this capability, forcing scientists to save intermediate data products (e.g., small image patches cropped from a high-resolution WSI).

However, doing so comes with another set of challenges. The storage footprint of a high-resolution patched dataset is significantly larger than that of the original WSI dataset because the visual information is stored twice: once in the slide file and once in the patches. Plus, a single slide can result in hundreds of thousands of patches, requiring complicated bookkeeping to keep track of these intermediate data products.
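
To make the tiling and bookkeeping burden concrete, the sketch below tiles a single WSI into fixed-size patches with OpenSlide. The tile size, level selection, and output layout are illustrative assumptions rather than a recommended pipeline; the point is simply that one slide fans out into a very large number of files that must all be tracked.

```python
# Illustrative sketch: tiling one WSI into patches with OpenSlide.
# Tile size, level choice, and output layout are arbitrary assumptions.
import os
import openslide
from openslide.deepzoom import DeepZoomGenerator

TILE_SIZE = 256  # pixels per tile edge (assumption)

def tile_slide(wsi_path: str, out_dir: str) -> int:
    """Extract tiles from the highest-resolution level and save them to disk."""
    slide = openslide.OpenSlide(wsi_path)
    tiles = DeepZoomGenerator(slide, tile_size=TILE_SIZE, overlap=0)
    level = tiles.level_count - 1          # full-resolution level
    cols, rows = tiles.level_tiles[level]
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for row in range(rows):
        for col in range(cols):
            tile = tiles.get_tile(level, (col, row))  # PIL image
            tile.save(os.path.join(out_dir, f"tile_{col}_{row}.png"))
            count += 1
    return count

# A 100,000 x 80,000 pixel slide yields roughly 125,000 tiles at 256 px,
# each of which must be stored and tracked alongside the original WSI.
```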

WSI Format Inconsistencies

Developing and validating AI for pathology requires handling numerous WSI and metadata file formats across scanner vendors, leading to inconsistencies. Some of these issues are alleviated by libraries such as the popular OpenSlide for reading WSI files. However, OpenSlide is typically used through its Python bindings, and OpenSlide Python is an open-source library maintained by community contributors, which means production-level stability is hard to guarantee. Furthermore, engineers often need to maintain multiple versions of these libraries to work with images from multiple datasets, since vendors frequently introduce changes that break OpenSlide functionality.

Processing slides scanned by certain vendors can also require working with proprietary SDKs. For example, working with WSIs scanned by Philips devices requires the Philips SDK, and changes to their file format necessitate specific SDK versions for different slides. This can result in the need to maintain multiple versions of the SDK to successfully access slides.

The lack of metadata standardization across scanner vendors exacerbates this challenge. Even after successfully opening these WSIs, analysis pipelines tailored to one metadata scheme typically do not translate seamlessly to another. Unlike radiology, which follows the DICOM standard, digital pathology lacks widespread adoption of uniform image file standards.
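
As a small illustration, even reading something as basic as the objective magnification ends up vendor-aware: the same value can appear under different property keys depending on the scanner. The keys below are examples and may vary by scanner model and OpenSlide version.

```python
# Illustrative sketch: the same metadata field (objective magnification) can
# live under different, vendor-specific keys. Key names are examples only.
import openslide

def objective_power(wsi_path: str) -> float | None:
    props = openslide.OpenSlide(wsi_path).properties
    candidate_keys = [
        "openslide.objective-power",  # OpenSlide's standardized key, when present
        "aperio.AppMag",              # Aperio/Leica SVS files
        "hamamatsu.SourceLens",       # Hamamatsu NDPI files
    ]
    for key in candidate_keys:
        if key in props:
            return float(props[key])
    return None  # stored under yet another scheme, or missing entirely
```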

Infrastructure Demands

Establishing and maintaining infrastructure for AI development, whether through on-premises hardware or in the cloud, requires a significant upfront investment in high-performance computing hardware and skilled personnel. This infrastructure must be capable of handling the processing demands of computational pathology, including the storage and analysis of large-scale WSI datasets, with cloud setups requiring GPU-enabled nodes and clusters that can autoscale based on workload demands.

Ensuring the scalability and reliability of these systems demands ongoing updates, troubleshooting, and performance optimization. In many cases, organizations lack the necessary personnel and expertise to handle these tasks, leading to delays in R&D progress, and a diversion of resources as scientists are left to manage these technical challenges. 

Additionally, datasets with especially large WSIs can take weeks to transfer on enterprise network connections. For example, transferring 1,000 slides at 3 GB each over a 100 Mbps network connection takes more than 66 hours. When compute and storage are separated over a network connection, the data transfer speed can become the bottleneck for the entire system. Handling these data transfers efficiently requires the involvement of highly skilled developers to optimize data pipelines. This adds another layer of complexity and resource requirements, as the expertise needed to streamline these processes is often in short supply. As a result, data scientists face significant challenges in maintaining smooth and efficient data transfer, further impacting the overall progress of implementing AI in therapeutic R&D.
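
That estimate is simple to reproduce; the snippet below just restates the arithmetic and ignores protocol overhead and retries, which only make matters worse.

```python
# Back-of-the-envelope transfer time for the example above.
num_slides = 1_000
slide_size_gb = 3          # gigabytes per slide
link_speed_mbps = 100      # megabits per second

total_bits = num_slides * slide_size_gb * 8e9     # GB -> bits
hours = total_bits / (link_speed_mbps * 1e6) / 3600
print(f"{hours:.1f} hours")                       # ~66.7 hours, before any overhead
```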

Compounding the issue, when research teams face data transfer challenges, it often leads to multiple copies of the same data being stored in different locations, further driving up storage costs. Even under optimal conditions, running inference with foundation models typically involves duplicating visual data by caching intermediate products like thumbnails and image tiles. As explained above, this duplication inflates the dataset size because pixels are stored both in the original WSI and in the intermediate image files. Depending on the compression of the WSI format and the extent of the visual information duplicated, the storage footprint can grow by an additional 0.5x to 4x. Additionally, without regular maintenance, datasets may remain stored beyond their usefulness, further inflating costs for AI development teams.

Model Pipeline Management

Building and maintaining AI pipelines that work across multiple slide formats and foundation models is a constant challenge. Beyond what’s noted in the sections above, issues with distinct vendor file formats and metadata schemes can trickle past initial pre-processing steps and into model pipelines, requiring further development and maintenance of custom AI pipelines to ensure compatibility. This involves substantial effort in coding, testing, and validating these pipelines to handle the diverse requirements of various vendors.

Furthermore, the rapid pace of advancements in foundation models presents an immense opportunity to capitalize on the efficiencies they have to offer; however, it traditionally necessitates constant adaptation. As new foundation models emerge, AI pipelines require frequent updates to ensure compatibility, and integrating these models into existing workflows takes considerable time and expertise. Without continuous attention to these pipelines, ensuring proper interaction with required model inputs and outputs, organizations risk introducing errors, further delaying research.

Enterprise Pathology Platform

Most fundamentally, developing AI at scale requires an environment that effectively warehouses the organization’s image data, implements access controls, and stores final outputs from AI analysis pipelines alongside the images. While some organizations build their own platforms to serve this need, others integrate third-party solutions. Either way, substantial resources are needed to ensure the platform solution can handle large datasets, efficiently retrieve images, and integrate with AI pipelines. Plus, smooth image analysis workflows and efficient image handling across each stage of R&D are crucial for minimizing data processing delays and building an enriched digital data foundation for data science and AI teams to leverage for model development.

Maintaining the platform requires regular updates, system optimization, and security patches to ensure reliability and compliance with evolving regulatory standards. Dedicated IT personnel with specialized skills in managing large-scale data infrastructure are essential to maintaining the platform’s performance, further adding to an organization’s operational costs. These ongoing efforts ensure that the platform remains capable of supporting cutting-edge research and development activities in pathology, providing a solid foundation for data and AI-driven insights. 

Costs Associated with Pathology AI Development Challenges

The challenges of managing memory footprints, image formats, infrastructure, model pipelines, and the data warehousing environment all introduce significant, often hidden, costs. Below, we explore these cost implications in more detail.

Infrastructure Costs

Deep learning AI development in pathology demands high-memory computational components called GPUs. Workstations with GPUs can either be purchased and built for a local group, or they can be rented from a cloud provider. 

Building a minimal enterprise workstation costs around $50,000, because 4-8 GPUs are often needed to support an AI team of 5 engineers. Renting space in a data center is also costly, approximately $2,500 per year in Philadelphia, for example. A dedicated GPU workstation incurs electrical costs even when the GPUs are idle. An on-premises workstation also has a hard cap on the amount of data it can process per unit time, meaning it cannot scale without purchasing, installing, and maintaining more GPUs.

Alternatively, some organizations may be cloud-based. While this comes with many benefits, cloud computing costs can escalate quickly. A small mistake in workload management can cause huge inefficiencies in cloud computing uptime and therefore directly impact costs. Furthermore, while cheaper cloud compute can be achieved with more development investment and solid pipeline design, compute nodes well-suited for foundation model inference range from $1 to $15 per hour, causing costs to accumulate quickly. Costs for machines well-suited for foundation model training are even higher.

Personnel Costs

Optimizing code to handle high-resolution WSIs, managing data transfers, and maintaining model pipelines require specialized personnel. The demand for engineering resources can actually grow when foundation model usage and development must be supported. Here, we examine these engineering costs in more detail.

First, managing the massive memory footprint of WSIs incurs significant engineering costs. Code must be optimized to handle high-resolution WSIs, requiring advanced knowledge and time investment. Engineers need to implement efficient data handling and memory management techniques to process large datasets without overwhelming system resources.

Maintaining and updating custom scripts to manage intermediate data products, such as image patches, also demands continuous engineering attention. Data scientists often find themselves spending a large amount of time simply moving data around rather than focusing on true model development and testing activities. To keep projects on track, they must ensure fast data transfer rates and minimize network bottlenecks, a task that often requires the support of engineers to optimize data pipelines and storage solutions. Often, data transfer issues mean that engineers spend cycles checking on processes that should ideally succeed without a human in the loop but often do not (e.g., bulk WSI download). All of these factors contribute to wasted resources and burnout.

Second, the variability in image file formats and the frequent updates needed to process new and existing datasets require continuous adjustments, often to both the pathology platform and the image processing and AI pipelines, to maintain compatibility. This diverts time and resources from core research activities. A similar problem is encountered when managing the use of multiple foundation models. Foundation model technology is rapidly advancing, and model formats are again non-standardized, meaning that data scientists often have to invest significant time in tooling development, rigorous testing, and regular updates to pipelines to incorporate new or additional models.

Aside from these costs, the infrastructure challenges for AI operations require significant time and specialized engineers to manage. Organizations often need technical operations specialists to tightly control these environments. These teams are responsible for ensuring the security of infrastructure, implementing robust access controls, and protecting sensitive data. Additionally, they must frequently update and optimize infrastructure to keep up with the evolving demands of AI workloads and maintain high performance. This involves regular software updates, system scaling, and performance tuning, all of which require extensive expertise and ongoing attention.

Further adding to AI development costs is the management of the model pipelines a team must develop and maintain. Image file format inconsistencies require additional work to maintain compatibility in model pipelines. Ideally, data scientists maintain a single pipeline that works for all models they use, but in practice this is nearly impossible and thus work is often duplicated developing multiple pipelines. Scientists also spend considerable time tracking down models to use, reading documentation to implement them properly and developing code required to unify them with existing model development pipelines. Testing pipelines undergoing continual maintenance requires additional data scientist cycles. Further, properly optimizing these pipelines often requires the expertise of specialized AI engineers. Without this expertise, organizations again suffer decreased output from their data science teams.

Finally, the engineering costs associated with building and integrating a pathology platform for data warehousing are substantial. Engineers must ensure that the platform can handle the high-resolution pathology images, implement secure access controls, and facilitate smooth data retrieval. Integrating the pathology platform with AI and data analysis pipelines requires specialized knowledge to create custom APIs, ensure compatibility of AI pipeline inputs and outputs, and maintain efficient data flow between systems. Continuous updates and optimizations are necessary to accommodate new technologies and meet evolving research needs. The complexity of these tasks demands skilled engineers and ongoing investment, further contributing to the overall costs and resource requirements.

Data scientists and engineers spend a significant portion of their time on all of these development and maintenance activities. With all of the above factors, even organizations with small data science teams can easily spend hundreds of thousands if not millions of dollars overcoming these computational challenges with developing pathology AI models.

Data Storage and Transfer Costs

The data storage costs associated with all of the challenges explored in this work are difficult to estimate and vary greatly from one organization to another. What is clear is that organizations running their own hardware face considerable hardware requirements and a huge personnel lift to enable efficient AI operations. In cloud-based organizations, data scientists are often not directly privy to what their operations are costing the organization, and therefore spending can easily balloon out of control. This is especially true when, as discussed in the infrastructure section, multiple transfers and copies of data are made due to inefficient setups.

Concentriq Embeddings Solution

With Concentriq Embeddings, Proscia offers a solution that addresses many of the challenges and associated costs for computational pathology AI development. Concentriq Embeddings accelerates and drives down the cost of AI model development by enabling organizations to generate WSI embeddings (numerical representations of data that capture essential features and relationships within the data) from a collection of leading foundation models through the Concentriq platform API.

Data scientists simply assemble a WSI dataset in Concentriq and click a button to request embeddings from a chosen foundation model, specifying the resolution and, optionally, the region of interest. The request for embeddings is sent to the chosen foundation model via the Concentriq API, and tile embeddings are quickly generated and returned to the user, ready for downstream AI model development (Figure 1).

Figure 1. Workflow to generate embeddings using Concentriq Embeddings.
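
The exact API calls are not described here; purely as a hypothetical sketch of the workflow in Figure 1, a data scientist's script might look something like the following, where the endpoint URL, parameter names, and response format are all invented for illustration and do not represent the actual Concentriq API.

```python
# Hypothetical sketch of the Figure 1 workflow. The endpoint, parameter names,
# and response structure are illustrative inventions, not the Concentriq API.
import requests

API_URL = "https://concentriq.example.com/api"   # placeholder instance URL
TOKEN = "<api-token>"                            # platform credentials

def request_embeddings(dataset_id: int, model: str, mpp: float) -> dict:
    """Ask the platform to embed every WSI in a dataset with a chosen model."""
    resp = requests.post(
        f"{API_URL}/embeddings",                 # hypothetical endpoint
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "dataset_id": dataset_id,            # WSI dataset assembled in Concentriq
            "model": model,                      # chosen foundation model
            "mpp": mpp,                          # requested resolution (microns per pixel)
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()                           # e.g., a job reference for retrieving the output
```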

Concentriq Embeddings paired with the Concentriq platform drastically reduces memory, infrastructure, and data management challenges. Here’s how it helps:

Reduces WSI Memory Footprint

Traditionally, data scientists need to download large datasets of images from their pathology platforms and store them in another data warehouse to work with foundation models. With Concentriq Embeddings, foundation model embeddings are available directly through the Concentriq API and can be requested at any resolution supported by the WSI’s base magnification. Researchers can perform their entire workflow to build their own downstream models without ever having to manage WSI storage, manipulation, or transfer. 

Furthermore, foundation models heavily compress visual information, transforming slides into embeddings that occupy a roughly 5-50X smaller memory footprint while still capturing vital visual features. This enables AI development on standard laptops, with no expensive GPU, memory, or storage requirements.
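
For a rough sense of scale, the estimate below uses assumed numbers (a 2 GB WSI, about 20,000 informative tiles, and a 1,024-dimensional float32 embedding per tile from a tile-level foundation model); actual figures depend on the slide, the tiling strategy, and the model.

```python
# Rough sizing comparison, using assumed numbers for illustration.
wsi_size_gb = 2            # compressed WSI on disk (assumption)
num_tiles = 20_000         # informative tiles per slide (assumption)
embedding_dim = 1_024      # embedding dimension (assumption)
bytes_per_float = 4        # float32

embedding_gb = num_tiles * embedding_dim * bytes_per_float / 1e9
print(f"{embedding_gb:.2f} GB of embeddings")            # ~0.08 GB
print(f"~{wsi_size_gb / embedding_gb:.0f}x smaller")     # ~25x, within the 5-50X range above
```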

Simplifies WSI Format Management

Rather than data scientists spending development time wrangling libraries and packages to open files from multiple slide vendors, and keeping up with the maintenance of such a system as new data types and packages appear, Concentriq Embeddings handles all of the file type support and management. WSIs in various file formats can be processed by foundation models through Concentriq Embeddings, and the outputs are delivered in the standard, easy-to-use safetensors format.
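
Because the output is standard safetensors, the returned embeddings can be read with the widely used safetensors library; the file name and the key layout below are assumptions for illustration.

```python
# Loading tile embeddings from a safetensors file. The file name and the key
# layout inside it are assumptions for illustration.
from safetensors.numpy import load_file

tensors = load_file("slide_123_embeddings.safetensors")
for name, array in tensors.items():
    # e.g., one (num_tiles, embedding_dim) array per slide or region of interest
    print(name, array.shape, array.dtype)
```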

Reduces Infrastructure Demands

While traditional AI development requires buying specialized hardware or maintaining GPU-heavy and data-heavy specialized cloud computing resources, Concentriq Embeddings does all of this heavy lifting. Since Concentriq Embeddings performs the WSI manipulation, orchestration, and inference through large foundation models, data scientists manage only a very small portion of the required compute. In many cases, data scientists can perform AI development on CPU-based systems, even on standard laptops. Importantly, Concentriq Embeddings scales GPU resources and can therefore support highly parallel processing that is often not available or supported by an organization's current infrastructure, meaning that results come faster.

Concentriq Embeddings moves computation closer to where the enterprise's pathology data lives, in Concentriq, eliminating much of the pain of slow data transfer processes and enabling data scientists to focus more on model development. Concentriq Embeddings also avoids storage bloat from duplicating visual information in intermediate data products, and the embeddings themselves are much more lightweight to transfer and store than their whole slide image counterparts.

Streamlines Model Pipeline Management

With the rapid pace of development of foundation models, it is a challenge for data scientists to be sure they are up to date with the latest and best technology for their downstream applications. Concentriq Embeddings eliminates the majority of the burden of building and maintaining model pipelines that must remain compatible with multiple image formats and with foundation models that differ in format, delivery method, inputs, and outputs. Since much of the model pipeline is handled by Concentriq Embeddings, the time spent building and maintaining these pipelines is greatly reduced.

Minimizes Challenges Associated With Leveraging Data for AI Development

Developing AI and building on the promise of foundation models requires a significant investment in a pathology platform to seamlessly exchange data with AI pipelines. The Concentriq platform aggregates the enterprise’s pathology data and R&D workflows into one centralized location, generating a continuously growing, constantly enriched data foundation for computational science teams to leverage for AI innovation. Concentriq Embeddings brings AI development closer to the pathology data used to fuel it, ridding organizations of the need to build complex functionality and integrations from scratch just to leverage foundation models. By generating embeddings directly through the platform API, resource burdens are reduced, allowing researchers to focus on generating insights without the strain of maintaining complex infrastructure. 

Conclusion

To drive real impact from AI in pathology, organizations are spending hundreds of thousands to millions of dollars to overcome technical development challenges, including managing large memory footprints, inconsistent image formats, infrastructure demands, and complex pipelines. Much of an organization’s investment in AI is not going to scientific research and model development, but to navigating around the operational challenges in these areas. 

Concentriq Embeddings simplifies AI development by addressing these challenges, reducing the need for specialized hardware and enabling efficient AI workflows on standard laptops. Life sciences organizations can expedite AI model development that accelerates the discovery and development of precision medicine diagnostics and therapies, while reinvesting millions of dollars in computational savings back into data science and AI programs to drive high-value innovation at scale.

Vaughn Spurrier, Ph.D., is an AI Research Team Lead at Proscia
Corey Chivers, Ph.D., is a Senior AI Scientist at Proscia
Julianna Ianni, Ph.D., is the Vice President, AI Research & Development at Proscia 
