AI/ML Innovation Requires a Flexible Yet Governed Data Architecture
Data analytics, as with all technical disciplines, is a tale of two opposing forces: flexibility and governance. You need flexibility to support innovation, but governance controls to mitigate the risks of that innovation.
This blog explores the implications of this maxim for artificial intelligence (AI), including machine learning (ML) and generative AI (GenAI). On one hand, a data architecture must flexibly support many data structures, integration styles, and analytical models. On the other hand, it also must help govern usage to reduce AI/ML risks related to data quality, privacy, intellectual property (IP), bias, and explainability. Both sides of this coin require the careful attention of the data architects who design the environment and the data engineers who manage it.
Data architectures encompass three layers: infrastructure, integration, and access. Let’s explore the flexible elements in each layer that support AI/ML, working from the bottom up in the following diagram.
Data Architectures that Support AI/ML
Flexible infrastructure
Data infrastructure comprises various platforms that store, manipulate, and retrieve multi-structured datasets. Data teams are making their infrastructure increasingly flexible to handle new data types and workloads because AI/ML projects need more than traditional tables. They might need semi-structured logs from internet of things (IoT) sensors to track the performance of factory parts. They might need unstructured text summaries of customer conversations or biomedical research images.
To support all this, data teams are implementing lakehouses such as Databricks and Snowflake alongside legacy databases and data warehouses. They’re also adopting open table formats such as Apache Iceberg to flexibly support multiple tools and processing engines—Apache Spark, Trino, and so on—as AI/ML projects evolve.
A flexible infrastructure must support open table formats such as Apache Iceberg.
Flexible integration
The integration layer uses these infrastructure resources to prepare and deliver the data that feeds analytical models. It must be flexible because AI/ML projects need more than the traditional pipelines that extract, transform, and load (ETL) periodic batches of operational tables.
For example, some ML projects require an ELT pipeline that ingests multi-sourced data, then merges, cleanses, and reformats it in a lakehouse. A customer recommendation engine, meanwhile, might require data virtualization, and fraud prevention might need real-time streaming. To meet such varied requirements, data teams can evaluate flexible tools from providers such as CData that integrate multi-structured data using all these styles: ETL, ELT, streaming, and virtualization.
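To make the ELT pattern concrete, here is a minimal Python sketch of the load-then-transform flow described above. Every name, field, and cleansing rule is hypothetical and invented for illustration; it is not the API of any particular integration tool:

```python
# Minimal ELT sketch: extract raw records from several sources,
# load them as-is into a staging area, then transform in place.
# All source names and fields below are hypothetical.

def extract(sources):
    """Pull raw records from each source without reshaping them."""
    return [record for source in sources for record in source]

def load(staging, raw_records):
    """Land raw data first -- the 'L' comes before the 'T' in ELT."""
    staging.extend(raw_records)
    return staging

def transform(staging):
    """Cleanse and reformat after loading: drop incomplete rows,
    normalize casing, and standardize the amount field."""
    cleaned = []
    for rec in staging:
        if rec.get("customer_id") is None:
            continue  # cleanse: discard incomplete records
        cleaned.append({
            "customer_id": rec["customer_id"],
            "region": rec.get("region", "unknown").lower(),
            "amount": round(float(rec.get("amount", 0)), 2),
        })
    return cleaned

# Two hypothetical sources; the last record is incomplete on purpose.
crm = [{"customer_id": 1, "region": "EMEA", "amount": "19.994"}]
web = [{"customer_id": 2, "amount": 5}, {"region": "APAC"}]

staging = load([], extract([crm, web]))
curated = transform(staging)
print(curated)
```

The key ELT design choice visible here is that raw data lands in staging untouched, so the transformation logic can be rerun or revised later without re-extracting from the sources.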
Data teams need flexible tools that integrate data using multiple styles: ETL, ELT, streaming, and virtualization.
Flexible access
The access layer is where analytical models retrieve and consume data as part of training or inference. This layer must accommodate many types of analytical models, ranging from simple regressions to clustering, anomaly detection, prescriptive ML, and generative AI. Such flexibility requires open APIs and easy integration with a vibrant commercial and open-source ecosystem. This ecosystem includes AI/ML libraries such as PyTorch, programming languages such as Python, and MLOps tools such as MLflow. Data teams can achieve this flexibility by avoiding proprietary tools, platforms, or formats that limit interoperability.
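To ground the "simple regressions" end of that spectrum, here is a dependency-free sketch of an analytical model consuming data through the access layer. The closed-form least-squares fit is standard; the feature/target values are invented for illustration, and in practice a library such as PyTorch or scikit-learn would do this work:

```python
# Fit y = slope * x + intercept by ordinary least squares on data
# retrieved from the access layer. Pure Python for illustration.

def fit_simple_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical feature/target pairs pulled from the access layer.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]
slope, intercept = fit_simple_regression(xs, ys)
print(round(slope, 2), round(intercept, 2))  # -> 1.96 0.15
```

The same access-layer data could feed a clustering or anomaly-detection model instead, which is why the layer's job is open retrieval rather than model-specific delivery.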
A flexible data architecture must integrate with a vibrant commercial and open-source ecosystem.
Governed architecture
Flexibility creates complexity. And complexity raises the risk that teams mishandle data and run afoul of customers or regulators. Data teams must therefore maintain vigilant oversight and control of data usage. This brings us to the critical requirement for data governance across all three layers: infrastructure, integration, and access.
- Observability: First, data teams must observe how the architecture processes and delivers data, for example by monitoring latency, uptime, and so on.
- Validation: Second, they must validate that data is complete, consistent, and accurate, for example by comparing pipeline inputs and outputs.
- Lineage: Data teams also must track the lineage of that data. This helps teams understand, for instance, the steps involved in transforming unstructured objects—emails, images, audio files, etc.—into vector embeddings for a GenAI language model to consume.
- Access controls: Data teams must govern data consumption with role-based access controls that restrict user actions.
- Masking: Finally, they must identify and mask personally identifiable information (PII) to ensure customer privacy and comply with regulations such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA).
A governed data architecture supports AI/ML innovation with observability, validation, lineage, access controls, and masking.
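To illustrate the masking requirement in particular, here is a minimal Python sketch that detects and masks two common PII patterns (email addresses and US-style phone numbers) with regular expressions. The patterns are deliberately simplified for illustration; production systems combine broader pattern coverage with dedicated PII-detection tooling:

```python
import re

# Simplified PII patterns -- real deployments need broader coverage
# (international phone formats, names, addresses, and so on).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def mask_pii(text):
    """Replace detected PII with fixed placeholder tokens so
    downstream AI/ML consumers never see the raw values."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

# Hypothetical record on its way into a training dataset.
note = "Contact jane.doe@example.com or 555-867-5309 about the claim."
print(mask_pii(note))
```

Masking at the integration layer, before data reaches models, is what keeps raw PII out of training sets and inference logs in the first place.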
Controlled chaos
Innovation requires experimentation, which invites chaos. In this sense data architects and data engineers have a lot in common with kindergarten teachers! The most effective data architectures—and classrooms—offer the flexibility you need to innovate while still providing the necessary rules and guardrails. Data teams can achieve this by implementing the elements described in this blog across the holistic data environment: infrastructure, integration, and access.
To learn more about this topic, check out my recent webinar with Nick Golovin, CData SVP, Enterprise Data Platform.