The Structure and Interpretations of Multimodal Systems

I. The Impedance Mismatch

I came to multimodal AI from a background shaped by SICP — Abelson and Sussman's conviction that engineering discipline means managing complexity, not heroically wrestling with it. What I found was a field drowning in accidental complexity.

A photograph is a two-dimensional array of pixel intensities. A video is a sequence of such arrays indexed by time. An audio recording is a one-dimensional signal of pressure variations. A document is a tree of semantic elements containing text runs, embedded images, and structural markers.

These are not the same kind of thing. Yet modern AI systems routinely consume all of them, transform them, and produce outputs that blend modalities — generating images from text, transcribing speech to text, describing videos in natural language. The engineering challenge is not making models that can do this (that problem, surprisingly, is largely solved) but building systems that can orchestrate these heterogeneous data types without drowning in accidental complexity.

The fundamental tension: storage systems want uniform representations, but multimodal data resists uniformity. You can serialize anything to bytes, but bytes don't capture the semantics of frames-within-videos or pages-within-documents. You can store everything in a blob column, but then every operation requires deserialization, type-checking, and format-specific handling scattered throughout your codebase.

This is the impedance mismatch of multimodal systems.

II. Data as Transformation

The traditional view treats data as static: you have a corpus, you process it, you store the results. But I've come to see multimodal AI workflows differently — as transformation graphs. A video is not just a file. It's a potential source of frames, audio tracks, transcripts, embeddings, summaries, detected objects, and extracted text. Each of these derived forms has its own downstream uses.

The insight that changed how I think about this: derived data and source data have the same ontological status. A transcript is not "less real" than the audio it came from. An embedding is not a second-class citizen compared to the text it represents. These are all projections of the same underlying information, viewed through different lenses.

Once you see data this way, the question changes from "how do I store and retrieve data" to "how do I define and compose transformations." The storage problem becomes secondary to the transformation problem.
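
To make the shift concrete, here is a minimal sketch (all node names and the toy string-building functions are hypothetical) of a transformation graph represented as plain data: sources and derived forms are both just nodes, distinguished only by whether they carry a deriving function.

    # A minimal transformation graph: derived data is declared as a function of
    # its inputs rather than produced imperatively. Node names and the toy
    # string-building functions are hypothetical stand-ins.
    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Node:
        name: str
        inputs: list[str]                 # names of upstream nodes
        fn: Optional[Callable] = None     # None marks source data

    graph = [
        Node("video", inputs=[]),                                        # source
        Node("frames", inputs=["video"], fn=lambda v: f"frames({v})"),
        Node("audio", inputs=["video"], fn=lambda v: f"audio({v})"),
        Node("transcript", inputs=["audio"], fn=lambda a: f"transcribe({a})"),
        Node("embedding", inputs=["transcript"], fn=lambda t: f"embed({t})"),
    ]

    def materialize(graph, sources):
        """Compute every derived node from its inputs, in declaration order."""
        values = dict(sources)
        for node in graph:
            if node.fn is not None:
                values[node.name] = node.fn(*(values[i] for i in node.inputs))
        return values

    print(materialize(graph, {"video": "lecture.mp4"}))
    # The transcript and embedding are projections of the same source, with
    # the same status in the graph as the video itself.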

III. The One-to-Many Problem

Many multimodal operations have a structural property that traditional data systems handle poorly: they produce multiple outputs from a single input. A video becomes many frames. A document becomes many chunks. An audio file becomes many segments with timestamps.

This is not a detail — it's a fundamental pattern. Retrieval-augmented generation requires chunking. Video analysis requires frame extraction. Audio processing requires segmentation. Any system that treats these as special cases, handled by application code outside the data layer, will accumulate complexity at the boundaries.

The relational model handles one-to-many through foreign keys and joins. But the multimodal case is different: the "many" side doesn't exist independently. Frames don't have identity apart from their source video and position. Chunks don't exist except as views into a document. The relationship is constitutive, not referential.

A sound architecture treats these derived collections as first-class entities with automatic lifecycle management. When the source changes, the derived collection updates. When the source is deleted, the derived collection disappears. No orphan cleanup, no consistency bugs, no manual synchronization.
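
A minimal sketch of that lifecycle, with hypothetical names: the derived collection is registered against its source, each derived row's identity is its source plus position, and deleting the source removes the derived rows with no separate cleanup step.

    # Sketch: a derived collection whose lifecycle is tied to its source.
    # 'extract_frames' is a hypothetical stand-in for real frame extraction.
    class Table:
        def __init__(self):
            self.rows = {}              # id -> value
            self.views = []             # (target_table, derive_fn)

        def add_view(self, target, derive_fn):
            """Register a derived table: one source row becomes many target rows."""
            self.views.append((target, derive_fn))

        def insert(self, row_id, value):
            self.rows[row_id] = value
            for target, derive_fn in self.views:
                for i, item in enumerate(derive_fn(value)):
                    target.rows[(row_id, i)] = item      # identity = source + position

        def delete(self, row_id):
            del self.rows[row_id]
            for target, _ in self.views:
                target.rows = {k: v for k, v in target.rows.items() if k[0] != row_id}

    def extract_frames(video):                           # pretend: 3 frames per video
        return [f"{video}#frame{i}" for i in range(3)]

    videos, frames = Table(), Table()
    videos.add_view(frames, extract_frames)
    videos.insert("v1", "lecture.mp4")
    print(len(frames.rows))    # 3: frames exist only as a projection of v1
    videos.delete("v1")
    print(len(frames.rows))    # 0: no orphan cleanup, no manual synchronization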

IV. Declarative over Imperative

SICP taught me to distinguish essential complexity from accidental complexity. The imperative approach to multimodal pipelines is almost entirely accidental complexity: write a script that iterates over videos, extracts frames, runs them through a model, stores embeddings, tracks which videos have been processed, handles failures, manages retries. The declarative approach: specify that embeddings should exist for all frames, and let the system figure out execution.
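
A sketch of the declarative half, with hypothetical names: the specification is just "an embedding should exist for every frame," and a small reconciler computes the difference between what exists and what should exist. Batch sizes, parallelism, caching, and retries can all live inside the reconciler without touching the specification.

    # Sketch: a declarative spec plus a reconciler. 'embed_frame' is a
    # hypothetical stand-in for a real model call.
    def embed_frame(frame):
        return [0.0, 1.0]                         # placeholder for real inference

    frames = {"f1": b"...", "f2": b"...", "f3": b"..."}
    embeddings = {"f1": [0.0, 1.0]}               # already partially materialized

    def reconcile(spec_fn, sources, derived):
        """Compute derived values only for sources that lack them."""
        missing = set(sources) - set(derived)
        for key in missing:
            derived[key] = spec_fn(sources[key])
        return missing

    print(sorted(reconcile(embed_frame, frames, embeddings)))   # ['f2', 'f3']

The bookkeeping of which items have already been processed, which the imperative version tracks by hand, collapses into a set difference.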

The declarative approach wins not because it's shorter (though it usually is) but because it separates concerns correctly. The specification of what you want is stable; the implementation of how to achieve it can evolve. You can change batch sizes, add parallelism, implement caching, and handle errors differently — all without touching the specification.

This is the same insight that made SQL successful. Nobody writes imperative code to implement a join; you declare the join and the query optimizer handles execution. Multimodal AI needs the same separation.

V. Incremental Computation

Batch processing assumes you can afford to recompute everything. This assumption breaks down as datasets grow and transformations become expensive (LLM calls, model inference, API costs).

Incremental computation means: when new data arrives, compute only what's affected. When a transformation changes, recompute only the outputs that depend on it. When data is deleted, clean up derived artifacts automatically.

This requires dependency tracking — knowing which outputs depend on which inputs and which transformations. In imperative code, dependencies are implicit in control flow and nearly impossible to extract. In declarative specifications, dependencies are explicit in the structure of the specification itself.

A properly designed multimodal system maintains a dependency graph and uses it to minimize redundant computation. Insert a new video, and the system computes its frames, embeddings, and summaries — but doesn't touch existing videos. Update a model, and the system knows which derived columns need recomputation.
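
A minimal sketch of the dependency side, with hypothetical node names: each derived node records its inputs, and a walk over the graph marks exactly the downstream outputs as stale.

    # Sketch: a dependency graph used to decide what needs recomputation.
    # Node names are hypothetical.
    deps = {
        "audio":      ["video"],
        "frames":     ["video"],
        "transcript": ["audio"],
        "summary":    ["transcript"],
        "embeddings": ["frames"],
    }

    def downstream_of(changed, deps):
        """Every node that transitively depends on the changed node."""
        dirty, frontier = set(), {changed}
        while frontier:
            frontier = {node for node, inputs in deps.items()
                        if any(i in frontier for i in inputs)} - dirty
            dirty |= frontier
        return dirty

    print(downstream_of("audio", deps))    # transcript and summary are stale
    print(downstream_of("frames", deps))   # only embeddings is stale
    # A new video dirties only its own derived rows; swapping the transcription
    # model dirties transcript and summary, and nothing else.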

VI. The Embedding as Interface

Embeddings are vectors that capture semantic similarity. Two pieces of content with similar meaning have vectors that are close in some high-dimensional space. This is now routine — but the architectural implications are underappreciated.

An embedding is a universal interface. Once you have embeddings, retrieval becomes vector similarity search regardless of the original modality. Text, images, audio, and video can all be projected into the same embedding space (or compatible spaces), enabling cross-modal retrieval and comparison.

This suggests a design pattern I've found reliable: embed early, embed everything, and keep embeddings synchronized with source data. The embedding layer becomes the semantic index of your system. Queries resolve to vector lookups. Relevance is computed by distance. The heterogeneity of modalities collapses into the homogeneity of vectors.
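
A sketch of the pattern with toy vectors (a real system would get them from an embedding model): once every item is a vector, retrieval is the same similarity computation whether the item began as text, an image frame, or an audio clip.

    # Sketch: modality-agnostic retrieval over an embedding index.
    # The vectors are toy values; a real system would get them from a model.
    import math

    index = {
        ("text",  "doc-17"):   [0.9, 0.1, 0.0],
        ("image", "frame-42"): [0.8, 0.2, 0.1],
        ("audio", "clip-3"):   [0.1, 0.9, 0.3],
    }

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    def search(query_vec, index, k=2):
        """Rank items by cosine similarity, regardless of original modality."""
        return sorted(index, key=lambda item: cosine(query_vec, index[item]), reverse=True)[:k]

    print(search([1.0, 0.0, 0.0], index))
    # [('text', 'doc-17'), ('image', 'frame-42')]: text and image ranked together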

The cost is that embeddings are lossy — they capture similarity but discard detail. A sound architecture maintains both: the original data for faithful reproduction, the embedding for efficient retrieval. These are complementary views, not alternatives.

VII. The Composition Problem

Real workflows chain multiple transformations: transcribe audio, extract entities from the transcript, generate embeddings from the entities, build a search index over the embeddings. Each step depends on the previous step's output.

Imperative composition is fragile. If the entity extraction step fails, you need error handling that knows about the downstream embedding step. If you want to retry a failed transcription, you need to track which downstream steps need re-running. If you change the entity extraction model, you need to invalidate the right cached results. I've seen this code. I've written this code. It's not fun to maintain.

Declarative composition handles this automatically. Each transformation is specified in terms of its inputs; the system infers the execution order, manages intermediate results, and propagates updates correctly. You define the shape of the pipeline, not the mechanics of execution.

The analogy is to spreadsheet formulas. Cell C1 contains =A1+B1. Cell D1 contains =C1*2. You don't write code to update D1 when A1 changes — the spreadsheet handles the dependency chain. Multimodal AI systems need the same property at scale.
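
A minimal sketch of that property in code, with hypothetical cell contents: each formula cell declares its inputs, and reading a cell walks the chain, so changing A1 is all it takes to change D1.

    # Sketch: spreadsheet-style composition. Formula cells declare their inputs;
    # evaluation follows the dependency chain, so updates propagate with no glue code.
    cells = {
        "A1": 3,
        "B1": 4,
        "C1": lambda get: get("A1") + get("B1"),   # =A1+B1
        "D1": lambda get: get("C1") * 2,           # =C1*2
    }

    def get(name):
        value = cells[name]
        return value(get) if callable(value) else value

    print(get("D1"))    # 14
    cells["A1"] = 10    # change the input...
    print(get("D1"))    # 28: the dependent cells follow automatically

A production system would memoize results and invalidate them through the dependency graph rather than recompute on every read, but the shape of the solution is the same.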

VIII. State and Identity

What does it mean for multimodal data to "change"? A video file is typically immutable — you don't edit individual frames in place. But the derived data (transcripts, summaries, embeddings) might change when models improve or when you update transformation parameters.

This suggests a distinction between source identity and derived state. The source data (the video file) has stable identity. The derived data (the transcript) has state that depends on the transformation that produced it. When the transformation changes, the derived state should update — but the source identity remains the same.

A sound architecture tracks this distinction explicitly. It knows which derived data came from which transformations, and it can re-derive state when transformations change without losing the connection to source identity. This enables versioning, rollback, and experimentation — you can try a new model on your data and compare results without duplicating the source.
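
A sketch of that bookkeeping, with hypothetical transform names: derived state is keyed by source identity plus the transform and its version, so a new model version produces new derived rows alongside the old ones without copying or mutating the source.

    # Sketch: derived state keyed by source identity plus transform version.
    # The transform name ("asr") and versions are hypothetical.
    sources = {"v1": "lecture.mp4"}        # stable identity, never mutated
    derived = {}                           # (source_id, transform, version) -> value

    def derive(source_id, transform, version, fn):
        derived[(source_id, transform, version)] = fn(sources[source_id])

    derive("v1", "asr", "v1", lambda v: f"old transcript of {v}")
    derive("v1", "asr", "v2", lambda v: f"new transcript of {v}")

    # Both transcript versions coexist against the same source identity, so
    # comparing models is a pair of lookups, not a re-ingestion of the data.
    print(derived[("v1", "asr", "v1")])
    print(derived[("v1", "asr", "v2")])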

IX. Abstraction Barriers

SICP introduces abstraction barriers as boundaries that separate levels of a system. Code above the barrier uses the abstraction; code below implements it. Changes below the barrier don't affect code above, as long as the interface is preserved.

Multimodal AI systems need clear abstraction barriers:

Storage vs. computation. How data is stored (local files, S3, database blobs) should be invisible to transformation logic. A transformation that generates frame embeddings shouldn't know whether frames are cached locally or fetched on demand.

Transformation vs. orchestration. The logic of a transformation (how to transcribe audio, how to extract entities) should be separate from how transformations are scheduled, parallelized, and retried. A transcription function shouldn't contain retry logic or batch size parameters.

Specification vs. execution. The description of a pipeline (these transformations, these dependencies) should be separate from how the pipeline runs (what hardware, what parallelism, what caching strategy). You should be able to run the same specification locally for development and distributed for production.
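
As one sketch of the first barrier, assuming a hypothetical Storage interface: the transformation sees only get_bytes, so a local-disk backend and an object-store backend are interchangeable beneath it.

    # Sketch: an abstraction barrier between storage and transformation logic.
    # Both backends and the toy embedding are hypothetical stand-ins.
    from typing import Protocol

    class Storage(Protocol):
        def get_bytes(self, key: str) -> bytes: ...

    class LocalStorage:
        def __init__(self, root: str):
            self.root = root
        def get_bytes(self, key: str) -> bytes:
            with open(f"{self.root}/{key}", "rb") as f:   # local files
                return f.read()

    class InMemoryStorage:                                # stand-in for an object store
        def __init__(self, blobs: dict):
            self.blobs = blobs
        def get_bytes(self, key: str) -> bytes:
            return self.blobs[key]

    def embed_frame(storage: Storage, key: str) -> list[float]:
        """Transformation logic: knows nothing about where the bytes live."""
        data = storage.get_bytes(key)
        return [float(len(data))]                         # toy 'embedding'

    print(embed_frame(InMemoryStorage({"frame-1": b"\x00" * 16}), "frame-1"))  # [16.0]
    # Swapping InMemoryStorage for LocalStorage (or S3) never touches embed_frame.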

Systems that blur these barriers accumulate technical debt. Storage details leak into transformation code. Retry logic duplicates across functions. Execution concerns pollute pipeline definitions. The result is code that's hard to test, hard to modify, and hard to reason about.

X. Toward a Principled Foundation

The patterns I've described here are not novel. They appear in database theory (declarative queries, views, normalization), in functional programming (composition, referential transparency, laziness), and in reactive systems (dependency tracking, incremental updates). The contribution of multimodal AI engineering is recognizing that these patterns apply — and that ignoring them leads to the tangled orchestration code that characterizes most multimodal systems today.

A principled foundation for multimodal AI would provide: native types for images, video, audio, and documents; declarative transformation specifications with automatic dependency tracking; first-class support for one-to-many derivation; incremental computation that minimizes redundant work; clean abstraction barriers between storage, transformation, and orchestration.

Such a foundation doesn't eliminate complexity — the inherent complexity of working with heterogeneous data remains. But it localizes complexity in the right places, behind the right abstractions, managed by the right mechanisms. The result is systems where the interesting work (the models, the transformations, the retrieval logic) isn't buried under infrastructure concerns.

That's the goal. Data infrastructure should be invisible. The engineer should think in terms of data and transformations, not pipelines and state machines. Multimodal AI is hard enough without making it harder than it needs to be.
