May 17, 2026

Why Multimodal: The Case for a Shared Space Across Video, Image, and Text

By KINETK Team

The most popular content on the internet is no longer text-based. It is shot, edited, and posted as a clip. The captions are short, the hashtags are noisy, and the most interesting versions of a trend often have neither. By the time a movement reaches the platforms that primarily index text, it has already mutated through several versions of itself that were never described in words.

This is the situation any system trying to understand the modern social web has to confront. Most AI infrastructure does not. It treats text as the universal medium, indexes captions and tags, and applies the same retrieval and reasoning techniques that worked on documents to content where the document is the wrong unit. The bet is that text carries enough signal. For news, papers, and code, that bet holds. For multimedia and social content, where the same trend lives in a clip, a screenshot, and a reaction video, it does not.

Kinetk made the opposite bet. The data lake is built on a shared embedding space across video, image, and text from day one. This post is the defense of that choice: why it matters, what it gives you, what it costs, and why we believe anyone serious about social intelligence will eventually arrive at the same conclusion.

The text bias in current AI infrastructure

Almost every popular tool in the AI infrastructure stack today is text-first. Vector databases are presented with text retrieval examples. RAG architectures assume the documents being retrieved are textual. Knowledge graphs assume entities are described in language. Embedding models that have multimodal variants are typically evaluated and benchmarked on text-only tasks. The tutorials, the blog posts, the example notebooks all use text as the demo modality.

This is not because builders do not care about images and video. It is because text is easier. A text token fits in a token budget. A document boundary is obvious. The cost of indexing a billion documents is well understood and well amortized in the existing stack. The latency of a text-to-text similarity search is predictable. Every layer of the infrastructure stack assumes the basic unit is text and that the basic operation is text similarity.

Visual content breaks each of those assumptions. The unit is not a token but a frame, or a segment, or a clip. The natural "document boundary" depends on how you decide to chunk. The cost of generating a multimodal embedding is meaningfully higher than a text embedding, because the input is much richer. Cross-modal similarity does not behave like within-modal similarity, which means search latency, threshold tuning, and ranking calibration all shift.

The path of least resistance, even for teams that recognize the problem, is to keep the pipeline text-only and gesture at images later. The result is an industry where most AI products are text-first in places where the user's actual signal is visual. The text bias is not a deliberate choice. It is the accretion of years of infrastructure shaped around a different assumption about what content looks like.

Why social content forces the issue

A specific fitness routine starts appearing on TikTok. Five creators post variations of it. Within days, the same routine reappears on Instagram, reshot by larger creators. A week later, screenshots circulate on Reddit. A reaction video appears on YouTube. The underlying piece of culture is the same. Each version has different captions, different hashtags, different creators, different platforms, and different audiences. Some of them have no caption text at all.

A text-only intelligence system sees the captions. It might catch some of the original TikTok versions if the routine has a memorable name. It will probably miss the reshots, since the new captions are different. It will miss the screenshots almost entirely, because the visual content there is the entire signal. It will see the YouTube reaction video as a separate thing, since the caption emphasizes the reactor's frame rather than the routine.

A multimodal system sees the same visual thing in each of those places. The clips are similar in the vector space because the actual content is similar. The screenshots cluster with the videos because the frames overlap. The reaction video sits adjacent to the originals because the source footage is embedded inside the reactor's frame. The system recognizes a single piece of culture moving across platforms, even though the text describing it disagrees at every step.

This is not an unusual case. It is how most things move on the modern social web. Every trend that matters is visual. Every creator who matters is judged partly by what their video looks like, not only by what they say about it. Every campaign that matters travels on visual cues that move without being labeled. A system that cannot reason about images and video is reasoning about a shadow of what is actually happening, and the shadow is shrinking every year as text becomes a smaller fraction of what users actually post.

The argument for multimodal is therefore not aesthetic. It is the only way to build an intelligence layer that sees what social content has become. Anything less is a system that was built for the internet of 2015.

What a shared embedding space actually means

The technical move that makes multimodal real is a single embedding model that can take any of text, image, or video as input and produce a vector in the same coordinate system. A text query for "luxury running watch" and an image of a running watch both map into the same high-dimensional space. They are vectors with the same dimensionality and the same notion of proximity.

This is meaningfully different from having two separate models, one for text and one for images. With separate models, "find images similar to this text" requires either training a joint projection or doing a clumsy cross-system lookup. The relationship between text and image lives outside both models, in a third piece of infrastructure that has to be maintained. Inconsistencies between the two models leak into every query that crosses modalities.

With a shared model, the relationship lives inside the model itself. The same model that produces the text vector also produces the image vector. Both already know about each other, because both were trained jointly to occupy a single space. Search becomes a single operation: embed the query, look up nearest neighbors in the unified index. The pipeline does not need to know whether the request is text-to-image, image-to-video, or text-to-text. It runs the same code path.
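The single code path can be sketched in a few lines. This is a minimal illustration, not our production index: the Item type, the search function, and the three-dimensional vectors are all hypothetical, and the vectors stand in for the output of a jointly trained embedding model that is not shown here.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    modality: str  # "text", "image", or "video"; never consulted during search
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index: list[Item], query_vector: list[float], k: int = 3):
    """Brute-force nearest neighbors over one unified index.

    The query's modality is irrelevant here: by the time it reaches this
    function it is just a vector in the shared space.
    """
    scored = [(cosine(query_vector, item.vector), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy vectors standing in for real model output.
INDEX = [
    Item("clip_a", "video", [0.9, 0.1, 0.0]),
    Item("shot_a", "image", [0.8, 0.2, 0.1]),
    Item("post_b", "text", [0.0, 0.2, 0.9]),
]

# Pretend this vector came from embedding a text query.
text_query_vector = [0.7, 0.3, 0.1]
top = search(INDEX, text_query_vector, k=2)
```

Nothing in search branches on modality. A real deployment would swap the brute-force loop for an approximate nearest-neighbor index, but the shape of the call stays the same.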

The cost of this uniformity is that every component of the pipeline has to assume cross-modal behavior. The retrieval layer cannot fall back on text-to-text assumptions. The ranking layer cannot use thresholds calibrated on a single modality. The deduplication pass cannot assume that a high cosine similarity implies textual similarity. The discipline is in not letting text-only assumptions creep back in once the multimodal index is in place. They will try to. Every off-the-shelf reference architecture nudges you back toward text-first defaults. Staying multimodal is an act of continuous architectural restraint.

The cross-modal gap is a real constraint, not a footnote

The promise of a shared space is that cross-modal queries work. The reality is that they work in a specific, measurable way that requires understanding to use.

When you compare a text query to a video using a shared multimodal model, the cosine similarity values you get back are systematically lower than when you compare a video to another video. A near-duplicate video might score around 0.9 against its original. A perfectly relevant text query might cap around 0.5 against the same video. This is the cross-modal similarity gap.

The gap is not a bug. It comes from the way the model maps each modality into the shared space. Video and image embeddings cluster tightly with their own modality because the input is high-dimensional and information-rich. Text embeddings sit in a sparser, narrower region of the same space because the input carries less signal. The space is shared but not uniformly populated.

The implication is that you cannot use a global similarity threshold across modalities. A 0.5 cosine that is a strong score for text-to-video is a weak score for video-to-video. A pipeline that uses a single threshold for "is this result relevant" will either silently return nothing for text queries, if the threshold is tuned for video, or return nearly everything for video queries, if it is tuned for text. The same threshold cannot be right on both sides.
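A toy example makes the failure concrete. The scores below are invented for illustration and only follow the rough ranges described above (near-duplicate video pairs around 0.9, relevant text queries capping around 0.5); the function name is hypothetical.

```python
def filter_by_global_threshold(scores: list[float], threshold: float) -> list[float]:
    """Keep every score at or above one fixed, modality-blind cutoff."""
    return [s for s in scores if s >= threshold]


# Illustrative similarity scores for the same video corpus.
text_to_video = [0.48, 0.41, 0.22, 0.15]   # a text query against videos
video_to_video = [0.91, 0.78, 0.62, 0.55]  # a video query against videos

# A cutoff tuned for video-to-video silences the text query.
strict_text = filter_by_global_threshold(text_to_video, 0.7)    # []
strict_video = filter_by_global_threshold(video_to_video, 0.7)  # [0.91, 0.78]

# A cutoff tuned for text-to-video lets every video candidate through.
loose_text = filter_by_global_threshold(text_to_video, 0.4)     # [0.48, 0.41]
loose_video = filter_by_global_threshold(video_to_video, 0.4)   # all four scores
```

Neither cutoff produces a sensible result set for both query types at once.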

Our retrieval pipeline handles this by normalizing within the result set rather than against a global anchor. The candidate set for a single query has its own range of similarity scores, and relevance is decided relative to that range. We never use a global cutoff. This is not a clever trick. It is the only consistent way to run a multimodal pipeline at production scale. Without it, every cross-modal query degenerates into either silence or noise.
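One way to sketch per-result-set normalization is min-max scaling over the candidates of a single query. This is an assumption made for illustration: the exact normalization is not specified here, and the function names and the 0.6 relative cutoff are hypothetical.

```python
def normalize_within_result_set(scores: list[float]) -> list[float]:
    """Min-max scale scores to [0, 1] using only this query's candidates."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All candidates tie, so none can be preferred over another.
        return [1.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


def select_relevant(candidates: list[tuple[str, float]],
                    relative_cutoff: float = 0.6) -> list[str]:
    """Return ids whose score is high relative to this result set's range."""
    if not candidates:
        return []
    normalized = normalize_within_result_set([score for _, score in candidates])
    return [
        item_id
        for (item_id, _), norm in zip(candidates, normalized)
        if norm >= relative_cutoff
    ]


# Illustrative raw cosine scores; the two queries live on different scales.
text_query_hits = [("v1", 0.48), ("v2", 0.41), ("v3", 0.22), ("v4", 0.15)]
video_query_hits = [("v1", 0.91), ("v5", 0.78), ("v6", 0.62), ("v7", 0.55)]

text_selected = select_relevant(text_query_hits)    # ["v1", "v2"]
video_selected = select_relevant(video_query_hits)  # ["v1", "v5"]
```

The same relative cutoff yields a comparable result set for both queries, even though their raw scores barely overlap. Plain min-max scaling has a known weakness: the top candidate always survives, even when nothing in the set is relevant, so a production version would need an additional guard for that case.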

The gap is also poorly surfaced in most multimodal demos. Tutorials tend to use text queries against text-heavy datasets, where the gap does not bite. The gap shows up the moment you try to search a video corpus with a text query, which is the use case social intelligence actually requires. Anyone deploying a multimodal system without designing around the gap is going to discover it in production, on a request path, in front of users.

What multimodal unlocks downstream

The downstream consequences of a shared embedding space are not limited to retrieval. Every layer above the vector index changes.

Ranking becomes media-agnostic. The same scoring function that ranks text content by engagement and recency also ranks video content. There is no separate code path for image ranking, because there is no separate vector space. The signals are uniform because the candidates are uniform. Adding a new modality does not require duplicating the ranking layer. It requires extending the embedding pipeline, and the rest of the stack inherits the addition.
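A media-agnostic scoring function might look like the sketch below. The weights, the 48-hour half-life, the engagement cap, and every name in it are hypothetical choices for illustration; the only point is that the modality field never enters the score.

```python
import math
from dataclasses import dataclass

MAX_ENGAGEMENT = 1_000_000  # hypothetical cap used to scale engagement to [0, 1]


@dataclass
class Candidate:
    item_id: str
    modality: str      # carried for display only; the scorer ignores it
    relevance: float   # similarity, assumed already normalized per result set
    engagement: int    # e.g. total interactions
    age_hours: float


def score(c: Candidate, half_life_hours: float = 48.0) -> float:
    """Blend relevance, engagement, and recency with illustrative weights."""
    engagement = min(1.0, math.log1p(c.engagement) / math.log1p(MAX_ENGAGEMENT))
    recency = 0.5 ** (c.age_hours / half_life_hours)  # halves every half-life
    return 0.6 * c.relevance + 0.25 * engagement + 0.15 * recency


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Sort a mixed-modality candidate list with one scoring function."""
    return sorted(candidates, key=score, reverse=True)


clip = Candidate("clip", "video", relevance=0.9, engagement=10_000, age_hours=12)
post = Candidate("post", "text", relevance=0.9, engagement=10_000, age_hours=12)
old_shot = Candidate("old_shot", "image", relevance=0.9, engagement=10_000, age_hours=240)

ranked = rank([old_shot, clip, post])
```

The video clip and the text post receive identical scores because they differ only in modality, while the older screenshot falls behind on recency alone.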

Clustering becomes media-agnostic. A narrative cluster can include video, image, and text content if the underlying ideas overlap. A fitness movement that exists primarily as video on TikTok and primarily as screenshots on Reddit ends up in the same cluster, because the vectors group them together. The narrative layer does not need to be aware of which modalities its members come from. It does not need separate clustering logic per modality.

Duplicate detection works across modalities. A clip that exists as both a full-resolution video on YouTube and a screenshot on Reddit can be flagged as the same underlying content. The screenshot's image vector and the video's frame vector are close enough in the shared space for the dedup pass to recognize them as siblings. A text-only dedup pass would see different captions and treat them as separate pieces of content, missing one of the most common forms of cross-platform reproduction.
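A screenshot-versus-video check can be sketched as a best-frame match: compare the image vector against each frame vector of the video and keep the highest similarity. The function names, the toy vectors, and the 0.95 threshold are assumptions for illustration. Because both sides are visual, a fixed threshold is less problematic here than in the text-to-video case.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_frame_match(image_vector: list[float],
                     frame_vectors: list[list[float]]) -> float:
    """Highest similarity between the image and any single frame."""
    return max(cosine(image_vector, frame) for frame in frame_vectors)


def is_probable_duplicate(image_vector: list[float],
                          frame_vectors: list[list[float]],
                          threshold: float = 0.95) -> bool:
    """Flag the image as a sibling of the video if one frame nearly matches."""
    if not frame_vectors:
        return False
    return best_frame_match(image_vector, frame_vectors) >= threshold


# Hypothetical frame vectors sampled from one video.
video_frames = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]

screenshot = [0.59, 0.81, 0.02]      # almost identical to the second frame
unrelated_image = [0.5, 0.5, 0.7]    # not close to any single frame

screenshot_is_dup = is_probable_duplicate(screenshot, video_frames)      # True
unrelated_is_dup = is_probable_duplicate(unrelated_image, video_frames)  # False
```

Taking the maximum over frames matters because a screenshot captures one moment of the video, not its average.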

Search by example becomes possible without text crutches. A user, an agent, or an analyst can submit a clip and get back content that looks like it, regardless of what any of the results were captioned. This is not exotic. It is what social platforms' internal recommendation systems already do at scale. The difference is that Kinetk exposes it as a queryable API rather than burying it inside an algorithmic feed.

Narrative formation works on the underlying culture rather than the surface vocabulary. When a trend reshapes itself with different captions each time it moves, the multimodal layer follows the underlying visual signature. The narrative system does not have to be told that the same routine has five different names on five different platforms. It infers the connection from the vectors. The captions are useful when they exist, but the system does not depend on them.

None of these is dramatic by itself. The compound effect is that the whole intelligence stack stays consistent across the modalities the modern internet actually uses. Everything built on top inherits that consistency. The cost was paid once, at the foundation. The benefit is paid out on every query.

The foundational choice

The decision to build multimodal from day one is the load-bearing decision in our architecture. Almost every other choice in the platform follows from it. The vector index supports multiple target vectors because content is not one modality. The schema separates content rows from creator and community nodes because relationships exist independent of medium. The retrieval pipeline normalizes per result set because cross-modal similarity is not globally comparable. The ranking layer uses signals that work across modalities because the candidates are mixed by design.

If we had started text-only and tried to bolt on multimodal later, we would now be running two indexes that do not talk to each other, two retrieval paths with different latency profiles, two ranking systems with separate calibration, and two dedup passes that disagree about whether a clip is a duplicate of its own screenshot. We would be doing more work and getting less for it.

The bet underneath the architecture is that anyone serious about social intelligence eventually arrives at the same conclusion. The trends do not happen in text. The creators are not text. The communities are not text. The signals that matter are visual, temporal, and cross-platform. Building infrastructure that treats text as one modality among several, rather than as the universal medium, is what makes the rest of the platform usable in the world the internet has actually become.

This is the argument for multimodal as the foundation rather than the feature. Every other choice in the platform either follows from it or fights against it. We chose to let it lead. Our case is that this is not optional infrastructure for the next decade of AI on social content. It is the only infrastructure that fits the shape of the data.

The argument is the product.