Click-Through Rate Prediction Using Two-Tower Neural Networks: A Scalable Multi-Modal Deep Learning Framework

Published on 22 Apr 2026

Figure: CTR prediction retrieval pipeline workflow

Abstract

Click-through rate (CTR) prediction is a foundational component of modern recommender systems and digital advertising platforms, directly influencing user engagement, personalization, and revenue optimization. Traditional machine learning models often struggle to scale effectively with high-dimensional, heterogeneous data. This white paper presents a comprehensive analysis of the Two-Tower Neural Network architecture, emphasizing its scalability, efficiency, and adaptability in large-scale retrieval systems.

The study further extends the classical two-tower framework into a multi-modal paradigm that integrates textual, visual, and structured data for improved representation learning. Advanced interaction mechanisms, including token-level matching inspired by late interaction models such as ColBERT, are examined to address limitations of coarse vector similarity. Experimental validation on the MS COCO benchmark dataset shows that the proposed multi-modal two-tower model outperforms unimodal baselines in retrieval and ranking tasks. The findings establish the architecture as a robust solution for real-time, high-throughput CTR prediction in modern AI-driven ecosystems.

Introduction

CTR prediction estimates the probability that a user will interact with a given item, such as an advertisement, product, or recommendation. It is a critical metric in domains such as digital advertising, e-commerce, and content recommendation, where accurate predictions directly impact user satisfaction and monetization strategies.

With the exponential growth of user data and item catalogs, traditional models—including logistic regression and gradient-boosted trees—face limitations in capturing complex user-item interactions at scale. Deep learning approaches, particularly the Two-Tower architecture, have emerged as a scalable alternative capable of learning high-dimensional representations while supporting real-time inference.

The Two-Tower model is widely adopted in industrial systems due to its ability to decouple user and item representations, enabling efficient offline computation and fast online retrieval. This paper explores its architecture, extensions, and application in multi-modal environments.

Two-Tower Architecture for CTR Prediction

The Two-Tower model consists of two independent neural networks: a User Tower and an Item Tower. Each tower encodes its respective input into dense vector representations within a shared embedding space.

The User Tower processes user-specific features such as demographics, browsing history, and contextual signals.
The Item Tower encodes item attributes, including metadata, textual descriptions, and visual features.

The relevance score between a user and an item is computed using a similarity function, typically the dot product or cosine similarity between their embeddings. Because the two towers interact only at this final scoring step, each side can be encoded independently, which keeps scoring computationally cheap.
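
The scoring step can be illustrated with a minimal sketch. The embedding values and the names user_embedding and item_embedding are hypothetical; in a real system they would be the outputs of the User Tower and Item Tower.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product divided by the product of the L2 norms."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    return dot(u, v) / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings standing in for tower outputs.
user_embedding = [1.0, 2.0, 2.0]
item_embedding = [2.0, 0.0, 1.0]

dot_score = dot(user_embedding, item_embedding)     # 1*2 + 2*0 + 2*1
cos_score = cosine(user_embedding, item_embedding)  # same value scaled by the norms
```

The dot product is sensitive to embedding magnitude, while cosine similarity depends only on direction; which one suits a given system is a design choice the source leaves open.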

A key advantage of this architecture is its ability to precompute item embeddings offline, enabling rapid nearest-neighbor search during inference. This significantly reduces latency and makes the model suitable for high-throughput applications such as real-time bidding systems.

Learning and Optimization Strategies

Training the Two-Tower model involves learning to distinguish between positive and negative user-item interactions.

Positive pairs represent actual user engagements (e.g., clicks or purchases).
Negative pairs are sampled from non-interacted items.

The most commonly used optimization objective is softmax cross-entropy loss, often combined with in-batch negative sampling. This approach improves training efficiency by leveraging other samples within the batch as implicit negatives.
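
A minimal sketch of softmax cross-entropy with in-batch negatives follows. It assumes dot-product logits with no temperature scaling, and that row i of the user batch is paired with row i of the item batch; the function name and the sample vectors are illustrative only.

```python
import math

def in_batch_softmax_loss(user_vecs, item_vecs):
    """Mean softmax cross-entropy where user i's positive is item i and
    every other item in the batch acts as an implicit negative."""
    losses = []
    for i, u in enumerate(user_vecs):
        logits = [sum(a * b for a, b in zip(u, v)) for v in item_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)

users = [[1.0, 0.0], [0.0, 1.0]]
# Items aligned with their users give a small loss; swapping them gives a large one.
aligned = in_batch_softmax_loss(users, [[5.0, 0.0], [0.0, 5.0]])
swapped = in_batch_softmax_loss(users, [[0.0, 5.0], [5.0, 0.0]])
```

Because the negatives come from the same batch, no separate sampling pass is needed, which is the efficiency gain the paragraph above refers to.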

Alternative loss functions such as Bayesian Personalized Ranking (BPR) can also be employed for ranking-focused optimization.
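
For comparison, the pairwise BPR objective for one (positive, negative) score pair can be sketched as below. The function name is an assumption, and regularization terms that usually accompany BPR are omitted.

```python
import math

def bpr_loss(pos_score, neg_score):
    """BPR pairwise loss: -log(sigmoid(pos_score - neg_score))."""
    diff = pos_score - neg_score
    # -log(sigmoid(d)) equals log(1 + exp(-d)); this form avoids overflow for large |d|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

tied = bpr_loss(1.0, 1.0)       # no preference between the two items
correct = bpr_loss(3.0, 0.0)    # positive scored above the negative
inverted = bpr_loss(0.0, 3.0)   # negative scored above the positive
```

The loss depends only on the score difference, so it rewards ranking the engaged item above the sampled negative rather than matching an absolute probability.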

Advanced Interaction Mechanisms

While the Two-Tower model is efficient, its reliance on single-vector representations can lead to information loss. To address this, advanced architectures such as late interaction models introduce finer-grained matching mechanisms.

Token-Level Matching

Inspired by models like ColBERT, token-level matching computes similarity at a more granular level by comparing individual token embeddings rather than aggregated vectors.

Instead of collapsing all information into a single embedding:

Queries and documents are represented as sequences of token embeddings.
For each query token, a MaxSim operation takes the maximum similarity over all document tokens.
The final relevance score is obtained by aggregating these per-token maxima (ColBERT sums them).

This approach preserves semantic richness and improves retrieval accuracy, albeit with increased computational complexity.
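
The token-level scoring described above can be sketched as follows, assuming dot-product similarity and summation as the aggregation, as in ColBERT. The token embeddings are hypothetical two-dimensional values chosen for readability.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token, keep its best-matching document token
    similarity, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical token embeddings: 2 query tokens, 3 document tokens.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

score = maxsim_score(query_tokens, doc_tokens)  # best matches are 0.9 and 0.8
```

The cost grows with the product of the query and document token counts, which is the added computational complexity noted above.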

Multi-Modal Extension of the Two-Tower Model

Modern recommendation systems require the integration of diverse data modalities, including text, images, and structured features. The Two-Tower architecture can be extended to support multi-modal inputs, significantly enhancing its representational power.

5.1 Multi-Modal Item Representation

The Item Tower incorporates specialized encoders:

Text Encoder: Processes descriptions and metadata using transformer-based models.
Image Encoder: Extracts visual features using CNNs or Vision Transformers.
Categorical Encoder: Handles discrete attributes such as category or brand.

These embeddings are fused using concatenation or neural layers to produce a unified item representation.

5.2 User Representation

The User Tower integrates:

Historical interaction sequences
Contextual information (time, device, location)
User identifiers

Sequential models such as LSTM or GRU capture temporal dependencies in user behavior.

5.3 Feature Fusion

Feature fusion is critical in multi-modal systems. The combined embeddings are passed through multilayer perceptrons (MLPs) to produce compact, fixed-size vectors that encode rich semantic information.
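
A minimal sketch of concatenation-based fusion followed by a single dense layer is shown below. The layer sizes, weights, and per-modality vectors are hypothetical; a production model would learn the weights and typically stack several layers.

```python
def dense_relu(x, weights, bias):
    """One fully connected layer with ReLU; weights holds one row per output unit."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def fuse(text_vec, image_vec, cat_vec, weights, bias):
    """Concatenate per-modality embeddings, then project to a fixed-size vector."""
    concatenated = text_vec + image_vec + cat_vec
    return dense_relu(concatenated, weights, bias)

# Hypothetical encoder outputs of different sizes.
text_vec, image_vec, cat_vec = [1.0, 2.0], [0.5], [-1.0]
# Hypothetical 3x4 weight matrix mapping the 4-dim concatenation to 3 dims.
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 2.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

item_embedding = fuse(text_vec, image_vec, cat_vec, weights, bias)
```

Whatever the input modalities, the output has a fixed dimension, so it can be compared directly with the user embedding.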

System Deployment and Scalability

The Two-Tower architecture is designed for production-scale deployment.

Offline Phase

Item embeddings are precomputed for millions or billions of items.
These embeddings are indexed for Approximate Nearest Neighbor (ANN) search using libraries such as FAISS.

Online Phase

User embeddings are generated in real time.
Fast vector search retrieves the top candidate items within milliseconds.

This separation of offline and online computation enables low-latency, high-throughput performance, which is essential for industrial CTR systems.
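
The two phases can be sketched with an exact top-k search standing in for the ANN index. This is an assumption made for brevity: a real deployment would use an ANN library, which approximates this search rather than scanning every item. The item identifiers and vectors are hypothetical.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Offline phase: item embeddings are precomputed and stored (hypothetical values).
item_index = {
    "item_a": [0.9, 0.1],
    "item_b": [0.1, 0.9],
    "item_c": [0.7, 0.7],
}

def retrieve_top_k(user_embedding, index, k):
    """Exact top-k by dot product, returned best first.
    An ANN index would approximate this step."""
    return heapq.nlargest(
        k, index, key=lambda item_id: dot(user_embedding, index[item_id])
    )

# Online phase: the user embedding is computed per request, then searched.
user_embedding = [1.0, 0.2]
top_items = retrieve_top_k(user_embedding, item_index, k=2)
```

Only the user embedding and the search run at request time; the item side is untouched until the catalog or the Item Tower changes.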

Experimental Evaluation

The proposed multi-modal Two-Tower model was evaluated using the MS COCO dataset, which contains images paired with descriptive captions. The task simulates cross-modal retrieval, serving as a proxy for CTR prediction in multi-modal environments.

7.1 Model Configuration

Image features extracted using Vision Transformers
Text encoded using transformer-based language models
Training performed using contrastive learning with in-batch negative sampling

7.2 Performance Metrics

The model was evaluated using:

Recall@K
Mean Rank
Normalized Discounted Cumulative Gain (NDCG)
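
For reference, the three metrics can be sketched as below, assuming binary relevance and a base-2 logarithmic discount for NDCG. The function names and the toy rankings are illustrative and are not the evaluation code used in the study.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top k results."""
    hits = sum(1 for item_id in ranked_ids[:k] if item_id in relevant_ids)
    return hits / len(relevant_ids)

def mean_rank(rankings, targets):
    """Average 1-based position of each query's target item (lower is better)."""
    ranks = [ranked.index(target) + 1 for ranked, target in zip(rankings, targets)]
    return sum(ranks) / len(ranks)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item_id in enumerate(ranked_ids[:k])
              if item_id in relevant_ids)
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg

# Two toy queries, each with one relevant target.
rankings = [["c", "a", "b"], ["x", "y", "z"]]
targets = ["a", "x"]

r1 = recall_at_k(rankings[0], {"a"}, k=1)  # "a" sits at position 2, so it is missed
r2 = recall_at_k(rankings[0], {"a"}, k=2)  # now included
avg_rank = mean_rank(rankings, targets)    # ranks are 2 and 1
ndcg = ndcg_at_k(rankings[0], {"a"}, k=3)  # 1 / log2(3)
```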

7.3 Results

The multi-modal Two-Tower model consistently outperformed unimodal baselines:

Higher Recall@1 and Recall@10 scores
Lower mean rank, indicating better retrieval accuracy
Improved NDCG, reflecting better ranking quality

These results confirm that multi-modal feature fusion significantly enhances model performance, validating the effectiveness of the proposed architecture.

Challenges and Future Directions

Despite its advantages, the Two-Tower model faces several challenges:

Information Bottleneck

The use of simple similarity functions (e.g., dot product) may fail to capture complex interactions between users and items.

Fusion Complexity

Combining multiple modalities efficiently remains a non-trivial task, requiring optimized architectures to balance accuracy and computational cost.

Hybrid Models

Future work should explore hybrid approaches that integrate:

Token-level interactions
Re-ranking strategies
Cross-encoder architectures

These methods aim to bridge the gap between efficiency and accuracy in large-scale retrieval systems.

Conclusion

The Two-Tower neural network architecture represents a powerful and scalable solution for CTR prediction in modern recommendation systems. Its ability to decouple user and item representations enables efficient large-scale deployment, while its adaptability supports integration with advanced deep learning techniques.

The extension to multi-modal data further enhances its capability, allowing richer representations and improved prediction accuracy. Experimental results demonstrate that the proposed model outperforms unimodal baselines, making it a strong candidate for real-world applications.

As digital ecosystems continue to evolve, the Two-Tower framework augmented with advanced interaction mechanisms and multi-modal learning will remain a cornerstone of next-generation recommender systems.
