
ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings
Abstract
Translating human-written mathematical theorems and proofs from natural language (NL) into formal languages (FLs) like Lean 4 has long been a significant challenge for AI. Most state-of-the-art methods focus either on theorem-only NL-to-FL auto-formalization or on FL proof synthesis from FL theorems alone. In practice, auto-formalization of both theorem and proof still requires human intervention, as seen in AlphaProof's silver-medal performance at the 2024 International Mathematical Olympiad, where problem statements were manually translated before automated proof synthesis. This paper presents ProofBridge, a unified framework for automatically translating entire NL theorem-proof pairs into Lean 4. At its core is a joint embedding model that maps semantically equivalent NL and FL (NL-FL) theorem-proof pairs into a shared semantic space, enabling cross-modal retrieval of semantically relevant FL examples to guide translation. ProofBridge integrates retrieval-augmented fine-tuning with iterative proof repair, leveraging Lean's type checker and semantic equivalence feedback to ensure both syntactic correctness and semantic fidelity. Experiments show substantial improvements over strong baselines including GPT-5, Gemini-2.5, Kimina-Prover, and DeepSeek-Prover, achieving +31.14% semantic correctness (SC) and +1.64% type correctness (TC) on MINIF2F-TEST-PF.
Introduction
In mathematics, ensuring the correctness of proofs is a crucial yet inherently difficult task. Traditionally, mathematicians rely on the peer-review process for proof verification, yet as proofs grow increasingly complex, even careful human scrutiny can overlook subtle errors. To mitigate the challenges of verifying complex proofs, proof assistants and formal mathematical languages like Coq, Isabelle, HOL Light, Lean 4, Metamath, and Peano have been developed, offering a way to create computer-verifiable formal proofs. However, constructing FL proofs is time-intensive and demands both deep mathematical expertise and detailed knowledge of the language and its libraries, making the process challenging even for experienced mathematicians. Two key research directions have emerged to simplify the task: auto-formalization (translating NL to FL) and automated formal proof synthesis (AFPS, generating FL proofs from FL theorems). Most prior works focus only on formalizing theorems (statements), not proofs. The key insight of ProofBridge is to address the task of proof auto-formalization end-to-end, treating it as learning from demonstrations: the LLM is guided not only by the NL proof but also by FL proofs retrieved using a novel NL/FL joint embedding model.
Tactic-Style Proofs in Lean 4
Lean 4 is a functional programming language and interactive theorem prover based on the propositions-as-types principle, where proving a proposition is equivalent to constructing a term of the corresponding type. Users write proofs in a tactic language, which provides high-level steps to guide term construction. Lean 4 represents tactic-style proofs as directed acyclic graphs (DAGs) of proof states and tactics, automatically generating the corresponding proof term in the background. The kernel then verifies the term by enforcing the axiomatic foundations of Lean's logic, the Calculus of Inductive Constructions. Each proof state consists of a sequence of zero or more open goals. Each open goal represents a proposition that needs to be proven given a set of premises (local hypotheses). Each tactic represents a proof step applied to an open goal, producing a new proof state. A parent goal is resolved once all of its subgoals are resolved. The proof auto-formalization problem is formalized as: given an NL theorem-proof pair (T_NL, P_NL), learn a function f that produces a corresponding Lean theorem-proof pair (T_FL, P_FL) satisfying both type correctness ((T_FL, P_FL) passes Lean type-checking) and semantic correctness ((T_FL, P_FL) is semantically equivalent to (T_NL, P_NL)).
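As a concrete illustration (a minimal sketch, not an example from the paper), a short Lean 4 tactic-style proof in which one tactic splits the proof state into two open subgoals might look like:

```lean
-- Commutativity of addition on Nat, proved by induction.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- `induction` replaces the goal with two open subgoals:
  -- a base case and an inductive step (with hypothesis `ih`).
  induction b with
  | zero => simp
  | succ n ih => rw [Nat.add_succ, ih, Nat.succ_add]
```

The parent goal is resolved once both subgoals are closed; Lean elaborates this tactic script into a proof term that the kernel then checks.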
Joint Embedding for Cross-Modal Retrieval
A core component of ProofBridge is the joint embedding model, which learns to represent NL theorem-proof pairs and their FL (Lean) counterparts in a shared semantic space, enabling cross-modal retrieval. During training, NL-FL pairs are encoded into vectors using two modality-specific encoders. The NL encoder uses all-MiniLM-L6-v2 (22.7M parameters) to encode the NL pair (T_NL, P_NL) into 384-dimensional embeddings, projected to a shared dimension d via a trainable linear layer. The FL encoder first extracts the linearized DAG traversal of tactics from (T_FL, P_FL) using the Lean REPL, representing the entire proof as an ordered sequence of proof-state transformations. Each state is encoded using LeanDojo's ByT5 proof-state encoder (218M parameters), and a single embedding is obtained via mean-pooling. The two encoders are aligned via a symmetric contrastive loss over mini-batches of B NL-FL embedding pairs (u_i, v_i):

L = -(1/2B) Σ_{i=1}^{B} [ log( exp(⟨u_i, v_i⟩/τ) / Σ_{j=1}^{B} exp(⟨u_i, v_j⟩/τ) ) + log( exp(⟨u_i, v_i⟩/τ) / Σ_{j=1}^{B} exp(⟨u_j, v_i⟩/τ) ) ],

where τ is a temperature hyperparameter. This loss encourages each NL embedding to be closest to its corresponding FL embedding, and vice versa. At inference, given a query theorem-proof pair in either modality, it is encoded into the shared space, and the top-k nearest neighbors from the other modality are retrieved as in-context demonstrations.
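A minimal sketch of this symmetric (InfoNCE-style) contrastive objective, with illustrative names and a NumPy implementation that is not the paper's code:

```python
import numpy as np

def symmetric_contrastive_loss(nl_emb, fl_emb, tau=0.07):
    """Symmetric contrastive loss over a mini-batch of B aligned NL-FL pairs.

    nl_emb, fl_emb: (B, d) arrays; row i of each encodes the same pair.
    tau: temperature hyperparameter (0.07 is an illustrative default).
    """
    # Normalize rows so dot products are cosine similarities.
    nl = nl_emb / np.linalg.norm(nl_emb, axis=1, keepdims=True)
    fl = fl_emb / np.linalg.norm(fl_emb, axis=1, keepdims=True)
    logits = nl @ fl.T / tau  # (B, B): entry (i, j) scores NL_i against FL_j

    def xent(m):
        # Cross-entropy with the diagonal (matching pair) as the target class.
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the NL->FL and FL->NL directions (hence "symmetric").
    return 0.5 * (xent(logits) + xent(logits.T))
```

Retrieval at inference is then a nearest-neighbor lookup in this shared space: encode the query with one encoder and rank the other modality's embeddings by cosine similarity.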
Retrieval-Augmented Fine-Tuning and Iterative Repair
ProofBridge fine-tunes Kimina-Prover-RL-1.7B using supervised learning on the NuminaMath-Lean-PF dataset, a curated collection of 38.9k NL↔Lean 4 theorem-proof pairs specialized for proof auto-formalization. For each training instance, a prompt is constructed containing the input NL theorem-proof pair (T_NL, P_NL) and the top-k retrieved FL theorem-proof pairs with relevance scores. The model is trained to generate an FL theorem-proof pair (T_FL, P_FL) using the standard auto-regressive language-modeling loss. During inference, however, the LLM may still generate FL pairs containing syntactic errors or semantic misalignments. ProofBridge addresses this through an iterative proof-repair mechanism combining two verification steps: (1) syntactic verification via Lean's type checker, extracting specific error messages if compilation fails; (2) semantic verification using an LLM-based equivalence judge to assess whether (T_FL, P_FL) accurately represents (T_NL, P_NL). When either check fails, error feedback is generated and the LLM repairs the output, iterating up to a fixed budget of N rounds until both checks pass.
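The repair loop can be sketched as follows; the generator, type checker, and equivalence judge are passed in as callables (all names and signatures here are illustrative assumptions, standing in for the LLM and Lean toolchain):

```python
def auto_formalize(nl_pair, generate, type_check, judge_equiv, max_iters=3):
    """Iterative repair loop: generate an FL pair, then repeatedly repair it
    until it both type-checks and is judged semantically equivalent.

    generate(nl_pair, feedback) -> candidate FL pair (LLM call, stubbed here)
    type_check(fl_pair)         -> (ok: bool, error_message: str)
    judge_equiv(nl_pair, fl)    -> (ok: bool, critique: str)
    """
    feedback = None
    fl_pair = None
    for _ in range(max_iters):
        fl_pair = generate(nl_pair, feedback)
        ok_tc, err = type_check(fl_pair)
        if not ok_tc:
            # Feed the Lean error back for a syntactic repair attempt.
            feedback = f"Lean type error:\n{err}"
            continue
        ok_sc, critique = judge_equiv(nl_pair, fl_pair)
        if not ok_sc:
            # Feed the judge's critique back for a semantic repair attempt.
            feedback = f"Semantic mismatch:\n{critique}"
            continue
        return fl_pair, True   # both checks passed
    return fl_pair, False      # repair budget exhausted
```

Type-checking runs first because semantic judgment of a non-compiling proof is wasted effort; only candidates that compile reach the equivalence judge.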
NuminaMath-Lean-PF Dataset
For training, ProofBridge constructs NuminaMath-Lean-PF from NuminaMath-LEAN, which contains 104,155 competition-level problems in algebra, geometry, number theory, combinatorics, and calculus. Each instance pairs an NL theorem with a human-written Lean v4.15.0 theorem; 38,951 instances include FL proofs (30% human-written, the rest generated by Kimina-Prover). Each FL pair is type-checked in Lean; approximately 6% (2,337) fail due to syntax errors, library mismatches, or incomplete proofs, and these are automatically repaired by extracting the error messages, prompting Gemini-2.5-Pro, and iteratively re-verifying. NL proofs are generated in two stages: first, solution-sketch retrieval matches each problem to NuminaMath 1.5 problem-solution pairs via exact string matching; second, FL-to-NL informalization uses Gemini-2.5-Pro to translate each verified FL proof into a detailed, human-readable NL proof. For validation, the paper curates MINIF2F-TEST-PF, a Lean v4.15.0 version of the widely used miniF2F-test auto-formalization benchmark containing 244 Olympiad-level problems from AIME, AMC, IMO, and undergraduate courses.
Experimental Results
ProofBridge is evaluated against 13 state-of-the-art LLMs across three categories. Theorem auto-formalization LLMs (Kimina-Autoformalizer-7B, Herald-Translator) achieve 0% TC and SC across all pass@k, as they are designed for theorem statements only, leaving proofs incomplete with sorry placeholders. Among foundation models, GPT-5-mini attains 9.02% TC and 34.84% SC at pass@32, struggling with Lean 4's strict syntax. Among AFPS LLMs, Kimina-Prover-72B achieves the strongest zero-shot performance at 46.31% SC and 79.51% TC. ProofBridge, built on top of the much smaller Kimina-Prover-RL-1.7B, surpasses that model's zero-shot performance by +22.54% SC and +20.49% TC, and its random few-shot performance by +31.14% SC and +1.64% TC. For cross-modal retrieval, ProofBridge's joint embedding model achieves 3.28× higher Recall@k for NL→FL and 1.94× higher for FL→NL than the baseline all-MiniLM-L6-v2 encoder, with the top-k retrieved embeddings 23% closer to the query and non-retrieved embeddings 104% farther away.
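The cross-modal Recall@k metric can be sketched as follows (an illustrative implementation, not the paper's evaluation code): a query counts as a hit if its ground-truth counterpart in the other modality, assumed here to share the same row index, appears among the k most similar retrieved embeddings.

```python
import numpy as np

def recall_at_k(query_emb, target_emb, k=5):
    """Fraction of queries whose ground-truth counterpart (same row index)
    appears among the k most cosine-similar targets."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sims = q @ t.T                            # (N, N) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]   # k best target indices per query
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return hits.mean()
```

Running this in both directions (NL embeddings as queries against FL targets, and vice versa) gives the NL→FL and FL→NL recall figures reported above.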
Ablation Studies and Analysis
The ablation study reveals the contribution of each component. ProofBridge (SFT only), fine-tuned on labeled NL-FL pairs and evaluated with semantically relevant examples via the joint embedding model, improves SC by +2.06% but reduces TC by −11.06% relative to Kimina-Prover-RL-1.7B. ProofBridge (Retrieval-augmented SFT), which fine-tunes the LLM with semantically relevant FL proofs included in the input, achieves +22.55% SC. ProofBridge (Retrieval-augmented SFT + Repair), adding iterative proof repair, yields the best results at +29.92% SC and +5.74% TC. The study also shows that random examples improve TC (syntax) but hurt SC by causing the model to hallucinate semantically misaligned proofs, while text-based retrieval via Qwen3-Embedding-8B improves SC but reduces TC, likely because QA-based retrieval favors proofs with similar tactics that reduce tactic diversity. This highlights the need for retrieving semantically relevant examples via a DAG-aware encoder, as in ProofBridge.
Conclusion
ProofBridge presents a unified NL-to-Lean proof auto-formalization framework that translates both theorems and proofs end-to-end. Its joint embedding model captures the DAG structure of Lean proofs, enabling highly effective cross-modal retrieval of semantically relevant FL proofs that serve as demonstrations for retrieval-augmented fine-tuning. Combined with an iterative verifier-guided repair loop using Lean type-checking and semantic equivalence checking, ProofBridge significantly outperforms state-of-the-art LLMs in both semantic correctness and type correctness on the MINIF2F-TEST-PF benchmark. The work demonstrates that integrating structured embeddings, retrieval guidance, and verifier feedback leads to more reliable proof auto-formalization, advancing the goal of bridging the gap between human mathematical reasoning and computer-verifiable formal proofs.