We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO. The prover uses a language model to find proofs of formal statements. Each time we find a new proof, we use it as new training data, which improves the neural network and enables it to iteratively find solutions to harder and harder statements.

We achieved a new state of the art (41.2% vs 29.3%) on the miniF2F benchmark, a challenging collection of high-school olympiad problems. Our approach, which we call *statement curriculum learning*, consists of manually collecting a set of statements of varying difficulty levels (without proofs) where the hardest statements are similar to the benchmark we target. Initially our neural prover is weak and can only prove a few of them. We iteratively search for new proofs and re-train our neural network on the newly discovered proofs, and after 8 iterations, our prover ends up being vastly superior when tested on miniF2F.

Formal mathematics is an exciting domain to study because of (i) its richness, letting you prove arbitrary theorems that require reasoning, creativity and insight, and (ii) its similarity to games (where AI has been spectacularly successful) in that it has an automated way of determining whether a proof is successful, i.e., verified by the formal system. As demonstrated in the trivial example below, proving a formal statement requires generating a sequence of proof steps, each proof step consisting of a call to a tactic. These tactics take mathematical terms as arguments, and each tactic call transforms the current statement to prove into statements that are easier to prove, until nothing is left to prove.
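As a minimal illustration of this mechanism (a toy example in mathlib-style Lean, not drawn from our dataset), each tactic call below rewrites the goal until nothing is left to prove:

```
-- Toy example: `rw` transforms the goal `a + 0 = b` into `a = b`,
-- and `exact h` then closes it using the hypothesis `h`.
example (a b : ℕ) (h : a = b) : a + 0 = b :=
begin
  rw nat.add_zero,  -- goal is now `a = b`
  exact h,
end
```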

Problem 1

Adapted from AMC12 2000 Problem 5

Prove that if $|x - 2| = p$, where $x < 2$, then $x - p = 2 - 2p$.

```
theorem amc12_2000_p5
  (x p : ℝ)
  (h₀ : x < 2)
  (h₁ : abs (x - 2) = p) :
  x - p = 2 - 2 * p :=
begin
  have h₂ : abs (x - 2) = -(x - 2), {
    apply abs_of_neg,
    linarith,
  },
  rw h₁ at h₂,
  linarith,
end
```

We observe that the capability to generate original mathematical terms required as arguments of tactics, which cannot be done without a neural language model, emerges from our training procedure. The proof below is an example of it: the proof step `use n + 1` (entirely generated by our models) proposes `n + 1` as a solution, the rest of the formal proof relying on the `ring_exp` tactic to verify that it is indeed valid.

Problem 2

Adapted from AMC12B 2020 Problem 6

For all integers $n ≥ 9$, prove that $((n + 2)! − (n + 1)!) / n!$ is a perfect square.

```
theorem amc12b_2020_p6
  (n : ℕ)
  (h₀ : 9 ≤ n) :
  ∃ x : ℕ, (x:ℝ)^2 =
    (nat.factorial (n + 2) - nat.factorial (n + 1))
    / nat.factorial n :=
begin
  use n + 1,
  field_simp [nat.factorial_ne_zero, pow_succ'],
  ring_exp
end
```

We also observe that our models and search procedure are capable of producing proofs that chain multiple non-trivial reasoning steps. In the proof below, the model starts by using contraposition, leading to the existential statement (`∃ (x : ℝ), f x ≠ a * x + b`). It then generates a witness for it with `use (0 : ℝ)` and finishes the proof by leveraging the `norm_num` tactic.

Problem 3

Adapted from the MATH dataset

Let $f(x) = Ax + B$ and $g(x) = Bx + A$, where $A \ne B$. If $f(g(x)) - g(f(x)) = B - A$, prove that $A + B = 0$.

```
theorem mathd_train_algebra_217
  (a b : ℝ)
  (f g : ℝ → ℝ)
  (h₀ : ∀ x, f x = a * x + b)
  (h₁ : ∀ x, g x = b * x + a)
  (h₂ : a ≠ b)
  (h₃ : ∀ x, f (g x) - g (f x) = b - a) :
  a + b = 0 :=
begin
  revert h₀ h₁ h₂ h₃,
  contrapose!,
  rintro ⟨h₀, ⟨h₁, h₂⟩⟩,
  use (0 : ℝ),
  simp only [sub_eq_iff_eq_add, h₀, mul_zero, zero_add],
  norm_num at h₀,
end
```

Our models, trained with *statement curriculum learning*, were able to close a variety of problems from training textbooks as well as AMC12 and AIME competitions, and 2 problems adapted from the IMO. We present below three examples of such generated proofs.

Problem 4

Adapted from IMO 1964 Problem 2

Suppose $a$, $b$, $c$ are the sides of a triangle.

Prove that $a^2(b + c − a) + b^2(c + a − b) + c^2(a + b − c) \leq 3abc$.

```
theorem imo_1964_p2
  (a b c : ℝ)
  (h₀ : 0 < a ∧ 0 < b ∧ 0 < c)
  (h₁ : c < a + b)
  (h₂ : b < a + c)
  (h₃ : a < b + c) :
  a^2 * (b + c - a) + b^2 * (c + a - b) + c^2 * (a + b - c) ≤ 3 * a * b * c :=
begin
  nlinarith [sq_nonneg (b - a),
             sq_nonneg (c - b),
             sq_nonneg (c - a)]
end
```

Problem 5

Adapted from AIME 1984 Problem 1

Prove that $a_2 + a_4 + a_6 + a_8 + \dots + a_{98} = 93$ if $a_1$, $a_2$, $a_3, \dots$ is an arithmetic progression with common difference $1$, and $a_1 + a_2 + a_3 + \dots + a_{98} = 137$.

```
theorem aime_1984_p1
  (u : ℕ → ℚ)
  (h₀ : ∀ n, u (n + 1) = u n + 1)
  (h₁ : ∑ k in finset.range 98, u k.succ = 137) :
  ∑ k in finset.range 49, u (2 * k.succ) = 93 :=
begin
  rw finset.sum_eq_multiset_sum,
  dsimp [finset.range] at h₁,
  simp [h₀],
  ring,
  norm_num at h₁,
  norm_num,
  apply eq_of_sub_eq_zero,
  { simp only [*, abs_of_pos, add_zero] at *, linarith },
end
```

Problem 6

Adapted from IMO Longlist 1990 Problem 77

For $a, b, c$ reals, prove that $(a^2 + ab + b^2)(b^2 + bc + c^2)(c^2 + ca + a^2) \geq (ab + bc + ca)^3$.

```
theorem imo_longlist_1990_p77
  (a b c : ℝ) :
  (a * b + b * c + c * a)^3 ≤
    (a^2 + a * b + b^2) * (b^2 + b * c + c^2) *
    (c^2 + c * a + a^2) :=
begin
  let u : euclidean_space ℝ (fin 2) := ![a, b],
  let v : euclidean_space ℝ (fin 2) := ![b, c],
  have h₀ := real_inner_mul_inner_self_le u v,
  simp [u, v, fin.sum_univ_succ,
        ←pow_two, ←pow_two, le_of_lt, mul_assoc] at h₀,
  have h₃ : 0 ≤ (c + a) * (c + a),
  { nlinarith, },
  have h₄ := sq_nonneg (a * b + b * c + c * a),
  simp [sq, h₀, h₃, mul_add, add_mul] at h₄ ⊢,
  nlinarith [sq_nonneg (b - a),
             sq_nonneg (c - b),
             sq_nonneg (a - c)]
end
```

Formal mathematics involves two main challenges that make a naive application of reinforcement learning unlikely to succeed.

- (i) **Infinite action space**: not only does formal mathematics have an extremely large search space (like Go, for example), it also has an infinite action space. At each step of a proof search, the model must choose not from a well-behaved finite set of actions, but from a complex and infinite set of tactics, involving exogenous mathematical terms that have to be generated (e.g., generating a mathematical statement to be used as a witness, an object used in steps such as "there exists an $x$ s.t. ...", or a cut, the introduction and chaining of a lemma in the middle of a proof).
- (ii) **Lack of self-play**: conversely to 2-player games, a prover is not playing against an opponent but against a set of statements to prove. When faced with a statement that is simply too hard, there is no obvious reframing that will let the prover generate intermediary easier statements to tackle first. This asymmetry prevents naive application of the self-play algorithms that were successful with 2-player games.

In our work, we address the infinite action space problem by sampling actions from a language model as we search for a proof. Language models have the capability to generate the tactic calls as well as the original mathematical terms often required as arguments. Our basis for addressing the lack of self-play is the observation that the key role of self-play in 2-player games is to provide an unsupervised curriculum. We propose to replace this unsupervised curriculum with an auxiliary set of problem statements (without requiring proofs) of varying difficulty. We empirically show that, when the difficulty of these auxiliary problems is varied enough, our training procedure is able to solve a curriculum of increasingly difficult problems, eventually generalizing to the set of problems we care about.
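As a rough illustration of this training loop (a hypothetical sketch, not our actual prover: a scalar `skill` stands in for the fine-tuned language model, and each statement is reduced to a single numeric difficulty):

```python
# Hypothetical sketch of statement curriculum learning. A scalar
# `skill` stands in for the model; a number stands in for a statement.

def prove(skill, difficulty):
    # Toy stand-in for proof search: succeeds iff the current model
    # is strong enough for this statement.
    return skill >= difficulty

def statement_curriculum(difficulties, iterations=8):
    skill, solved = 1.0, set()
    for _ in range(iterations):
        # Search for proofs of the still-unsolved statements.
        newly = {d for d in difficulties
                 if d not in solved and prove(skill, d)}
        solved |= newly
        # Stand-in for re-training on the newly discovered proofs.
        skill += len(newly)
    return solved

print(sorted(statement_curriculum(range(1, 21))))
```

In this toy setting, with difficulties 1 through 20 and 8 iterations, each round of "re-training" unlocks the next difficulty level, so the loop ends up solving statements 1 through 8; without statements of intermediate difficulty in the pool, progress would stall at the first gap.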

While these results are extremely exciting, as they demonstrate that deep learning models are capable of non-trivial mathematical reasoning when interacting with a formal system, we are still very far from best-student performance on these competitions, only occasionally, rather than consistently, closing challenging olympiad problems. We hope nonetheless that our work will motivate research in this domain, in particular towards the IMO Grand Challenge, and that the *statement curriculum learning* methodology we propose will help accelerate progress in automated reasoning in general.