An autonomous agent that solves competitive programming problems through a closed multi-model feedback loop — a GPT-4/5 generator paired with a specialized debugging critic (DeepSeek-R1, Llama-3.3-70B, or Codestral-2508) — submitted to the live Codeforces judge for up to three refinement iterations.
Can role-separated, iterative multi-model feedback reliably improve LLM code generation beyond what a single model can achieve alone?
Separate solution generation from debugging using specialized critics with persistent conversation context maintained across all refinement iterations.
167 ICPC World Finals + 200 Codeforces problems, submitted to the live online judge — binary, automated correctness with no human evaluation.
Stateful multi-model feedback achieves 2.2–2.3× greater gains than multi-round stateless refinement, confirming that accumulated context — not just more compute — drives improvement.
How much do solve rates improve when LLMs iteratively refine solutions using live judge feedback across up to three iterations (Itr1–Itr3)?
Do reasoning-focused (DeepSeek-R1), general-purpose (Llama-3.3-70B), or code-specialized (Codestral-2508) critics provide more effective debugging hints?
Does maintaining full conversation history across iterations reduce error repetition and improve solve rates compared to stateless (context-reset) refinement?
Do code-specialized, general-purpose, and reasoning-focused critics show differential performance across algorithmic categories (Graph Theory, DP, Math, Geometry, etc.)?
How does A-ProS compare to simpler baselines — zero-shot, single-round stateless, and multi-round stateless — on the same 47-problem subset?
167 elite-level problems from 14 years of ICPC World Finals, submitted to live Codeforces Gym judge
200 problems (rating 1200–1800) from recent contests — cumulative solve rate by iteration
| Workflow | ICPC Itr0 | ICPC Itr3 | Improvement | CF Itr3 |
|---|---|---|---|---|
| GPT-5 + DeepSeek-R1 | 39 (23.4%) | 90 (53.9%) | +131% | 41.0% |
| GPT-5 + Llama-3.3-70B | 39 (23.4%) | 87 (52.1%) | +123% | 38.5% |
| GPT-5 + Codestral-2508 | 39 (23.4%) | 85 (50.9%) | +118% | 36.5% |
| GPT-4 + DeepSeek-R1 | 15 (9.0%) | 38 (22.8%) | +153% | 26.0% |
| GPT-4 + Llama-3.3-70B | 15 (9.0%) | 34 (20.4%) | +127% | 23.5% |
| GPT-4 + Codestral-2508 | 15 (9.0%) | 31 (18.6%) | +107% | 21.0% |
Controlled paired experiment on 47 stratified Codeforces problems — stateful (A-ProS) vs. stateless (context reset each iteration)
| Metric | Stateful | Stateless | Δ |
|---|---|---|---|
| Itr0 | 21.3% | 21.3% | 0.0 pp |
| Itr3 | 40.4% | 29.8% | +10.6 pp |
| Error repeat | 11.8% | 41.8% | −30.0 pp |
| Metric | Stateful | Stateless | Δ |
|---|---|---|---|
| Itr0 | 10.6% | 10.6% | 0.0 pp |
| Itr3 | 25.5% | 17.0% | +8.5 pp |
| Error repeat | 14.2% | 40.8% | −26.6 pp |
| Approach | GPT-5 gain | GPT-4 gain |
|---|---|---|
| Single-round stateless | +10% | +21% |
| Multi-round stateless | +40% | +60% |
| A-ProS stateful | +90% | +141% |
A-ProS achieves 2.2–2.3× greater gains than multi-round stateless, confirming that accumulated conversational memory — not just additional iterations — drives improvement.
A-ProS achieves competitive performance with AlphaCode-class systems while using ~10,000× fewer attempts (4 submissions vs. 1M samples).
Stateful A-ProS achieves 2.2–2.3× greater gains over zero-shot compared to multi-round stateless refinement with the same iteration budget, confirming that accumulated memory — not just more compute — is what matters.
DeepSeek-R1 (reasoning-focused) consistently outperforms both Codestral-2508 (code-specialized) and Llama-3.3-70B (general-purpose), achieving 2–7 more solved ICPC problems at Itr3.
GPT-4 shows larger relative improvement (+153%) compared to GPT-5 (+131%) on ICPC. Superior debugging feedback partially compensates for weaker generation — critic effects are more pronounced at lower generator tiers.
Even the weakest GPT-5 workflow outperforms the strongest GPT-4 workflow. Generator capability is the primary performance driver; critic choice provides a secondary but consistent effect.
The DeepSeek-R1 > Llama-3.3-70B > Codestral-2508 ranking holds across all problem categories (Graph Theory, DP, Math, Geometry) and both generators, indicating stable specialization effects.
Reaches 95% of AlphaCode 2 performance through systematic debugging of a single solution trajectory rather than massive parallel sampling and filtering.
A-ProS is an autonomous AI agent that solves competitive programming problems by combining two types of language models in a closed feedback loop. The key design decision is separation of roles: generation and debugging are handled by different, independently specialized models rather than asking a single model to do both.
All pairwise comparisons use McNemar's exact test for paired binary outcomes, with Holm–Bonferroni correction across 6 comparisons per analysis. Effect sizes reported as Cohen's h (appropriate for binary proportions). Bootstrap 95% CIs use 10,000 stratified resamples. The ablation study (RQ3) uses n = 47 stratified problems with stateful vs. stateless conditions perfectly paired at Itr0.