ACM TOSEM

A-ProS: Towards Reliable Autonomous Programming
Through Multi-Model Feedback

Anika Tabassum*  ·  Md Sifat Hossain*  ·  Md. Fahim Arefin (University of Dhaka)   ·   Tariqul Islam  ·  Tarannum Shaila Zaman (UMBC)
* Equal contribution

An autonomous agent that solves competitive programming problems through a closed multi-model feedback loop: a GPT-4 or GPT-5 generator is paired with a specialized debugging critic (DeepSeek-R1, Llama-3.3-70B, or Codestral-2508), and every solution is submitted to the live Codeforces judge, with up to three refinement iterations after the initial attempt.

367
Problems Evaluated
6
Workflow Combinations
+131%
Improvement, Best Workflow (ICPC)
2.3×
Gain vs. Stateless

Research Overview

🎯

Problem Statement

Can role-separated, iterative multi-model feedback reliably improve LLM code generation beyond what a single model can achieve alone?

🔄

Approach

Separate solution generation from debugging using specialized critics with persistent conversation context maintained across all refinement iterations.

📊

Evaluation

167 ICPC World Finals and 200 Codeforces problems, all submitted to the live online judge. Verdicts are binary and automated, with no human evaluation.

Key Finding

Stateful multi-model feedback achieves 2.2–2.3× greater gains than multi-round stateless refinement, confirming that accumulated context — not just more compute — drives improvement.

Research Questions

RQ1

Iterative Improvement

How much do solve rates improve when LLMs iteratively refine solutions using live judge feedback across up to three iterations (Itr1–Itr3)?

RQ2

Critic Specialization

Do reasoning-focused (DeepSeek-R1), general-purpose (Llama-3.3-70B), or code-specialized (Codestral-2508) critics provide more effective debugging hints?

RQ3

Persistent Context

Does maintaining full conversation history across iterations reduce error repetition and improve solve rates compared to stateless (context-reset) refinement?

RQ4

Category Effectiveness

Do code-specialized, general-purpose, and reasoning-focused critics show differential performance across algorithmic categories (Graph Theory, DP, Math, Geometry, etc.)?

RQ5

Baseline Comparison

How does A-ProS compare to simpler baselines — zero-shot, single-round stateless, and multi-round stateless — on the same 47-problem subset?

Performance Results

ICPC World Finals Problems (2011–2024)

167 elite-level problems from 14 years of ICPC World Finals, submitted to the live Codeforces Gym judge

90/167
GPT-5 + DeepSeek-R1 at Itr3
+131% over Itr0
38/167
GPT-4 + DeepSeek-R1 at Itr3
+153% over Itr0
h ≈ 0.44
Cohen's h (Itr3 vs Itr0)
Small effect, approaching medium

Codeforces Contest Problems

200 problems (rating 1200–1800) from recent contests — cumulative solve rate by iteration

41.0%
GPT-5 + DeepSeek-R1 at Itr3
Best workflow (82/200)
26.0%
GPT-4 + DeepSeek-R1 at Itr3
+174% over Itr0 (52/200)
VC = 7.6
Verification Cost (GPT-5 + DeepSeek)
Attempts per solved problem
| Workflow | ICPC Itr0 | ICPC Itr3 | Improvement | CF Itr3 |
| --- | --- | --- | --- | --- |
| GPT-5 + DeepSeek-R1 | 39 (23.4%) | 90 (53.9%) | +131% | 41.0% |
| GPT-5 + Llama-3.3-70B | 39 (23.4%) | 87 (52.1%) | +123% | 38.5% |
| GPT-5 + Codestral-2508 | 39 (23.4%) | 85 (50.9%) | +118% | 36.5% |
| GPT-4 + DeepSeek-R1 | 15 (9.0%) | 38 (22.8%) | +153% | 26.0% |
| GPT-4 + Llama-3.3-70B | 15 (9.0%) | 34 (20.4%) | +127% | 23.5% |
| GPT-4 + Codestral-2508 | 15 (9.0%) | 31 (18.6%) | +107% | 21.0% |
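
The Improvement column is the relative gain in solved ICPC problems from Itr0 to Itr3. A minimal Python sketch that recomputes it from the solved counts in the table; the names `ICPC_RESULTS` and `relative_improvement` are illustrative and not taken from the A-ProS code:

```python
# Solved counts out of 167 ICPC problems, copied from the table above:
# workflow -> (solved at Itr0, solved at Itr3).
ICPC_TOTAL = 167
ICPC_RESULTS = {
    "GPT-5 + DeepSeek-R1": (39, 90),
    "GPT-5 + Llama-3.3-70B": (39, 87),
    "GPT-5 + Codestral-2508": (39, 85),
    "GPT-4 + DeepSeek-R1": (15, 38),
    "GPT-4 + Llama-3.3-70B": (15, 34),
    "GPT-4 + Codestral-2508": (15, 31),
}


def relative_improvement(itr0: int, itr3: int) -> float:
    """Relative gain of Itr3 over Itr0, in percent."""
    return 100.0 * (itr3 - itr0) / itr0


for workflow, (itr0, itr3) in ICPC_RESULTS.items():
    solve_rate = 100.0 * itr3 / ICPC_TOTAL
    gain = relative_improvement(itr0, itr3)
    print(f"{workflow}: {solve_rate:.1f}% solved at Itr3, {gain:+.0f}% over Itr0")
```

Because the gain is measured relative to the Itr0 count, a workflow with a low starting point (GPT-4, 15 solved) can show a larger percentage than one that solves more problems in absolute terms.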

Ablation Study: Context Matters

Controlled paired experiment on 47 stratified Codeforces problems — stateful (A-ProS) vs. stateless (context reset each iteration)

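GPT-5 + DeepSeek-R1

| Metric | Stateful | Stateless | Δ |
| --- | --- | --- | --- |
| Itr0 | 21.3% | 21.3% | 0.0 pp |
| Itr3 | 40.4% | 29.8% | +10.6 pp |
| Error repeat | 11.8% | 41.8% | −30.0 pp |

GPT-4 + DeepSeek-R1

| Metric | Stateful | Stateless | Δ |
| --- | --- | --- | --- |
| Itr0 | 10.6% | 10.6% | 0.0 pp |
| Itr3 | 25.5% | 17.0% | +8.5 pp |
| Error repeat | 14.2% | 40.8% | −26.6 pp |

RQ5: Baseline Comparison

| Approach | GPT-5 gain | GPT-4 gain |
| --- | --- | --- |
| Single-round stateless | +10% | +21% |
| Multi-round stateless | +40% | +60% |
| A-ProS stateful | +90% | +141% |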

A-ProS achieves 2.2–2.3× greater gains than multi-round stateless, confirming that accumulated conversational memory — not just additional iterations — drives improvement.
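
The 2.2–2.3× figure can be checked from the ablation solve rates. The sketch below assumes that the stateless Itr3 column corresponds to the multi-round stateless baseline (its relative gains match the +40% and +60% reported above); the names `ABLATION`, `relative_gain`, and `gain_ratio` are illustrative:

```python
# Cumulative solve rates (%) from the ablation tables, n = 47 problems.
ABLATION = {
    "GPT-5 + DeepSeek-R1": {"itr0": 21.3, "stateful": 40.4, "stateless": 29.8},
    "GPT-4 + DeepSeek-R1": {"itr0": 10.6, "stateful": 25.5, "stateless": 17.0},
}


def relative_gain(itr0: float, itr3: float) -> float:
    """Gain of the Itr3 solve rate relative to the Itr0 solve rate."""
    return (itr3 - itr0) / itr0


def gain_ratio(row: dict) -> float:
    """How many times larger the stateful gain is than the stateless gain."""
    stateful = relative_gain(row["itr0"], row["stateful"])
    stateless = relative_gain(row["itr0"], row["stateless"])
    return stateful / stateless


for workflow, row in ABLATION.items():
    print(f"{workflow}: stateful gain is {gain_ratio(row):.2f}x the stateless gain")
```

Both conditions share the same Itr0 rate, so the ratio reduces to the ratio of percentage-point improvements (for example 19.1 pp vs. 8.5 pp for GPT-5).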

State-of-the-Art Comparison

A-ProS performs competitively with AlphaCode-class systems while using ~10,000× fewer attempts (4 submissions vs. 1M samples).

A-ProS (Ours)

41.0%
GPT-5 + DeepSeek-R1
4 submissions, iterative refinement

AlphaCode 2

43%
Gemini-based
1M samples, filtering

AlphaCode 1

34%
Original system
1M samples, ≤1300 rating

Key Insights

🧠

Context Accumulation is the Key Driver

Stateful A-ProS achieves 2.2–2.3× greater gains over zero-shot compared to multi-round stateless refinement with the same iteration budget, confirming that accumulated memory — not just more compute — is what matters.

Error repetition: 12% vs. 42%
🎯

Reasoning Beats Specialization

DeepSeek-R1 (reasoning-focused) consistently outperforms both Codestral-2508 (code-specialized) and Llama-3.3-70B (general-purpose), achieving 3–7 more solved ICPC problems at Itr3.

Ranking: DeepSeek > Llama > Codestral
📈

Weaker Generators Benefit More

GPT-4 shows larger relative improvement (+153%) compared to GPT-5 (+131%) on ICPC. Superior debugging feedback partially compensates for weaker generation — critic effects are more pronounced at lower generator tiers.

Critic spread (best vs. worst critic at ICPC Itr3): 22.6% (GPT-4) vs. 5.9% (GPT-5)
⚖️

Generator Dominates Critic

Even the weakest GPT-5 workflow outperforms the strongest GPT-4 workflow. Generator capability is the primary performance driver; critic choice provides a secondary but consistent effect.

h = 0.68 generator gap (medium effect)
🏆

Consistent Critic Ranking

The DeepSeek-R1 > Llama-3.3-70B > Codestral-2508 ranking holds across all problem categories (Graph Theory, DP, Math, Geometry) and both generators, indicating stable specialization effects.

Stable across 12 comparisons
💡

Efficiency vs. Sampling

Reaches 95% of AlphaCode 2 performance through systematic debugging of a single solution trajectory rather than massive parallel sampling and filtering.

~10,000× fewer attempts

About This Research

A-ProS: Autonomous Programming with Specialized Critics

A-ProS is an autonomous AI agent that solves competitive programming problems by combining two types of language models in a closed feedback loop. The key design decision is separation of roles: generation and debugging are handled by different, independently specialized models rather than asking a single model to do both.

System Architecture

  • Solution Generators: GPT-4 and GPT-5 write C++17 solutions (temperature 0.1)
  • Debugging Critics: DeepSeek-R1-0528, Llama-3.3-70B (Groq), Codestral-2508 (Mistral AI)
  • Persistent Context: Full conversation history maintained across all iterations — the key feature studied in RQ3
  • Live Judge: All solutions submitted to Codeforces; verdicts drive the feedback loop
  • Confidence Scoring: Critics append a confidence score (1–5) to each hint; expected calibration error (ECE) measures calibration quality
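
As a rough illustration of the confidence-scoring item above, the sketch below computes ECE with one bin per discrete score. It rests on two assumptions that the page does not state: a score s is read as the probability s / 5, and a hint counts as "correct" when it leads to an accepted solution. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def expected_calibration_error(
    scores: Sequence[int], outcomes: Sequence[bool], max_score: int = 5
) -> float:
    """ECE with one bin per confidence score.

    ASSUMPTION: a score s in 1..max_score is mapped to probability s / max_score.
    ASSUMPTION: outcomes[i] is True when hint i led to an accepted solution.
    """
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal length")

    # Group the binary outcomes by the confidence score attached to each hint.
    bins: Dict[int, List[float]] = defaultdict(list)
    for score, ok in zip(scores, outcomes):
        bins[score].append(1.0 if ok else 0.0)

    total = len(scores)
    ece = 0.0
    for score, hits in bins.items():
        confidence = score / max_score
        accuracy = sum(hits) / len(hits)
        # Weight each bin's |accuracy - confidence| gap by its share of hints.
        ece += (len(hits) / total) * abs(accuracy - confidence)
    return ece
```

An ECE of 0 means the stated confidence matches the observed success rate in every bin; larger values indicate over- or under-confident critics.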

Evaluation Dataset

  • 167 ICPC World Finals problems (2011–2024) via Codeforces Gym
  • 200 Codeforces problems (rating 1200–1800) from recent contests
  • Hidden test suites prevent shortcut heuristics; binary verdicts require no human evaluation
  • Max 4 submissions per problem: 1 zero-shot (Itr0) + 3 feedback iterations (Itr1–Itr3)
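
The submission budget and role separation described above can be sketched as a short loop. This is a simplified illustration, not the A-ProS implementation: `generate`, `critique`, and `judge` are hypothetical callables standing in for the generator model, the critic model, and the online judge, the verdict string `"ACCEPTED"` is a placeholder, and the `stateful=False` branch is one possible reading of "context reset each iteration".

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text); roles here are illustrative labels


def solve(
    problem: str,
    generate: Callable[[List[Message]], str],
    critique: Callable[[List[Message]], str],
    judge: Callable[[str], str],
    max_feedback_iters: int = 3,
    stateful: bool = True,
) -> Tuple[bool, int]:
    """Return (solved, submissions_used) for one problem."""
    history: List[Message] = [("problem", problem)]
    code = generate(history)  # Itr0: zero-shot attempt
    for attempt in range(max_feedback_iters + 1):
        verdict = judge(code)
        if verdict == "ACCEPTED":
            return True, attempt + 1
        if attempt == max_feedback_iters:
            break  # budget of 1 + max_feedback_iters submissions is spent
        # Record the failed attempt, then ask the critic for a debugging hint.
        history += [("solution", code), ("verdict", verdict)]
        history.append(("hint", critique(history)))
        code = generate(history)
        if not stateful:
            # Stateless ablation: forget everything except the problem text.
            history = [("problem", problem)]
    return False, max_feedback_iters + 1


# Toy stand-ins: the generator only succeeds once it has seen two hints.
def demo_generate(history: List[Message]) -> str:
    return "v" + str(sum(1 for role, _ in history if role == "hint"))


def demo_critique(history: List[Message]) -> str:
    return "check integer overflow"


def demo_judge(code: str) -> str:
    return "ACCEPTED" if code == "v2" else "WRONG_ANSWER"
```

With these toy stand-ins the stateful run accumulates two hints and is accepted on its third submission, while the stateless run never sees more than one hint at a time and exhausts all four submissions.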

Statistical Framework

All pairwise comparisons use McNemar's exact test for paired binary outcomes, with Holm–Bonferroni correction across 6 comparisons per analysis. Effect sizes are reported as Cohen's h, which is appropriate for binary proportions. Bootstrap 95% CIs use 10,000 stratified resamples. The ablation study (RQ3) uses n = 47 stratified problems, with the stateful and stateless conditions perfectly paired at Itr0.
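
For readers unfamiliar with these statistics, here is a standard-library Python sketch of the three named procedures using their textbook definitions. It is not the authors' analysis code, and the function names are illustrative. The inputs `b` and `c` are the discordant pair counts (problems solved by only one of the two workflows being compared).

```python
import math
from typing import List


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c.

    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_bonferroni(p_values: List[float]) -> List[float]:
    """Holm-adjusted p-values, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted
```

A typical use would compute one `mcnemar_exact` p-value per workflow pair, pass the six values to `holm_bonferroni`, and report `cohens_h` alongside each comparison as the effect size.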