ACM TOSEM

A-ProS: Towards Reliable Autonomous Programming
Through Multi-Model Feedback

Anika Tabassum* · Md Sifat Hossain* · Md. Fahim Arefin (University of Dhaka) · Tariqul Islam · Tarannum Shaila Zaman (UMBC)
* Equal contribution

A-ProS is an autonomous agent that solves competitive programming problems through a closed multi-model feedback loop: a GPT-4 or GPT-5 generator is paired with a specialized debugging critic (DeepSeek-R1, Llama-3.3-70B, or Codestral-2508), and every solution is submitted to the live Codeforces judge, with up to three refinement iterations.

367

Problems Evaluated

6

Workflow Combinations

+131%

Max Improvement (ICPC)

2.3×

Gain vs. Stateless

Research Overview

🎯

Problem Statement

Can role-separated, iterative multi-model feedback reliably improve LLM code generation beyond what a single model can achieve alone?

🔄

Approach

Separate solution generation from debugging using specialized critics with persistent conversation context maintained across all refinement iterations.

📊

Evaluation

167 ICPC World Finals + 200 Codeforces problems, submitted to the live online judge — binary, automated correctness with no human evaluation.

✨

Key Finding

Stateful multi-model feedback achieves 2.2–2.3× greater gains than multi-round stateless refinement, confirming that accumulated context — not just more compute — drives improvement.

Research Questions

RQ1

Iterative Improvement

How much do solve rates improve when LLMs iteratively refine solutions using live judge feedback across up to three iterations (Itr₁–Itr₃)?

RQ2

Critic Specialization

Do reasoning-focused (DeepSeek-R1), general-purpose (Llama-3.3-70B), or code-specialized (Codestral-2508) critics provide more effective debugging hints?

RQ3

Persistent Context

Does maintaining full conversation history across iterations reduce error repetition and improve solve rates compared to stateless (context-reset) refinement?

RQ4

Category Effectiveness

Do code-specialized, general-purpose, and reasoning-focused critics show differential performance across algorithmic categories (Graph Theory, DP, Math, Geometry, etc.)?

RQ5

Baseline Comparison

How does A-ProS compare to simpler baselines — zero-shot, single-round stateless, and multi-round stateless — on the same 47-problem subset?

Performance Results

ICPC World Finals Problems (2011–2024)

167 elite-level problems from 14 years of ICPC World Finals, submitted to the live Codeforces Gym judge

90/167

GPT-5 + DeepSeek-R1 at Itr₃

+131% over Itr₀

38/167

GPT-4 + DeepSeek-R1 at Itr₃

+153% over Itr₀

h ≈ 0.44

Cohen's h (Itr₃ vs Itr₀)

Small effect, approaching medium

Codeforces Contest Problems

200 problems (rating 1200–1800) from recent contests — cumulative solve rate by iteration

41.0%

GPT-5 + DeepSeek-R1 at Itr₃

Best workflow (82/200)

26.0%

GPT-4 + DeepSeek-R1 at Itr₃

+174% over Itr₀ (52/200)

VC = 7.6

Verification Cost (GPT-5 + DeepSeek)

Attempts per solved problem

Workflow	ICPC Itr₀	ICPC Itr₃	Improvement	CF Itr₃
GPT-5 + DeepSeek-R1	39 (23.4%)	90 (53.9%)	+131%	41.0%
GPT-5 + Llama-3.3-70B	39 (23.4%)	87 (52.1%)	+123%	38.5%
GPT-5 + Codestral-2508	39 (23.4%)	85 (50.9%)	+118%	36.5%
GPT-4 + DeepSeek-R1	15 (9.0%)	38 (22.8%)	+153%	26.0%
GPT-4 + Llama-3.3-70B	15 (9.0%)	34 (20.4%)	+127%	23.5%
GPT-4 + Codestral-2508	15 (9.0%)	31 (18.6%)	+107%	21.0%
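
The Improvement column can be reproduced from the solved counts in the table. The following is a minimal sketch of that arithmetic; the function names are illustrative and not taken from the A-ProS code, and the counts are copied from the rows above.

```python
# Solved counts (Itr0, Itr3) out of 167 ICPC problems, copied from the table.
ICPC_TOTAL = 167

results = {
    "GPT-5 + DeepSeek-R1": (39, 90),
    "GPT-5 + Llama-3.3-70B": (39, 87),
    "GPT-5 + Codestral-2508": (39, 85),
    "GPT-4 + DeepSeek-R1": (15, 38),
    "GPT-4 + Llama-3.3-70B": (15, 34),
    "GPT-4 + Codestral-2508": (15, 31),
}


def solve_rate(solved: int, total: int = ICPC_TOTAL) -> float:
    """Solve rate as a percentage of all problems."""
    return 100.0 * solved / total


def relative_improvement(itr0: int, itr3: int) -> float:
    """Percent gain in solved problems at Itr3 relative to Itr0."""
    return 100.0 * (itr3 - itr0) / itr0


for name, (itr0, itr3) in results.items():
    print(
        f"{name}: {solve_rate(itr0):.1f}% -> {solve_rate(itr3):.1f}% "
        f"({relative_improvement(itr0, itr3):+.0f}%)"
    )
```

Because the improvement is relative to the Itr₀ count, a workflow that starts from fewer solved problems can show a larger percentage gain while still solving fewer problems overall.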

Ablation Study: Context Matters

Controlled paired experiment on 47 stratified Codeforces problems — stateful (A-ProS) vs. stateless (context reset each iteration)

GPT-5 + DeepSeek-R1

Metric	Stateful	Stateless	Δ
Itr₀	21.3%	21.3%	0.0 pp
Itr₃	40.4%	29.8%	+10.6 pp
Error repeat	11.8%	41.8%	−30.0 pp

GPT-4 + DeepSeek-R1

Metric	Stateful	Stateless	Δ
Itr₀	10.6%	10.6%	0.0 pp
Itr₃	25.5%	17.0%	+8.5 pp
Error repeat	14.2%	40.8%	−26.6 pp

RQ5 — Baseline Comparison

Approach	GPT-5 gain	GPT-4 gain
Single-round stateless	+10%	+21%
Multi-round stateless	+40%	+60%
A-ProS stateful	+90%	+141%

A-ProS achieves 2.2–2.3× greater gains than multi-round stateless, confirming that accumulated conversational memory — not just additional iterations — drives improvement.

State-of-the-Art Comparison

A-ProS performs competitively with AlphaCode-class systems while using over 10,000× fewer attempts (4 submissions vs. 1M samples).

A-ProS (Ours)

41.0%

GPT-5 + DeepSeek-R1

4 submissions, iterative refinement

AlphaCode 2

43%

Gemini-based

1M samples, filtering

AlphaCode 1

34%

Original system

1M samples, ≤1300 rating

Key Insights

🧠

Context Accumulation is the Key Driver

Stateful A-ProS achieves 2.2–2.3× greater gains over zero-shot than multi-round stateless refinement does with the same iteration budget, confirming that accumulated memory, not just more compute, is what matters.

Error repetition: 12% vs. 42%

🎯

Reasoning Beats Specialization

DeepSeek-R1 (reasoning-focused) consistently outperforms both Llama-3.3-70B (general-purpose) and Codestral-2508 (code-specialized), solving 3–7 more ICPC problems at Itr₃ with the same generator.

Ranking: DeepSeek > Llama > Codestral

📈

Weaker Generators Benefit More

With the DeepSeek-R1 critic, GPT-4 shows a larger relative improvement on ICPC (+153%) than GPT-5 (+131%). Superior debugging feedback partially compensates for weaker generation, and critic effects are more pronounced at lower generator tiers.

Critic spread (best vs. worst critic, relative, ICPC Itr₃): 22.6% (GPT-4) vs. 5.9% (GPT-5)

⚖️

Generator Dominates Critic

Even the weakest GPT-5 workflow outperforms the strongest GPT-4 workflow. Generator capability is the primary performance driver; critic choice provides a secondary but consistent effect.

h = 0.68 generator gap (medium effect)

🏆

Consistent Critic Ranking

The DeepSeek-R1 > Llama-3.3-70B > Codestral-2508 ranking holds across all problem categories (Graph Theory, DP, Math, Geometry) and both generators, indicating stable specialization effects.

Stable across 12 comparisons

💡

Efficiency vs. Sampling

A-ProS reaches 95% of AlphaCode 2's performance (41.0% vs. 43%) through systematic debugging of a single solution trajectory rather than massive parallel sampling and filtering.

Over 10,000× fewer attempts

About This Research

A-ProS: Autonomous Programming with Specialized Critics

A-ProS is an autonomous AI agent that solves competitive programming problems by combining two types of language models in a closed feedback loop. The key design decision is separation of roles: generation and debugging are handled by different, independently specialized models rather than asking a single model to do both.
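
The closed loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the A-ProS implementation: `generate`, `critique`, and `judge` are hypothetical callables standing in for the generator model, the critic model, and the online judge, and the accepted verdict is assumed to be the string `"OK"`. The `stateful` flag contrasts persistent context with a context reset before each feedback round.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text); the roles here are illustrative


def refine(
    problem: str,
    generate: Callable[[List[Message]], str],  # hypothetical generator call
    critique: Callable[[List[Message]], str],  # hypothetical critic call
    judge: Callable[[str], str],               # hypothetical judge submission
    max_feedback_iters: int = 3,
    stateful: bool = True,
) -> Tuple[bool, int]:
    """Run Itr0 plus up to `max_feedback_iters` feedback rounds.

    Returns (solved, submissions_used).
    """
    history: List[Message] = [("problem", problem)]
    code = generate(history)  # Itr0: zero-shot attempt

    for attempt in range(max_feedback_iters + 1):
        verdict = judge(code)
        if verdict == "OK":  # assumed accepted-verdict string
            return True, attempt + 1
        if attempt == max_feedback_iters:
            break  # submission budget exhausted

        if not stateful:
            # Stateless variant: forget everything from earlier rounds.
            history = [("problem", problem)]

        # Record the failed attempt, ask the critic for a hint, regenerate.
        history += [("code", code), ("verdict", verdict)]
        history.append(("hint", critique(history)))
        code = generate(history)

    return False, max_feedback_iters + 1
```

In the stateful mode the generator and critic see every earlier attempt, verdict, and hint; in the stateless mode they see only the problem and the most recent failure, with the same submission budget of four.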

System Architecture

Solution Generators: GPT-4 and GPT-5 write C++17 solutions (temperature 0.1)
Debugging Critics: DeepSeek-R1-0528, Llama-3.3-70B (Groq), Codestral-2508 (Mistral AI)
Persistent Context: Full conversation history maintained across all iterations — the key feature studied in RQ3
Live Judge: All solutions submitted to Codeforces; verdicts drive the feedback loop
Confidence Scoring: Critics append a confidence score (1–5) to each hint; Expected Calibration Error (ECE) measures calibration quality
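
As an illustration of the calibration measure, the sketch below computes ECE with one bin per confidence score. Two points are assumptions rather than details from the paper: the mapping from a score s to a probability (here s / 5), and the definition of a "successful" hint (here a boolean supplied by the caller).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def expected_calibration_error(samples: Iterable[Tuple[int, bool]]) -> float:
    """ECE over (confidence score in 1..5, hint succeeded) pairs.

    Each score forms one bin. ECE is the size-weighted mean of
    |observed success rate - stated confidence| across bins.
    """
    bins: Dict[int, List[bool]] = defaultdict(list)
    for score, succeeded in samples:
        bins[score].append(succeeded)

    total = sum(len(outcomes) for outcomes in bins.values())
    if total == 0:
        raise ValueError("no samples given")

    ece = 0.0
    for score, outcomes in bins.items():
        confidence = score / 5.0  # assumed score-to-probability mapping
        accuracy = sum(outcomes) / len(outcomes)
        ece += (len(outcomes) / total) * abs(accuracy - confidence)
    return ece
```

A critic whose score-5 hints succeed 80% of the time contributes a gap of 0.2 for that bin; an ECE of 0 means stated confidence matches the observed success rate in every bin.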

Evaluation Dataset

167 ICPC World Finals problems (2011–2024) via Codeforces Gym
200 Codeforces problems (rating 1200–1800) from recent contests
Hidden test suites prevent shortcut heuristics; binary verdicts require no human evaluation
Max 4 submissions per problem: 1 zero-shot (Itr₀) + 3 feedback iterations (Itr₁–Itr₃)

Statistical Framework

All pairwise comparisons use McNemar's exact test for paired binary outcomes, with Holm–Bonferroni correction across 6 comparisons per analysis. Effect sizes are reported as Cohen's h, which is appropriate for differences between binary proportions. Bootstrap 95% CIs use 10,000 stratified resamples. The ablation study (RQ3) uses n = 47 stratified problems, with the stateful and stateless conditions perfectly paired at Itr₀.
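
For readers unfamiliar with these statistics, here is a minimal standard-library sketch of the three named in the framework. It uses the textbook definitions and is not the authors' analysis code; the page does not report discordant-pair counts, so any inputs passed to these functions are the reader's own.

```python
import math
from typing import List


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b and c are the discordant pair counts: problems solved only under
    condition A and only under condition B. Under the null hypothesis,
    b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_bonferroni(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

McNemar's test looks only at problems where the two conditions disagree, which is why it suits paired solve/fail outcomes on the same problem set. Holm's procedure controls the family-wise error rate across the six comparisons while being less conservative than a plain Bonferroni correction.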
