Let's Try Again: Eliciting Multi-Turn Reasoning in Language Models via Simplistic Feedback

1Imperial College London   |   2Northwestern University   |   3University of Washington   |   4IBM Research AI
This website is maintained by the first author, who co-designed and conducted the experiments, drafted the initial paper, and released the code/models. Currently applying to Ph.D. programs (Fall 2026).
UFO Method Teaser
An example of using a single-turn RL model for multi-turn problem solving. The model loses its multi-turn capability, producing identical reasoning chains across interaction turns after being told that its answer is incorrect.

Abstract

Large language models often struggle with multi-turn reasoning, especially when trained with single-turn reinforcement learning paradigms. We identify that the lack of effective feedback in multi-turn scenarios leads to repetitive and suboptimal responses. To address this, we propose Unary Feedback as Observation (UFO), a simple yet powerful method that leverages minimal user feedback, such as "Let's try again", to guide models during iterative problem solving. Our approach can be seamlessly integrated into existing RL frameworks and significantly improves multi-turn reasoning accuracy by 14%, while maintaining single-turn performance. UFO demonstrates that even the simplest feedback can unlock more robust and interactive reasoning abilities in language models.

Unique Answer Ratio Comparison
Figure 1. Comparison of the effective (unique) answer ratio (%) before and after RL training. Across single-turn RL methods and multiple model scales, the unique answer ratio consistently drops after training.

Method Overview

We address the challenge that single-turn reinforcement learning (RL) often fails to endow language models with multi-turn reasoning abilities, leading to repetitive and unadaptive responses. Our approach, Unary Feedback as Observation (UFO), reformulates multi-turn problem solving as a Markov Decision Process (MDP) using only static single-turn datasets and minimal feedback.

In UFO, the model interacts over multiple turns, receiving only negative feedback (e.g., "Try Again") after incorrect answers. The observation at each step concatenates the original question, all previous attempts, and their feedback, requiring the model to revise its reasoning based solely on a history of failed attempts. This design enables multi-turn training without the need for dense supervision or tool-augmented environments.
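
As a concrete illustration, the sketch below shows one way such an observation could be assembled at each turn. The function name, prompt formatting, and feedback string are illustrative assumptions for exposition, not the released implementation.

```python
# Minimal sketch of assembling a UFO-style observation at turn t.
# build_observation and UNARY_FEEDBACK are illustrative names, not the paper's code.

UNARY_FEEDBACK = "Your answer is incorrect. Let's try again."

def build_observation(question: str, past_attempts: list[str]) -> str:
    """Concatenate the original question, all previous attempts, and the
    unary feedback each attempt received, forming the MDP state."""
    parts = [f"Question: {question}"]
    for i, attempt in enumerate(past_attempts, start=1):
        parts.append(f"Attempt {i}: {attempt}")
        parts.append(f"Feedback: {UNARY_FEEDBACK}")
    parts.append("Please provide a revised answer.")
    return "\n".join(parts)

# Example: after two failed attempts, the model conditions on both of them.
obs = build_observation("What is 17 * 24?", ["398", "418"])
print(obs)
```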

We optimize the model using Proximal Policy Optimization (PPO), with two key reward shaping strategies: reward decay (to encourage early success) and a repetition penalty (to promote answer diversity). This framework allows the model to develop revision-aware, context-sensitive reasoning strategies, unlocking robust multi-turn capabilities from static datasets and minimal supervision.
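
The sketch below shows how these two shaping terms could be combined into a scalar reward. The decay base, penalty weight, and function name are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch of the two reward-shaping terms: exponential reward decay
# and a repetition penalty. gamma and lambda_rep are illustrative hyperparameters.

def shaped_reward(
    is_correct: bool,
    turn: int,                 # 1-indexed interaction turn of the current answer
    past_answers: list[str],
    current_answer: str,
    gamma: float = 0.8,        # decay base: earlier success earns a larger reward
    lambda_rep: float = 0.5,   # penalty weight for repeating a previous answer
) -> float:
    reward = 0.0
    if is_correct:
        # Reward decays with the turn index, encouraging early success.
        reward += gamma ** (turn - 1)
    if current_answer.strip() in {a.strip() for a in past_answers}:
        # Repetition penalty: discourage re-submitting an already-tried answer.
        reward -= lambda_rep
    return reward

# Example: a correct, non-repeated answer on turn 3 earns gamma**2 (~0.64).
print(shaped_reward(True, 3, ["398", "418"], "408"))
```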

UFO Multi-turn Training Framework
Figure 2. Multi-turn training framework of UFO showing how models learn from unary feedback during iterative problem-solving processes.

Key Results

Multi-Turn Reasoning Performance

We compare our multi-turn UFO model against a strong single-turn PPO baseline. For a fair comparison, the baseline is evaluated with 5 independent samples (Pass@5), while our model uses 5 sequential attempts with feedback (Succ@5). Success is recorded if any of the 5 responses is correct. We also analyze the impact of varying the maximum number of interaction turns during training.
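
The sketch below contrasts the two evaluation protocols; `sample_answer` and `answer_with_feedback` are hypothetical stand-ins for single-turn sampling and feedback-conditioned multi-turn generation, respectively.

```python
# Sketch of the two evaluation protocols: independent sampling (Pass@k)
# versus sequential attempts conditioned on prior failures (Succ@k).

def pass_at_k(sample_answer, question: str, reference: str, k: int = 5) -> bool:
    """Single-turn baseline: k independent samples; success if any is correct."""
    return any(sample_answer(question) == reference for _ in range(k))

def succ_at_k(answer_with_feedback, question: str, reference: str, k: int = 5) -> bool:
    """UFO protocol: k sequential attempts, each conditioned on earlier failures."""
    history: list[str] = []
    for _ in range(k):
        answer = answer_with_feedback(question, history)  # sees all failed attempts
        if answer == reference:
            return True
        history.append(answer)  # only unary "try again" feedback is appended
    return False
```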

UFO Performance Comparison
Figure 4. Multi-turn (5-turn) RL significantly outperforms the single-turn baseline, achieving higher success rates (Pass@5) with similar inference cost.
Multi-turn Training Process
Figure 5. Performance comparison when evaluating with 5 turns after training with different maximum turns (1, 5, and 10). Training with 5 turns yields the best performance, while increasing to 10 turns offers no significant gain.

Effectiveness of Unary Feedback

To further investigate the role of unary feedback, we compare model performance under different feedback availability conditions. In scenario (a), unary feedback is provided during both training and validation phases, while in scenario (b), unary feedback is available only during training but not at validation. The results show that access to unary feedback during both phases substantially improves validation success rate. In contrast, providing unary feedback solely during training does not yield improvements, indicating that the benefit of unary feedback is contingent on its availability at inference time.

Effectiveness of Unary Feedback
Figure 6. Success rate comparison under different unary feedback settings: (a) feedback in both training and validation; (b) feedback only in training.

Experimental Analysis

🚀 Multi-Turn RL Boosts Reasoning (Performance)

+14% success rate over the single-turn PPO baseline

Benefits generalize to both multi-turn and single-turn inference

Best results with 5-turn training; more turns yield diminishing returns

💡 Unary Feedback Matters (Ablation)

Feedback in both training and validation is crucial for improvement

Feedback only during training does not help at inference

⏱️ Reward Design (Efficiency)

Exponential reward decay decreases the average number of actions required to solve problems by ~10%

Encourages faster and more efficient problem solving

🔀 Diversity of Answers (Exploration)

Non-repetitive answer ratio increases from 79.7% to 92.8%

Multi-turn RL with UFO encourages answer diversity and strengthens robustness

Conclusion

In this work, we identify a key limitation of conventional single-turn reinforcement learning: its tendency to undermine multi-turn reasoning by encouraging repetitive and superficial responses. To overcome this, we introduce Unary Feedback as Observation (UFO)—a simple yet effective approach that incorporates minimal feedback into standard RL pipelines. UFO enables language models to recover and enhance both single-turn and multi-turn reasoning capabilities. Our experiments demonstrate a 14% improvement in multi-turn accuracy while maintaining single-turn performance. Furthermore, we show that integrating reward decay and repetition penalties fosters deeper reasoning, self-correction, and greater response diversity. Our method is lightweight, broadly applicable, and can be seamlessly integrated into existing RL training frameworks.

Acknowledgement

We thank the DeepSeek team for providing the DeepSeek-R1 model and early conceptual inspirations. We are grateful to the veRL team for their infrastructure support and the RAGEN team for their multi-turn RL framework.