Projects

FoVer: Training PRMs with Formal Verification Tools (2025)
Process Reward Models (PRMs) verify LLM reasoning at the step level. However, collecting accurate step-level labels for training PRMs is a bottleneck. We propose FoVer, an approach that uses formal verification tools such as Z3 and Isabelle to automatically annotate step-level error labels on LLM responses, without relying on human annotation. This data synthesis is feasible only for tasks compatible with these tools, but LLM-based PRMs trained on our data improve across a broad range of reasoning tasks.
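A minimal sketch of the underlying idea, using Z3's Python bindings: formalize a reasoning step and check whether it is entailed by the earlier steps, labeling it "correct" if so and "error" otherwise. The formalization below is a hand-written toy example, not FoVer's actual pipeline or prompt format.

```python
# Toy illustration: verify one reasoning step with Z3 and derive a step label.
from z3 import Ints, Solver, Not, unsat

x, y = Ints("x y")
premises = [x + y == 10, x - y == 4]   # earlier (assumed-correct) steps
step_claim = x == 7                    # the step being verified

solver = Solver()
solver.add(*premises)
solver.add(Not(step_claim))            # claim is entailed iff premises AND not-claim is unsatisfiable

step_label = "correct" if solver.check() == unsat else "error"
print(step_label)  # -> "correct", since the premises force x = 7, y = 3
```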
VisOnlyQA: LVLMs Still Struggle with Visual Perception of Geometric Information (2025)
VisOnlyQA is a benchmark for evaluating the geometric perception capabilities of Large Vision-Language Models (LVLMs). It consists of 12 tasks that ask about geometric properties (e.g., angle, size, and shape) in four categories of scientific figures: geometric shapes, charts, chemical structures, and 3D shapes. We demonstrate that LVLMs still often fail to accurately perceive basic geometric information in images.
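To illustrate the kind of question involved, the toy example below renders a simple figure with a known angle and pairs it with a multiple-choice question about that angle. This is not the benchmark's data-generation code; the file name, question wording, and answer choices are made up for illustration.

```python
# Toy example: a geometric figure with a known angle plus a multiple-choice question.
import math
import matplotlib.pyplot as plt

angle_deg = 50  # ground-truth angle at the vertex
vertex = (0.0, 0.0)
p1 = (1.0, 0.0)
p2 = (math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg)))

fig, ax = plt.subplots(figsize=(3, 3))
ax.plot([p1[0], vertex[0], p2[0]], [p1[1], vertex[1], p2[1]], "k-")
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("angle_figure.png", dpi=150)

question = {
    "image": "angle_figure.png",
    "question": "What is the angle at the vertex in this figure?",
    "choices": ["30 degrees", "50 degrees", "70 degrees", "90 degrees"],
    "answer": "50 degrees",  # an LVLM is scored on whether it picks this choice
}
print(question)
```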
Critical Survey of Self-Correction of LLMs (TACL 2024)
We critically survey a broad range of papers and discuss the conditions required for successful self-correction. Our survey indicates that (1) no prior work demonstrates successful self-correction with feedback from prompted LLMs, except for studies on tasks that are exceptionally well suited to self-correction, (2) self-correction works well in tasks where reliable external feedback is available, and (3) large-scale fine-tuning enables self-correction.
ReaLMistake: Evaluating LLMs at Detecting Errors in LLM Responses (COLM 2024)
ReaLMistake is a benchmark for evaluating error detection methods that detect errors in LLM responses. It includes errors made by GPT-4 and Llama 2 70B on three tasks (math word problem generation, fine-grained fact verification, and answerability classification). We find that LLMs still cannot reliably detect mistakes made by LLMs: strong models such as GPT-4 and Claude 3 detect these errors with very low recall, and all LLM-based error detectors perform much worse than humans.
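A minimal sketch of how an error detector could be scored on binary error labels of this kind: recall is the fraction of responses containing errors that the detector flags. Here `detect_error` is a hypothetical stand-in for a prompted LLM, and the benchmark items are made up for illustration, not taken from ReaLMistake.

```python
# Toy illustration: recall of an error detector on responses with gold error labels.
from typing import Callable

def recall(examples: list[dict], detect_error: Callable[[str, str], bool]) -> float:
    """Fraction of erroneous responses that the detector flags as erroneous."""
    with_errors = [ex for ex in examples if ex["has_error"]]
    flagged = sum(detect_error(ex["task_input"], ex["llm_response"]) for ex in with_errors)
    return flagged / len(with_errors)

# Toy data in the shape of (task input, LLM response, gold error label).
examples = [
    {"task_input": "Write a math word problem ...", "llm_response": "...", "has_error": True},
    {"task_input": "Is this claim supported ...", "llm_response": "...", "has_error": False},
]

# A trivial detector that never flags errors has recall 0 on this toy data.
print(recall(examples, lambda task_input, llm_response: False))
```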