Projects

FoVer: Training PRMs with Formal Verification Tools (2025)
Process Reward Models (PRMs) verify LLM reasoning at the step level. However, collecting accurate step-level labels for training PRMs is a bottleneck. We propose FoVer, an approach that uses formal verification tools such as Z3 and Isabelle to automatically annotate step-level error labels on LLM responses, without relying on human annotation. This data synthesis is feasible only for tasks compatible with these tools, but LLM-based PRMs trained on our data improve across a broad range of reasoning tasks.
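A minimal sketch of the underlying idea, using Z3's Python bindings: formalize a reasoning step and check whether it is entailed by the earlier steps, labeling it "correct" if so and "error" otherwise. The formalization below is a hand-written toy example, not FoVer's actual pipeline or prompt format.

```python
# Toy illustration: verify one reasoning step with Z3 and derive a step label.
from z3 import Ints, Solver, Not, unsat

x, y = Ints("x y")
premises = [x + y == 10, x - y == 4]   # earlier (assumed-correct) steps
step_claim = x == 7                    # the step being verified

solver = Solver()
solver.add(*premises)
solver.add(Not(step_claim))            # claim is entailed iff premises AND not-claim is unsatisfiable

step_label = "correct" if solver.check() == unsat else "error"
print(step_label)  # -> "correct", since the premises force x = 7, y = 3
```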
VisOnlyQA: LVLMs Still Struggle with Visual Perception of Geometric Information (2025)
VisOnlyQA is a benchmark for evaluating the geometric perception capabilities of Large Vision-Language Models (LVLMs). It consists of 12 tasks that ask about geometric properties (e.g., angle, size, and shape) in four categories of scientific figures: geometric shapes, charts, chemical structures, and 3D shapes. We demonstrate that LVLMs still often fail to accurately perceive basic geometric information in images.
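To illustrate the kind of question involved, the toy example below renders a simple figure with a known angle and pairs it with a multiple-choice question about that angle. This is not the benchmark's data-generation code; the file name, question wording, and answer choices are made up for illustration.

```python
# Toy example: a geometric figure with a known angle plus a multiple-choice question.
import math
import matplotlib.pyplot as plt

angle_deg = 50  # ground-truth angle at the vertex
vertex = (0.0, 0.0)
p1 = (1.0, 0.0)
p2 = (math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg)))

fig, ax = plt.subplots(figsize=(3, 3))
ax.plot([p1[0], vertex[0], p2[0]], [p1[1], vertex[1], p2[1]], "k-")
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("angle_figure.png", dpi=150)

question = {
    "image": "angle_figure.png",
    "question": "What is the angle at the vertex in this figure?",
    "choices": ["30 degrees", "50 degrees", "70 degrees", "90 degrees"],
    "answer": "50 degrees",  # an LVLM is scored on whether it picks this choice
}
print(question)
```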
Critical Survey of Self-Correction of LLMs (TACL 2024)
We critically survey a broad range of papers and discuss the conditions required for successful self-correction. Our survey indicates that (1) no prior work demonstrates successful self-correction with feedback from prompted LLMs, except for studies on tasks that are exceptionally well suited to self-correction, (2) self-correction works well in tasks where reliable external feedback is available, and (3) large-scale fine-tuning enables self-correction.
ReaLMistake: Evaluating LLMs at Detecting Errors in LLM Responses (COLM 2024)
ReaLMistake is a benchmark for evaluating error detection methods that detect errors in LLM responses. It includes errors made by GPT-4 and Llama 2 70B on three tasks (math word problem generation, fine-grained fact verification, and answerability classification). We find that LLMs still cannot reliably detect mistakes made by LLMs: strong models such as GPT-4 and Claude 3 detect these errors with very low recall, and all LLM-based error detectors perform much worse than humans.
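A minimal sketch of how an error detector could be scored on binary error labels of this kind: recall is the fraction of responses containing errors that the detector flags. Here `detect_error` is a hypothetical stand-in for a prompted LLM, and the benchmark items are made up for illustration, not taken from ReaLMistake.

```python
# Toy illustration: recall of an error detector on responses with gold error labels.
from typing import Callable

def recall(examples: list[dict], detect_error: Callable[[str, str], bool]) -> float:
    """Fraction of erroneous responses that the detector flags as erroneous."""
    with_errors = [ex for ex in examples if ex["has_error"]]
    flagged = sum(detect_error(ex["task_input"], ex["llm_response"]) for ex in with_errors)
    return flagged / len(with_errors)

# Toy data in the shape of (task input, LLM response, gold error label).
examples = [
    {"task_input": "Write a math word problem ...", "llm_response": "...", "has_error": True},
    {"task_input": "Is this claim supported ...", "llm_response": "...", "has_error": False},
]

# A trivial detector that never flags errors has recall 0 on this toy data.
print(recall(examples, lambda task_input, llm_response: False))
```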