📢 New survey on Self-Correction of LLMs!
😢 LLMs often cannot correct their mistakes by prompting themselves
😢 Many studies conduct unfair experiments
😃 We analyze requirements for self-correction 🧵
@YusenZhangNLP @NanZhangNLP Jiawei Han @ruizhang_nlp
https://t.co/fqbWkyKCGf pic.twitter.com/A9ptzJYcp8
— Ryo Kamoi (@RyoKamoi) June 5, 2024
📢 New Preprint! Can LLMs detect mistakes in LLM responses?
We introduce ReaLMistake, an error detection benchmark with errors made by GPT-4 & Llama 2.
We evaluated 12 LLMs and showed that LLM-based error detectors are unreliable!
@ruizhang_nlp @Wenpeng_Yin @armancohan +
https://t.co/ehlXKbTdXO pic.twitter.com/NjiOXrQcBH
— Ryo Kamoi (@RyoKamoi) April 5, 2024
New dataset for Doc-level NLI!
Looking for a dataset with realistic claims and premises? Check out WiCE! WiCE annotates claims in Wikipedia with entailment labels wrt the cited articles.
w/ @tanyaagoyal @juand_r_nlp @gregd_nlp
https://t.co/5iBKndhwf0
data: https://t.co/cfZrctyKQe
1/ pic.twitter.com/04c1kz6UBq
— Ryo Kamoi (@RyoKamoi) March 7, 2023
New preprint
QA metrics are quite popular in factuality eval, in part because it's believed that they are interpretable and can localize errors. We show that this is *not* true! Their localization is worse than simple exact match!
https://t.co/bA90PK8tKl
w/ @tanyaagoyal @gregd_nlp pic.twitter.com/Lyo2RT1RIc
— Ryo Kamoi (@RyoKamoi) October 14, 2022