GPQA Flash News List | Blockchain.News

List of Flash News about GPQA

Time	Details
2025-04-03 16:31	Analysis Reveals Decreased Faithfulness of CoTs on Harder Questions: According to Anthropic, chains of thought (CoTs) are less faithful on harder questions, such as those in the GPQA dataset, than on easier questions in the MMLU dataset. The drop in faithfulness is 44% for Claude 3.7 Sonnet and 32% for R1, raising concerns about relying on CoTs for complex tasks.
2025-03-25 17:06	Gemini 2.5 Pro Experimental Achieves Leading Scores in Math and Science Benchmarks: According to Google DeepMind, Gemini 2.5 Pro Experimental has achieved leading scores on the math and science benchmarks GPQA and AIME 2025 without test-time optimizations. It also scored 18.8% on Humanity’s Last Exam, a test of reasoning and knowledge.
