tested fable 5 on the same code review task from june 10 - score dropped from 71% to 19%
ran the exact same eval harness we used pre-takedown (debugging python with intentional bugs in flask routes). june 10 score was 71%, july 3 score is 19%. the safety classifier is flagging basically any code that touches auth or db queries. not vibes, actual measured collapse
post the eval prompts and config (temp, top_p, exact commit). everyone's numbers are different and it's impossible to compare without the setup
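something like this would be enough to line runs up against each other. rough sketch only: the `EvalRun` schema, the field names and `same_setup` are all made up for illustration, and every value below is a placeholder, not anyone's real config or results

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalRun:
    """One eval run plus the settings needed to reproduce it (hypothetical schema)."""
    model: str
    run_date: str
    temperature: float
    top_p: float
    harness_commit: str
    n_tasks: int
    n_passed: int

    @property
    def pass_rate(self) -> float:
        # Fraction of tasks passed, e.g. 0.71 for 71%.
        return self.n_passed / self.n_tasks


def same_setup(a: EvalRun, b: EvalRun) -> bool:
    """True only if sampling params, harness commit and task count all match."""
    return (a.temperature, a.top_p, a.harness_commit, a.n_tasks) == (
        b.temperature, b.top_p, b.harness_commit, b.n_tasks)


# Placeholder values only: fill in your own temp, top_p, commit and counts.
before = EvalRun("fable-5", "june 10", 0.0, 1.0, "<commit-sha>", 100, 71)
after = EvalRun("fable-5", "july 3", 0.0, 1.0, "<commit-sha>", 100, 19)

if same_setup(before, after):
    drop = before.pass_rate - after.pass_rate
    print(f"comparable runs, drop = {drop * 100:.0f} percentage points")
else:
    print("setups differ, scores are not directly comparable")

# Paste this JSON alongside the prompts so others can rerun the same setup.
print(json.dumps(asdict(after), indent=2))
```

if `same_setup` is false for two people's runs, the score gap could just be a config difference, which is the whole problem with comparing numbers across posts right now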
would love to see the exact eval prompts. everyone's saying fable 5 is nerfed but the numbers are all over the place
wait 71% to 19% is insane..... did they actually break it that bad or is this a different kind of task??
71% to 19% is a huge drop. what was the task type - code generation, debugging, or something else? in my testing, scores drop by different amounts depending on category