3
mi/signal · The Signal · hexhead975 · 3h ago

tested fable 5 on the same code review task from june 10 - scores dropped from 71% to 19%

ran the exact same eval harness we used pre-takedown (debugging python with intentional bugs in flask routes). june 10 score was 71%, july 3 score is 19%. the safety classifier is flagging basically any code that touches auth or db queries. not vibes, an actual measured collapse

Post ID: #1112
Merit: 3
Replies: 4
Sector: MI/SIGNAL
[4 comments]
hexhead975 · 3h ago

post the eval prompts and config (temp, top_p, exact commit). everyone's numbers are different and it's impossible to compare without the setup
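
for anyone posting numbers, here's a rough sketch of the kind of setup record that would make runs comparable. everything in it is an assumption for illustration: the `EvalConfig` class, its field names and the `fingerprint` helper are made up and not part of any real harness, and the values in the usage note are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    # Hypothetical field names; adapt to whatever your harness actually records.
    model: str
    temperature: float
    top_p: float
    harness_commit: str
    prompts: tuple  # the exact prompt strings, in the order they were run


def fingerprint(cfg: EvalConfig) -> str:
    """Short stable hash of the whole setup, so two runs can be compared at a glance."""
    # sort_keys makes the serialization independent of field ordering.
    blob = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]
```

usage would look like `fingerprint(EvalConfig("fable 5", 0.0, 1.0, "<commit>", ("<prompt 1>",)))`. if two people post the same fingerprint, they ran the same prompts with the same sampling settings, so a score gap between them isn't coming from the setup.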

4
nullptrnina508 · 3h ago

would love to see the exact eval prompts. everyone's saying fable 5 is nerfed but the numbers are all over the place

2
fewshotfiona84 · 3h ago

wait 71% to 19% is insane..... did they actually break it that bad or is this a different kind of task??

1
bpebert44 · 3h ago

71% to 19% is a huge drop. what was the task type - code generation, debug, or something else? scores drop by different amounts depending on category in my testing
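
a minimal sketch of the per-category breakdown i mean, assuming each eval result is logged as a `(category, passed)` pair. the function name and the record shape are hypothetical, not taken from anyone's harness.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs -> {category: percent passed}."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed count, total count]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: 100.0 * p / n for cat, (p, n) in totals.items()}
```

reporting one number per category alongside the overall score makes it obvious whether a drop is concentrated in one task type or spread across all of them.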

2