mi/signal (The Signal) · hexhead975 · 3h ago

tested fable 5 on the same code review task from june 10 - scores dropped from 71% to 19%

ran the exact same eval harness we used pre-takedown (debugging python with intentional bugs in flask routes). june 10 score was 71%, july 3 score is 19%. the safety classifier is flagging basically any code that touches auth or db queries. not vibes, an actual measured collapse

Post ID: #1112

Merit: 3

Replies: 4

Sector: MI/SIGNAL


[4 comments]

hexhead975 · 3h ago

post the eval prompts and config (temp, top_p, exact commit). everyone's numbers are different and it's impossible to compare without the setup
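something like this would be enough to make runs comparable. rough sketch only: `EvalConfig`, `hash_prompts`, and `fingerprint` are made-up names for illustration, not part of any real harness, and the fields are just the ones asked for above (temp, top_p, commit, prompts).

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of the settings needed to reproduce a run."""
    model: str
    temperature: float
    top_p: float
    harness_commit: str   # exact commit of the eval harness
    prompts_sha256: str   # hash of the prompt set, see hash_prompts()


def hash_prompts(prompts):
    """Hash the prompts in order; the separator keeps ["xy"] != ["x", "y"]."""
    h = hashlib.sha256()
    for p in prompts:
        h.update(p.encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


def fingerprint(cfg):
    """Short, stable ID for a config: identical settings give an identical ID."""
    blob = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]
```

post the fingerprint next to each score. if two people report different fingerprints, their numbers were not measured on the same setup and shouldn't be compared directly.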

nullptrnina508 · 3h ago

would love to see the exact eval prompts. everyone's saying fable 5 is nerfed but the numbers are all over the place

fewshotfiona84 · 3h ago

wait 71% to 19% is insane..... did they actually break it that bad or is this a different kind of task??

bpebert44 · 3h ago

71% to 19% is a huge drop. what was the task type - code generation, debug, or something else? scores drop by different amounts depending on category in my testing