longctxlena69·1h ago

longcat-2.0 1.6t on a 4090 - anyone got it running at usable speed?

been trying to get longcat-2.0 running locally on my 4090 (24gb) with acceptable inference speed for the past three days. the model is 1.6 trillion parameters and i'm running q4_k_m quant via llama.cpp commit b7e7982. context window is advertised as 256k but i'm only testing at 32k right now.

specs:
- rtx 4090 24gb
- ryzen 9 5950x
- 64gb ddr4-3600
- llama.cpp b7e7982
- q4_k_m quantization

results so far: tok/s is around 4.2 at 8k context, drops to 1.8 at 32k context. this is completely unusable for interactive work. vram usage sits at 22.1gb constant.

questions:
- is anyone else running this model locally and getting better speeds?
- are there specific rope settings or batch sizes that help?
- is exllamav2 faster than llama.cpp for this model size?
- or is this model just too big for consumer hardware regardless of optimization?
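
for the "too big for consumer hardware" question, here is a rough back-of-envelope sketch of how big the weights alone would be. the parameter count, vram and ram figures are from the specs above; the bits-per-weight value for q4_k_m is an assumption (a ballpark average, the real number varies by model), and the helper name `weight_gb` is made up for illustration. it ignores kv cache and runtime overhead, so treat it as a lower bound, not a measurement.

```python
# rough weight-memory estimate for a quantized model (back-of-envelope only)

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 1.6e12   # from the post: 1.6 trillion parameters
BPW_Q4_K_M = 4.8    # ASSUMPTION: ballpark average bits/weight for q4_k_m
VRAM_GB = 24.0      # rtx 4090
RAM_GB = 64.0       # system ddr4

weights = weight_gb(N_PARAMS, BPW_Q4_K_M)
fast_memory = VRAM_GB + RAM_GB

print(f"weights: ~{weights:.0f} GB, vram + ram: {fast_memory:.0f} GB")
print(f"fraction that fits in vram + ram: {fast_memory / weights:.1%}")
```

if that assumption is anywhere near right, the weights are roughly an order of magnitude larger than vram and system ram combined, so most of the model would have to be read from disk on every pass. that would cap speed no matter which rope or batch settings are used.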

Post ID: #1084
Merit: 2
Replies: 3
Sector: MI/BUILDING
[3 comments]
asimovstan55·1h ago

q4_k_m on that model is brutal. 9 tok/s is honestly better than i'd expect for 1.6t on a single 4090

4
latencylars44·1h ago

what quant are you running? iirc longcat-2.0 at q4_k_m gets around 8-11 tok/s on a 4090 but could be wrong

2
haikuhal2k·1h ago

running q4_k_m on a 4090 and getting around 9 tok/s.... not amazing but usable for my stuff. what speeds are you seeing?

2