Run Command-R+ locally with llama.cpp and 64GB of RAM

Alright, strap in. Support for Command-R+ was merged into llama.cpp exactly 4 hours ago. We're going to start talking to a GPT-4 level model on local hardware without a GPU. If you have 64GB of RAM, feel free to follow along 🧵

First up, a note about hardware: text generation is limited by memory bandwidth. This will run on any machine with 64GB of RAM or more, but if you want speed, I recommend DDR5, ideally on an 8- or even 12-channel platform: Xeon, Epyc, Threadripper Pro, or Apple silicon.
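
Rough rule of thumb (an estimate, not a benchmark): every generated token streams the whole quantized model through RAM once, so tokens/sec is roughly bandwidth divided by model size. A quick Python sanity check with illustrative bandwidth numbers, not measured ones:

# Each generated token reads the full model from memory, so bandwidth caps speed.
def est_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

print(est_tokens_per_sec(80, 48))   # dual-channel DDR5, ~48GB quant: ~1.7 tok/s
print(est_tokens_per_sec(460, 48))  # 12-channel server / Apple Silicon Max-class: ~9.6 tok/s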

Next, we're going to grab the quantized Command-R+ weights in GGUF format. That's here: https://huggingface.co/dranger003/c4ai-command-r-plus-iMat.GGUF/tree/main

Download the biggest quant you can fit in RAM, leaving maybe 8-16GB of headroom (so at 64GB, try iq3_m or iq3_s, which are ~48GB each). The larger quants are split across multiple files.
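
If you'd rather script the download than click around in the browser, here's a sketch using huggingface_hub (the filename is a placeholder; copy the exact name from the repo's file list):

from huggingface_hub import hf_hub_download

# Placeholder filename: replace with the exact iq3_m or iq3_s file listed in the repo.
path = hf_hub_download(
    repo_id="dranger003/c4ai-command-r-plus-iMat.GGUF",
    filename="REPLACE-WITH-EXACT-FILENAME.gguf",
    local_dir=".",
)
print(path)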

Now, let's prepare our chat using the chat template included with Command-R+. pip install transformers, then run this in Python (feel free to change the chat):
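
Something like this, for example (a sketch: it assumes the CohereForAI/c4ai-command-r-plus tokenizer, which may require accepting the model license on Hugging Face):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")

chat = [
    {"role": "user", "content": "How do I make a good espresso at home?"},
]

# tokenize=False returns the formatted prompt as a string (rather than token IDs),
# which is what we want to paste into llama.cpp.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)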

The result is the formatted chat, ready to hand to llama.cpp. Paste it into the -p argument of ./main in the llama.cpp directory, pass your GGUF file to -m, and set -n to the maximum response length in tokens.
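
Putting it together, the command looks roughly like this (the GGUF filename is a placeholder, and shell-quoting the prompt can be fiddly because of the template's special tokens, so adjust as needed):

./main -m ./REPLACE-WITH-EXACT-FILENAME.gguf -n 400 -p "<paste the formatted chat here>"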
