llama.cpp Notes


I've been working with llama.cpp for most of 2023 now, and it's probably time to capture some of my findings.

Basic Setup

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

This compiles an executable called main, which invokes a CLI-based interface. There's also a file called server, which instead starts a web server. The web server provides a web interface that I've tested, and is also supposed to provide an OpenAI-compatible API. I haven't tested that yet, but there is a extensive and well-written README for the server example in the llama.cpp repository.

Obtaining Models

By far the best source of models is HuggingFace. HuggingFace is to AI models what GitHub is to source code. There are a lot of different formats to download, but because I'm operating completely without a GPU on my Framework laptop (12th gen Intel Core i5), my options are very limited, since my available compute is so low. This keeps me using llama.cpp (as outlined above), and also confines me to GGUF files. By far the best source to keep up to date on the latest GGUF files (along with a good variety of quantizations) is TheBloke's page on HuggingFace. It is an absolute goldmine of data to read through if you want to dive into this world.

Top links below are to GGUF models that are compatible with llama.cpp. I'll include the original link as well, since those model cards have useful information about what makes the model interesting. As of 11 December 2023, my top model choices are:

OpenHermes 2.5 Mistral 7B

OpenHermes 2.5 Mistral 7B is a state of the art Mistral Fine-tune, a continuation of OpenHermes 2 model, which trained on additional code datasets.

After Mistral came out, I replaced all my usage of llama 7B with Mistral 7B. My initial trial was on the day Mistral 7B Instruct v0.1 was released, and I was so impressed with its performance, I decided I'd try and seek out the inevitably-better fine-tunes, and of the handful I tried, OpenHermes 2.5 Mistral 7B was the best. I used this every day for a couple of weeks and it was quite pleasant.

In terms of comparing relatively between models, there's no easy method I can find. I wanted to use MTBench, since Mistral 7B Instruct v0.1 gets a 6.84, but I learned that OpenHermes 2.5 Mistral 7B is ungraded on that scale. One neat comparison might be Elo Rating in the lmsys Chatbot Arena. Mistral 7B Instruct v0.1 gets 1018, and OpenHermes 2.5 Mistral 7B gets 1075, which is several places up in the leaderboards.

Original Model

Zephyr 7B Beta

Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful.

I've heard great things about Zephyr in the community, but haven't actually tried it myself. I have a list of three models to play around with, and Zephyr is next, though.

Zephyr 7B Beta is also ranked in the leaderboards, and has an Elo of 1045.

Original Model

Una Cybertron 7B v2

We strike back, introducing Cybertron 7B v2 a 7B MistralAI based model, best on it's series. Trained on SFT, DPO and UNA (Unified Neural Alignment) on multiple datasets. He scores EXACTLY #1 with 69.67+ score on HF LeaderBoard board, #8 ALL SIZES top score.

Una Cybertron feels like a different kind of model than the others, but I've only been playing with it for a couple of days. It's based on a new technique called Uniform Neural Alignment (the quote above has a typo that calls it 'Unified Neural Alignment'), but the exact details aren't published yet.

In general, I've found it to be able to handle more levels of indirection in my requests, for example creating RPG missions that mash up the Mini-Six roleplaying game with the Firefly universe. The original model card mentions logic:

The model excels in mathematics, logic, reasoning, overall very smart. He can make a deep reasoning over the context and prompt, it gives the impression of not missing details around.

Una Cybertron is not currently tracked in the Chatbot Arena leaderboards.

Original Model

Mistral 7B Instruct v0.2

5 hours before I started writing this, I saw that Mistral 7B v0.2 just dropped, and I've been downloading it while I write this. Here's what they say:

The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an improved instruct fine-tuned version of Mistral-7B-Instruct-v0.1.
For full details of this model please read our paper and release blog post.

The blog post makes it clear that this model is called Mistral-tiny, and they mostly mention it in the context of their new API endpoints:

Mistral-tiny. Our most cost-effective endpoint currently serves Mistral 7B Instruct v0.2, a new minor release of Mistral 7B Instruct. Mistral-tiny only works in English. It obtains 7.6 on MT-Bench. The instructed model can be downloaded here.

I'll be testing this tonight, but reporting a 7.6 on MTBench vs v0.1's score of 6.84 represents a significant jump, so there's a good chance folks will be able to spawn another set of fine tunes based on this latest release.

Original Model


I have not done any kind of extensive testing with different quantization levels. I generally value quantization for the inference speedup: disk space is not an issue at this scale for me, and I upgraded my Framework to 64GB RAM, so I have enough memory.

The basic idea of quantization is to reduce the precision of the weights in the model, which in turn allows us to store the model more compactly. This saves disk space, saves on memory, and reduces computational load during inference. It comes at the cost of fidelity, but I've found that even at moderate quantization levels, loss is minimal, just comparing a 16-bit model (fp16) with a 5-bit model (Q5_K_M), which is my standard choice when downloading.

Running Models

I run llama.cpp directly, because I like to keep things simple (even if it means they aren't quite as easy). For a long time, I ran in the CLI using main. I put all the models (GGUF files) that I download into a subdirectory of llama.cpp called models, and then launch like this:

./main -m models/openhermes-2.5-mistral-7b.Q5_K_M.gguf -ins

The help material for llama.cpp suggests that the ins flag is useful for Alpaca-models (a much older model that came out shortly after llama), but I've found it works well with all the models I use, so it's my default setting.

For the past few weeks I've found myself using the CLI more for batch jobs, and I've been using the the web interface more often, since it's nice for interactive use. Instead of main, we invoke server instead:

./server --port 7007 -m models/una-cybertron-7b-v2-bf16.Q5_K_M.gguf

I use a custom port because I run lots of different servers, but other than that, all that I'm specifying is the relative path to the model to use. Since GGUF has model metadata, llama.cpp can automatically start with the correct settings for that model. You can then get started by visiting http://localhost:7007.