Day 8 of the 12 days of Christmas! Go through Day 1, Day 2, Day 3, Day 4, Day 5, Day 6, and Day 7 to catch up.
The next part of the series is here: the formats used by LLMs. I am talking about LLMs in general, not just local LLMs. From what we know, even the state-of-the-art models can be available in these formats, and you could theoretically run them locally. Small caveat: you would probably need a small power plant and custom hardware to do so.
We have a very nice blog post from HuggingFace talking about the various formats. A good read if you want to delve a little deeper. I will only be touching the periphery without going into the technical details.
An LLM file generally contains the weights of the model, the actual “intelligence” of the model. It may also contain the model’s configuration and runtime information detailing how you can run inference with it. Since the file contains the weights of the model, we call it “open source”. To be fair, it should be called “open weights”: you don’t really get the source or the actual training data. Now that Ozempic is here, we may be able to call out “weights” without offending anybody.
Pickle or PyTorch format
The OG format that the initial models were released in, and you can still find it around. It is the legacy serialization method for storing model weights using Python’s pickle module, and it is unsafe as a rule. Loading it can execute arbitrary Python code, making it dangerous for untrusted models: attackers can embed malicious payloads that run the moment you load the file, creating supply chain attack vectors. Stay away. You don’t need it anymore. HuggingFace developed the safetensors format to replace it, and the replacement has been a success. You would need to write your own code to load the model with PyTorch, if you are at all so inclined.
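If you ever do have to load one of these, a minimal sketch along these lines is the safest route I know of; the filename is just a placeholder, and weights_only needs a reasonably recent PyTorch.

```python
import torch

# "pytorch_model.bin" is a placeholder for whatever pickle-based checkpoint you have.
# A plain torch.load unpickles the file, which can run arbitrary code planted by
# whoever produced it. weights_only=True (recent PyTorch versions) restricts
# unpickling to tensors and primitive types and refuses anything that carries code.
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)

for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```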
safetensors
safetensors is the successor to the pickle format, a secure format created by HuggingFace. The files contain only the model weights, so nothing executes on load. You need a library to load them, and plenty exist: PyTorch, HuggingFace Diffusers, AMD Quark, and more in all the languages you might want to use. Once again, this is not something you can just download and deploy, so for local LLM use it is not something I would recommend.
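For the curious, loading one looks roughly like this with the safetensors library; the filename is illustrative, and you still need a framework like PyTorch on top to actually do anything with the tensors.

```python
from safetensors.torch import load_file  # pip install safetensors

# "model.safetensors" is just the conventional filename; substitute your own download.
# The file is raw tensor data plus a JSON header, so nothing executes on load.
state_dict = load_file("model.safetensors", device="cpu")

for name, tensor in state_dict.items():
    print(name, tensor.dtype, tuple(tensor.shape))
```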
GGUF
Whenever a new model is released, the question that pops up immediately is, “GGUF when?” GGUF is the model format that you can just download and use. Aside from model weights, GGUF files also contain the model’s configuration and the runtime information required for inference. GGUF was derived from GGML, Georgi Gerganov’s machine learning library; while GGUF is often said to stand for GPT-Generated Unified Format, it is really a successor format that inherited GGML’s naming pattern. To make things plain, if you want to use local LLMs, you will use GGUF (unless you are on a Mac, in which case go to the next section).
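As a rough sketch of what “download and use” looks like in code, here is llama-cpp-python pointed at a GGUF file; the model path and the parameter values are placeholders, not recommendations.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The GGUF path below is a placeholder; point it at whatever file you downloaded.
llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,       # context window to allocate
    n_gpu_layers=-1,  # offload as many layers to the GPU as will fit
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```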
MLX
MLX is Apple’s open-source machine learning framework designed specifically for Apple Silicon, and it makes effective use of the unified memory on Mac hardware. MLX effectively turns safetensors into an Apple-compatible LLM: unlike GGUF, you download safetensors files plus configuration files. Fortunately, tools like LM Studio mask this complexity. To be honest, if you want to run a local model effectively in a laptop form factor, a Mac is the way to go, especially with MLX around. Unified memory works like a charm: with 18 GB of memory, you can easily run a model that would otherwise need a beefy graphics card, and if you can get more memory, even better. You would need a proper desktop setup to beat that in terms of thermal performance and price. Not all models are available in MLX format, but there is a dedicated MLX community that works to get most of the popular models out in MLX.
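If you are curious what the MLX route looks like without LM Studio in the way, the mlx-lm package (Apple Silicon only) boils it down to a couple of calls; the mlx-community model ID below is illustrative.

```python
from mlx_lm import load, generate  # pip install mlx-lm (Apple Silicon only)

# The repo ID is illustrative; pick any MLX-converted model, e.g. from mlx-community.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Why does unified memory help local LLMs?",
    max_tokens=128,
)
print(text)
```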
ONNX
ONNX is an open-source specification that defines how to serialize a model’s computation graph, operators, and data types into a portable file format. It is a cross-organization effort to standardize models. ONNX is widely used for production deployment and can handle large transformer and LLM architectures, but in our discussion about using local LLMs, it is not something you would use. It is something you should know about, though, if you are remotely interested in LLM implementation at the device level. ONNX Runtime Web can run ONNX models in the browser through WebGPU and WebAssembly. I have been thinking of a project to run a model in the browser, but the main hesitation is the memory constraints and the fear of breaking a customer’s browser. There are also WebLLM / WebGPT / WebRNN style projects that take GGUF or framework models and compile them to WebGPU-friendly formats.
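For completeness, running an ONNX model with ONNX Runtime in Python looks roughly like this; the filename and the dummy input shape are placeholders that depend entirely on the exported model.

```python
import numpy as np
import onnxruntime as ort  # pip install onnxruntime

# "model.onnx" is a placeholder for any exported ONNX graph.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Inspect the graph's declared inputs, then feed matching numpy arrays.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

input_name = session.get_inputs()[0].name
dummy = np.zeros((1, 8), dtype=np.int64)  # shape/dtype depend on the exported model
outputs = session.run(None, {input_name: dummy})
print([out.shape for out in outputs])
```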
GPTQ/AWQ/EXL2
These are formats I have never used. They are quantized versions made to run on NVIDIA hardware with specialized runtimes like vLLM. They are faster and give better output than GGUF on high-end NVIDIA hardware, and once you start paddling around in local LLM land, you might graduate to using them.
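I have not run these myself, but going by the vLLM docs the shape of it is roughly the following; the model ID is illustrative, and the quantization argument has to match the format of the checkpoint.

```python
from vllm import LLM, SamplingParams  # pip install vllm (NVIDIA GPU required)

# Illustrative AWQ-quantized repo; for a GPTQ checkpoint you would pass quantization="gptq".
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What makes AWQ fast on GPUs?"], params)
print(outputs[0].outputs[0].text)
```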
P.S.: We are missing stuff like quantization in this post. I thought I would reserve it for when we discuss downloading and running the models. Hopefully, we can cover that in the next post along with a new choice of models.
A smart and concise review. Just right.
I didn’t see the missed opportunity. Hallucination? Claude or me is the question.
Soup seems to be the flavour of the day. Maybe the winter and the cooling systems are making the LLMs crave some heat. ChatGPT can’t resist yapping on such a short piece too.
On point is Grok today.
Kimi K2 is on a mission to dethrone ChatGPT on slop.