
Use case

The popularity of projects like PrivateGPT, llama.cpp, Ollama, GPT4All, llamafile, and others underscores the demand to run LLMs locally (on your own device).

This has at least two important benefits:

  1. Privacy: Your data is not sent to a third party and is not subject to a commercial service's terms of service.
  2. Cost: There is no inference fee, which matters for token-intensive applications (e.g., long-running simulations, summarization).


Running an LLM locally requires a few things:

  1. Open-source LLM: An open-source LLM that can be freely modified and shared
  2. Inference: Ability to run this LLM on your device with acceptable latency

Open-source LLMs

Users can now gain access to a rapidly growing set of open-source LLMs.

These LLMs can be assessed across at least two dimensions (see figure):

  1. Base model: What is the base model, and how was it trained?
  2. Fine-tuning approach: Was the base model fine-tuned, and, if so, what instructions were used?
[Figure: open-source LLMs compared by base model and fine-tuning approach]

The relative performance of these models can be assessed using several leaderboards, including:

  1. LmSys
  2. GPT4All
  3. HuggingFace


A few frameworks have emerged to support inference of open-source LLMs on various devices:

  1. llama.cpp: C++ implementation of llama inference code with weight optimization/quantization
  2. gpt4all: Optimized C backend for inference
  3. Ollama: Bundles model weights and environment into an app that runs on a device and serves the LLM
  4. llamafile: Bundles model weights and everything needed to run the model in a single file, allowing you to run the LLM locally from this file without any additional installation steps
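As one illustration of how these tools are used, Ollama serves models over a local HTTP API (by default at localhost:11434). The sketch below builds a request body for its /api/generate endpoint; the model name and prompt are hypothetical, and the actual call is commented out since it requires a running Ollama server:

```python
import json

def build_generate_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

payload = build_generate_request("llama3", "Why run LLMs locally?")
body = json.dumps(payload)

# To actually send it (requires `ollama serve` running locally):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```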

In general, these frameworks will do a few things:

  1. Quantization: Reduce the memory footprint of the raw model weights
  2. Efficient implementation for inference: Support inference on consumer hardware (e.g., CPU or laptop GPU)

In particular, see this excellent post on the importance of quantization.

[Figure: effect of quantization on model memory footprint]

With reduced precision, we radically decrease the memory needed to store the model.
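A back-of-the-envelope calculation makes this concrete (a sketch of the weights alone; real quantized files add some overhead for scales and metadata):

```python
def model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold just the model weights."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# A 7B-parameter model at different precisions:
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: ~{model_size_gb(7, bits):.1f} GB")
# 32-bit: ~28.0 GB, 16-bit: ~14.0 GB, 4-bit: ~3.5 GB
```

Dropping from 32-bit floats to 4-bit quantized weights shrinks a 7B model from ~28 GB to ~3.5 GB, which is what makes laptop inference feasible.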

In addition, we can see the importance of GPU memory bandwidth.

A Mac M2 Max is 5-6x faster than an M1 for inference due to its larger GPU memory bandwidth.
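The bandwidth effect can be sketched with a rough upper bound: generating each token requires streaming the full set of weights through memory, so throughput is capped at roughly bandwidth divided by model size. The numbers below use published memory-bandwidth specs for the M1 (68.25 GB/s) and M2 Max (400 GB/s) and are an illustrative ceiling, not a benchmark:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling: every generated token reads all weights once from memory."""
    return bandwidth_gb_s / model_size_gb

# A 4-bit-quantized 7B model is ~3.5 GB of weights.
m1_bw, m2_max_bw = 68.25, 400.0  # GB/s, published figures
print(f"M1:     ~{max_tokens_per_sec(m1_bw, 3.5):.0f} tokens/s ceiling")
print(f"M2 Max: ~{max_tokens_per_sec(m2_max_bw, 3.5):.0f} tokens/s ceiling")
```

The ratio of the two ceilings (400 / 68.25 ≈ 5.9) lines up with the observed 5-6x speedup.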

[Figure: inference speed vs. GPU memory bandwidth]

By Asif Raza
