Since I haven't shared any "tutorials" yet - even though I have a category for them - I figured it's time for this kind of post.
Now, just to clarify, I'm not an expert in this field. But since it's a topic I find really interesting and I've picked up a few things, I thought I might be able to help out some newcomers.
What is a Local Large Language Model
By now, almost everyone has either encountered or at least heard of ChatGPT.
ChatGPT is exactly that—a Large Language Model (LLM).
To communicate with it, you need to visit the dedicated OpenAI website (https://chatgpt.com/).
Anything you write there is processed by OpenAI’s servers and models, and the data is stored in their database.
Local LLMs are very similar to ChatGPT and are based on the same technology.
The difference is that, instead of using a service provided by a third party, you download them onto your personal computer and use them offline (without needing an internet connection).
What are the pros and cons of local models?
At first glance, it seems fantastic - you download the model and use it whenever and as much as you want, right? :)
Unfortunately, things are never that simple.
Here’s a list of some of the positives and negatives of using Local LLMs compared to ChatGPT, Claude, etc.
Pros:
- You can use it completely offline without any restrictions
- You’re not dependent on OpenAI’s servers being operational to access it
- Your interactions are truly private, and no one else has access to them
- There are uncensored models available, allowing you to discuss absolutely anything
Cons:
- It can be expensive. Using a local large language model with more parameters requires powerful hardware, which can cost tens of thousands of dollars
- Most models with fewer parameters are significantly less capable, even compared to the free versions from OpenAI
What do these parameters represent
I’ve mentioned the word "parameters" a few times, right?
There’s a wide variety of local large language models available.
Some of the most popular ones right now include:
- Llama, developed by Meta
- Qwen (Alibaba)
- Gemma (Google)
- Phi (Microsoft)
- Mistral
Each of these models has various versions.
A single version of a model can come in different configurations based on the number of parameters.
For instance, let’s take the Llama 3 family as an example. Depending on the release, you can download it with 1B, 3B, 8B, 70B, or even 405B parameters, where "B" stands for billion.
In other words, 3B means you’re downloading a version of the model that has 3 billion parameters (the numbers, or "weights", it learned during training). You might have already guessed that the more parameters a model has, the "smarter" it tends to be.
It understands you better, provides more relevant and coherent responses, and overall performs better for your needs.
However, there’s a significant caveat. The larger the model you download, the better hardware you need to run it.
For example, I have an RTX 3070 with 8GB of VRAM. With that, I can comfortably use models with up to around 10 billion parameters. Models with more parameters either run quite slowly or won’t work at all.
What is Quantization
There’s one more thing you’ll notice if you start exploring the different models, namely the so-called "Quantization".
Under the hood, an LLM is fundamentally a huge collection of numbers - floating-point numbers, to be specific - known as weights.
Words, syllables, and symbols are converted into vectors of these numbers, and the model relies on them when deciding what to output next.
By default, each of these numbers is stored at fairly high precision (16 bits or more). Quantization stores each number using fewer bits, making the model file smaller.
The lower the quantization level, the fewer bits are used per number. Levels are typically labeled Q2, Q3, Q4, Q5, Q6, Q8, and so on, where the number roughly corresponds to bits per weight, with various sub-types available.
For instance, in Q2, the model is significantly "lightened" or rather "trimmed down." In other words, it requires far fewer resources to operate.
However, keep in mind that the lower the quantization, the "dumber" the model becomes. It’s generally not recommended to go below Q4, as the quality of the model drops dramatically.
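To make the idea concrete, here is a toy sketch of what quantization does to a single number. Real quantization schemes work on whole blocks of weights and are considerably more sophisticated than this, but the principle is the same: store each value with fewer bits and accept a small rounding error.

```python
def quantize(value: float, bits: int, max_abs: float = 1.0) -> int:
    """Map a float in [-max_abs, max_abs] to a small signed integer."""
    levels = 2 ** (bits - 1) - 1        # e.g. 4 bits -> integers in [-7, 7]
    return round(value / max_abs * levels)

def dequantize(q: int, bits: int, max_abs: float = 1.0) -> float:
    """Map the integer back to an approximate float."""
    levels = 2 ** (bits - 1) - 1
    return q / levels * max_abs

w = 0.3724
for bits in (8, 4, 2):
    restored = dequantize(quantize(w, bits), bits)
    # The fewer bits we keep, the further the restored value drifts from 0.3724
    print(bits, round(restored, 4))
```

Run it and you’ll see the restored value is very close to the original at 8 bits, noticeably off at 4, and useless at 2, which mirrors why quality falls apart below Q4.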
Personally, if I use a model with fewer parameters and I know my computer can handle it without issues, I opt for Q6 or higher. If I’m on the edge, I’ll drop down to Q4.
If I can’t get it running with sufficient performance even at Q4, I simply look for another model or choose one with fewer parameters.
What’s the difference between the various local LLMs
I've already mentioned that there are numerous large language models. They’re created by different companies, trained on different data, and tuned with different priorities.
Each model has its own strengths and weaknesses - some excel at creative tasks, while others are better suited for technical subjects.
New versions of these models are released almost daily, and their performance is constantly evolving, making it impossible to definitively say - "model X is the best".
I’ve noticed that developers are currently focusing on creating models with fewer parameters that are smarter.
For instance, models released months ago with 70 billion parameters are now being surpassed by models with around 30 billion parameters.
One of the most popular websites where you can explore different models is huggingface.co.
There, you'll find models designed for various purposes: some can reproduce voices, others can generate images, and some can create text based on images, and so on.
What hardware do you actually need
I mentioned earlier what graphics card I'm currently using, and when it comes to AI, a good graphics card matters far more than the processor. Nvidia is leading the way here with its CUDA technology.
RAM, or more specifically VRAM, is extremely important. The more, the better. That's why I'm considering upgrading to an RTX 3090, as it offers three times the VRAM of my current card.
The goal is to load the entire model into your VRAM for optimal performance. Once your VRAM runs out and you have to load part of the model into RAM, performance drops significantly.
Currently, local models that can somewhat compete with the quality of ChatGPT are those with at least 72 billion parameters and high quantization. To comfortably use such a model, I estimate you would need at least three RTX 3090s. Of course, these are very general and rough calculations.
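The arithmetic behind these estimates is simple: memory needed is roughly the parameter count times the bytes stored per parameter, plus some overhead for the context and runtime. A minimal sketch - the 20% overhead figure is my own assumption, not a measured value:

```python
def estimate_vram_gb(params_billions: float, bits_per_param: float,
                     overhead: float = 1.2) -> float:
    """Very rough memory estimate for loading a model, in GB."""
    gbytes = params_billions * bits_per_param / 8   # billions of params * bytes each
    return gbytes * overhead

# An 8B model: at 16-bit precision it won't fit an 8GB card, at Q4 (~4 bits) it will.
print(round(estimate_vram_gb(8, 16), 1))
print(round(estimate_vram_gb(8, 4), 1))

# A 72B model at Q6 lands in the ballpark of three 24GB cards (~72GB total).
print(round(estimate_vram_gb(72, 6), 1))
```

This is why an 8GB card tops out around 10B parameters with quantization, and why a 72B model at high quantization calls for multiple high-VRAM GPUs.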
What other local models exist
There is a wide variety of local models. I've already mentioned some of them—some generate speech from text, while others create images from text, and so on.
Recently, the new version of Stable Diffusion 3.5 was released. This model generates images based on the text you provide, similar to Midjourney, with quite comparable quality. You can use the images it creates for whatever you like, completely free of charge.
There are also speech-to-text models like Whisper, which transcribe spoken audio into text.
Where to start
As long as you have the necessary hardware, the rest is relatively straightforward.
Personally, I use Ollama. It's extremely easy to install. Just head over to the official Ollama website at ollama.com and download it.
From there, you can use CMD (if you're on Windows) or your terminal to download the model of your choice.
You can find those on the Ollama site as well at https://ollama.com/library.
Let’s say you want to download the Llama 3.2 3B version. Simply type the following command in CMD or your terminal: ollama run llama3.2:3b
Ollama will then download the model for you and provide an option to chat with it.
Overall, Ollama runs as a local server and exposes an API, which makes it easy to connect other interfaces to it. You can check out some popular ones at this link: https://itsfoss.com/ollama-web-ui-tools
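For example, once Ollama is running, you can talk to a model from a few lines of Python using its REST endpoint on localhost (port 11434 is Ollama's default). This is a sketch based on Ollama's documented `/api/generate` endpoint; the model name is just the one from the earlier example.

```python
import json
import urllib.request

# Ollama listens on localhost:11434 by default.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the HTTP request Ollama expects for a single, non-streamed answer."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,   # one JSON response instead of a token stream
    }).encode("utf-8")
    return urllib.request.Request(OLLAMA_URL, data=payload,
                                  headers={"Content-Type": "application/json"})

def ask(model: str, prompt: str) -> str:
    """Send the prompt and return the model's answer text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# With the Ollama server running and the model already pulled, this would work:
# print(ask("llama3.2:3b", "In one sentence, what is quantization?"))
```

Any of the web UIs linked above are doing essentially this under the hood.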
What could you use local LLMs for
Pretty much for the same things you use ChatGPT for.
Many people simply chat with them, while others engage in role-playing games.
Another popular use is what's known as RAG (Retrieval-Augmented Generation). You provide the model with your own data, whether it’s documents or anything else you want to process. As a result, you can have the model return specific information based on that data.
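The core of RAG can be sketched in a few lines: turn your documents and the question into vectors, find the document closest to the question, and hand that document to the model as context. In this toy version the "embeddings" are just word counts, purely to show the mechanics - a real setup would use a proper embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the invoice from march is overdue",
    "our cat prefers the red cushion",
    "waterfall photos from the iceland trip",
]

question = "show me photos of waterfalls"
# Retrieve the most similar document; in real RAG it would be
# appended to the prompt so the LLM can answer from it.
best = max(docs, key=lambda d: cosine(embed(question), embed(d)))
print(best)
```

Swap the word counts for vectors from an embedding model and the same retrieve-then-generate loop scales to real document collections.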
Personally, I created my own gallery some time ago. Since I wasn’t very comfortable uploading personal photos to the cloud (like Google Photos), I set up a mini website to upload them to.
I host it on my personal server. Once a photo is uploaded, it goes through a model that "sees" what’s in it and returns a description. This description is then fed into another model (an Embedding model) that converts it into vectors. The idea is to enable searching within that description. Now I have a gallery where, if I type in the search bar that I want to see photos of waterfalls, it will show me those photos. :)
Well, that wraps up this "lesson". There’s certainly a lot more for you to learn, but the main idea was to give you a little nudge in the right direction.
I want to introduce some of you to the fact that ChatGPT isn’t the only player in the game.
Here are two additional links that will definitely be useful if you find this topic interesting and want to dive deeper:
• https://www.reddit.com/r/LocalLLaMA - The largest subreddit dedicated to local AI models. I learned a lot there, so I think it will be helpful for you too.
• https://lmarena.ai - Here, you can chat with various models without having to download them. There’s also a leaderboard to give you an idea of how each model performs. The data is mostly for reference.
