Hi, I’m new here, but I’d like to share some of the stuff I know on this topic for those who may not be very familiar with it. While I’m not familiar with using online providers, I can talk about locally run models on PC. (Incoming wall of text)
Currently, privacy is one of the biggest advantages of locally run models compared to online providers: they can run completely offline, so nothing needs to leave your computer (unless you want to remotely access your machine, which requires explicitly configuring it to do so). This also means you don’t have to trust your inputs to any provider’s practices or privacy policy, which is great for individuals and companies alike that may want to use sensitive information in their inputs (such as documents) without worrying about it being read, leaked, or anything else that could theoretically happen to such information at an online provider.
One issue with running them locally is having to choose between speed, intelligence/usability (not sure if that’s the correct term) and/or context length. To best explain this issue, I’ll give an example. Let’s say someone has a computer with 16GB of RAM. That person will be able to run smaller AI models, but generation will be done on the CPU, which is quite slow. Under normal circumstances they have three options:
(1). Run the largest model and/or highest context length their system can handle (for example, Llama 3.1 8B at Q8), while accepting that generation speeds will be extremely slow and that they may potentially run out of memory (see the rough memory math after this list).
(2). Use a smaller model and/or less context length so that generation speeds are higher and more usable, while accepting that the model won’t be as good and/or that it may forget things due to the shorter context length.
(3). Buy a GPU to upgrade the system, so generation now runs on it instead of the CPU, greatly boosting speeds and, depending on how much VRAM it has, allowing larger models and/or higher context lengths.
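To put rough numbers on the memory side of that trade-off, here’s a back-of-the-envelope sketch in Python. The formula and the model/quant picks are just my own approximation (real usage varies by architecture, context length and backend), but it shows why 16GB of RAM gets tight fast:

```python
# Rough, back-of-the-envelope estimate of quantized model size.
# Treat these as ballpark figures, not exact requirements.

def approx_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Very rough size of the quantized weights in GB (1B params ~= 1 GB at 8 bits)."""
    return params_billions * bits_per_weight / 8

for name, params_b, bits in [
    ("Llama 3.1 8B @ Q8", 8, 8),    # ~8 GB of weights
    ("Llama 3.1 8B @ Q4", 8, 4),    # ~4 GB of weights
    ("Llama 3.1 70B @ Q4", 70, 4),  # ~35 GB of weights, no chance on 16GB
]:
    print(f"{name}: ~{approx_model_size_gb(params_b, bits):.0f} GB of weights "
          f"(plus a few more GB for context/KV cache and the OS)")
```

So on a 16GB machine, an 8B model at Q8 plus its context plus whatever else is running is already getting tight, which is why people drop to smaller quants or smaller models.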
Let’s assume in our example we buy a GPU, say an RTX 4060 Ti 16GB, and install the necessary drivers. Now we can run the same model from option (1) at significantly more usable speeds, but with dedicated VRAM in the mix, we have a new topic to discuss. When a model is loaded entirely within the GPU, you tend to get the fastest speeds your system can offer. But with tools like llamacpp, it’s possible to split the model between VRAM and RAM, so our previous 16GB of RAM can also be used. This allows you to run even larger models than this system could handle before, but there’s a catch: having even just one layer of the model’s weights in RAM instead of VRAM already slows down generation, and the more layers you put in RAM instead of VRAM, the slower generation becomes.
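To make the VRAM/RAM split concrete, here’s a minimal sketch using the llama-cpp-python bindings (one of the llamacpp wrappers). The file names and layer counts are placeholders, and the exact knobs may differ a bit between versions, so treat this as an illustration of the idea rather than a recipe:

```python
# Sketch of GPU offloading with llama-cpp-python. Paths and numbers are
# placeholders; adjust n_gpu_layers to whatever actually fits in your VRAM.
from llama_cpp import Llama

# Fully offloaded: every layer lives in VRAM -> fastest generation.
llm_fast = Llama(
    model_path="llama-3.1-8b-instruct.Q8_0.gguf",  # hypothetical file name
    n_gpu_layers=-1,   # -1 = offload all layers to the GPU
    n_ctx=8192,
)

# Partially offloaded: some layers stay in system RAM -> bigger models fit,
# but every layer left on the CPU side slows generation down.
llm_big = Llama(
    model_path="some-much-larger-model.Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=20,   # only the first 20 layers go to VRAM
    n_ctx=4096,
)

print(llm_fast("Q: What is GPU offloading?\nA:", max_tokens=64)["choices"][0]["text"])
```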
So we’re back to our previous three options again, which makes this a good opportunity to talk about the actual issue I wanted to get to: cost. While it is theoretically possible to locally run something as large as Llama 3.1 405B (which is literally hundreds of gigabytes even at the lowest usable quantizations), you’d need to spend a LOT of money building a computer with enough RAM and/or VRAM to run such a model, while not necessarily getting better performance than something like ChatGPT/Claude, from what I’ve heard (never used them so IDK).
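Just to put a number on “hundreds of gigabytes”, the same back-of-the-envelope math as before (again a rough estimate of weights only, not an exact figure):

```python
# Llama 3.1 405B at roughly 4 bits per weight (a low but still usable quant):
params = 405e9
bits_per_weight = 4
print(f"~{params * bits_per_weight / 8 / 1e9:.0f} GB just for the weights")  # ~203 GB
# ...and that's before the KV cache, so you'd realistically want well over
# 200 GB of combined RAM + VRAM, which is workstation/server territory.
```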
As for what backends to use for locally run LLMs on PC (I’m not familiar with how things are on phones, but I’ve been told it’s possible), there are two big options that, AFAIK, all current backends are in one way or another derived from: Llamacpp and Exllama2. The former allows you to split the model layers between VRAM and RAM, but requires models to be quantized into .gguf files, while the latter is slightly faster but needs to fit fully inside VRAM. I personally use the former through a fork called Koboldcpp, which has extra features built on top of it while also being one of the easier ones to install, as the devs provide precompiled binaries for Linux, Windows and Mac, although you can compile it from source if you want to. It’s not the fanciest in terms of UI, but it’s perfectly usable, and if you so desire, there are other frontends available out there that you can use while still running koboldcpp as the backend.
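As an example of that “other frontend, koboldcpp as the backend” setup: koboldcpp exposes an HTTP API once it’s running, so any script or UI can talk to it. Here’s a minimal sketch, assuming the default port (5001) and the KoboldAI-style /api/v1/generate endpoint; double-check both against your koboldcpp version:

```python
# Sketch: talking to a locally running koboldcpp instance over HTTP.
# Assumes koboldcpp is already running with a model loaded on its default
# port; adjust the URL and fields if your setup differs.
import requests

payload = {
    "prompt": "Explain what a GGUF file is in one sentence.",
    "max_length": 80,     # maximum number of tokens to generate
    "temperature": 0.7,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```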
As for security, I think a good rule of thumb to keep in mind is: DO NOT RUN PICKLE FILES. I don’t know the exact details off the top of my head, but it’s an old file format with vulnerabilities that have been reported in the past, to the point that a new file format, “.safetensors”, was created for newer models that explicitly lacks said vulnerabilities. If you’re running llamacpp or one of its derivatives, you’ll be looking for “.gguf” models, or quantizing the model’s original “.safetensors” files yourself.
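For illustration, this is roughly what sticking to .safetensors looks like if you ever handle raw model weights yourself (a sketch using the safetensors library; file names are placeholders):

```python
# Sketch: prefer .safetensors over pickle-based .bin/.pt files for raw weights.
# File names are placeholders.
from safetensors.torch import load_file

# Safe: .safetensors is a plain tensor container, no code execution on load.
state_dict = load_file("model.safetensors")
print(f"Loaded {len(state_dict)} tensors")

# Risky: torch.load() on an untrusted pickle-based checkpoint can execute
# arbitrary code embedded in the file, which is exactly the problem
# .safetensors was designed to avoid.
# state_dict = torch.load("untrusted_model.bin")  # don't do this with files you don't trust
```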
There’s a lot more to talk about, but I don’t want to make this post longer than it already is. Also, most of this is from memory, so I hope I haven’t missed or misstated any important details.