Ive been playing koboldai horde but the queue annoys me. I want a nsfw ai for playing on tavernai chat
koboldai horde
I mean, you can run KoboldAI locally.
I don’t know whether you’d consider that sufficiently fast. But if you’re already using that and happy with it, it’s probably what I’d try first.
The 7600 is the 16GB? I can’t say for AMD but a 16 GB 3080Ti can run a whole lot of something. I don’t do Kobold because building it was too much of a headache of dependencies. I don’t do silly tavern either because I prefer more control and versatility.
I’m using an 18 core 12th gen with 64GB of sysmem and mostly use llama.cpp so that I can split the load between CPU and GPU. I wrote a little command line function that polls nvidia-smi and parses the GPU memory to tell me exactly how much I have used and what I have left over. That runs every 5 seconds in the terminal and displays the metrics on the title bar. Knowing exactly how much RAM you’re using in the GPU and dialing in the settings with new models makes a big difference. The various models have very different requirements and settings optimisation potential.
I run an 8×7B quantized model at 5 bits most of the time. It takes around 50GB to initially load, but runs like a 13B after that and is quire light weight.
I’m somewhat limited when it comes to training LoRA’s. Like I can only do 7-8B model stuff in that space, but with a GGUF I can run up to a 70B. I wish I had more than 64 GB of system memory though. At 96 or 128 I could run some of the 120B models. Command R is pretty popular and powerful, but I can’t load that one.
The 16 GB can run something like moistral 11B in transformers and 4-bit using bits and bites too.
How much speed are you actually getting on Mixtral (I assume that’s the 8x7b). I have 64 GB of RAM and an AMD RX 6800 XT with 16 GB of VRAM. I get like 4 tokens per second with Q5_K_M quant.
Get una-thebeagle-7b-v1.Q4_K_M. I found it looking at this guide.
I cant clone it
What do you mean?
There’s a fork of text-generation-webui with HIP support, you should use that
https://github.com/YellowRoseCx/koboldcpp-rocm
That will be optimized for AMD and as far as I know has the same / a very similar user interface.
(The 8GB of VRAM on your graphics card will be some limitation. So maybe stick with smaller and quantized models.)
And share your success stories on !ChatbotsNSFW@lemmynsfw.com
Install ollama. It has ROCm support (on Linux at least). Then hook it up to your favorite client. It has its own API and an openai compatible one.