Ollama is a system for running Large Language Models (LLMs) locally.

While not required, GPU support is strongly recommended: LLM inference is compute-intensive, and without acceleration response times are usually too slow to be practical.

Operational Notes

  1. Individual models must be downloaded before they can be used. You can do this by exec-ing into a running pod or via the REST API. Mount a PVC at /root/.ollama so that downloaded models survive pod restarts.
ollama pull llama3

curl -u 'username:password' http://<ollama-host>/api/pull -d '{"model": "llama3"}'
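The pull step can also be scripted. A minimal Python sketch, assuming the Ollama REST API is reachable at a base URL of your choosing (the host below is a placeholder) and using the documented /api/pull endpoint; add basic-auth headers as your ingress requires:

```python
import json
import urllib.request

def build_pull_request(base_url: str, model: str) -> urllib.request.Request:
    """Build a POST to Ollama's /api/pull endpoint for the given model."""
    payload = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/pull",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder host; substitute your service address before sending.
req = build_pull_request("http://ollama.example.com", "llama3")
print(req.full_url)       # http://ollama.example.com/api/pull
print(req.data.decode())  # {"model": "llama3"}
```

Sending the request with `urllib.request.urlopen(req)` streams pull-progress JSON lines until the download completes.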
  2. You’ll likely need to preload models to avoid timeouts on initial calls. Pass a keep_alive value in the request to control how long the model stays loaded in memory (a negative value means keep it loaded indefinitely).
curl -u 'username:password' http://<ollama-host>/api/generate -d '{"model": "llama3", "keep_alive": -1}'
  3. Functional test:
curl -u 'username:password' http://<ollama-host>/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
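The functional test above can likewise be driven from Python. A sketch under the same assumptions (placeholder host and credentials); only the payload construction runs here, with the live call shown commented out:

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """JSON body for Ollama's /api/generate endpoint."""
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": stream}
    ).encode("utf-8")

payload = build_generate_payload("llama3", "Why is the sky blue?")
print(json.loads(payload))

# Placeholder host; uncomment to run against a live instance (add auth as needed):
# req = urllib.request.Request(
#     "http://ollama.example.com/api/generate",
#     data=payload,
#     headers={"Content-Type": "application/json"},
#     method="POST",
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

With `"stream": false` the server returns one JSON object whose `response` field holds the full completion, which is easier to assert on in automated checks than the default streamed chunks.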