Python Package · MIT License · v2.0+

One API for
All LLMs.
Zero Lock-In.

PlugLLM gives you a single, clean Python interface for 13+ AI providers — OpenAI, Gemini, Groq, Claude, Mistral, Ollama, and more. Built-in memory, streaming, and async. Switch providers in one line.

$ pip install plugllm
— GitHub Stars
13+ Providers
4 Core Methods
<5K Lines of Code
demo.py — PlugLLM · python
from plugllm import ChatOpenAI, ChatGroq, LLMFactory

# ── Direct provider usage ──────────────────
llm = ChatOpenAI(api_key="sk-...", model="gpt-4o")
reply = llm.ask("What is machine learning?")
print(reply)

# ── Factory: switch providers in 1 line ───
llm = LLMFactory.create(
    "groq", api_key="gsk_...",
    model="llama-3.3-70b-versatile"
)
print(llm.generate("Explain AI in one sentence"))

# ── Built-in session memory ────────────────
chat = ChatGroq(api_key="gsk_...", max_history=10)
chat.chat("My name is Alice", session_id="u1")
response = chat.chat("What's my name?", session_id="u1")
# → "Your name is Alice!"

# ── Real-time streaming ────────────────────
for chunk in llm.stream("Tell me a story"):
    print(chunk, end="", flush=True)
Running · plugllm 2.0
Python 3.9+
Supported Providers
🤖 OpenAI 💎 Google Gemini ⚡ Groq 🧠 Anthropic Claude 🌟 Mistral AI 🏠 Ollama (Local) 🔵 DeepSeek 🌊 Cohere 🐉 Alibaba Qwen 🌙 Moonshot Kimi ⚙️ xAI Grok 🦙 Meta Llama 🇮🇳 SarvamAI
✦ Why PlugLLM

Everything you need.
Nothing you don't.

A focused, production-ready library that does one thing exceptionally well: unified LLM access.

🔌
Unified API
Identical .generate(), .chat(), .stream(), and .ask() methods across every provider. Swap OpenAI for Groq with a one-line change.
🧠
Built-in Conversation Memory
Native sliding-window deque memory per session ID. Configure max_history=20 and forget about context management.
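The sliding-window behaviour can be sketched in plain Python with `collections.deque` — an illustrative model of per-session bounded memory, not PlugLLM's actual internals:

```python
from collections import deque

class SessionMemory:
    """Illustrative sliding-window memory keyed by session ID."""

    def __init__(self, max_history: int = 20):
        self.max_history = max_history
        self.sessions: dict[str, deque] = {}

    def append(self, session_id: str, role: str, content: str) -> None:
        # A bounded deque silently drops the oldest message once full.
        window = self.sessions.setdefault(
            session_id, deque(maxlen=self.max_history)
        )
        window.append({"role": role, "content": content})

    def history(self, session_id: str) -> list[dict]:
        return list(self.sessions.get(session_id, []))

memory = SessionMemory(max_history=2)
memory.append("u1", "user", "My name is Alice")
memory.append("u1", "assistant", "Hi Alice!")
memory.append("u1", "user", "What's my name?")  # evicts the oldest entry
print(len(memory.history("u1")))  # → 2
```

Because the deque is bounded, the window stays a fixed size with no explicit trimming code.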
🌊
Real-time Streaming
Sync and async streaming out of the box. for chunk in llm.stream(prompt) just works for every provider — perfect for chat UIs.
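The consumption pattern is the same regardless of provider. A self-contained sketch, using a stand-in generator in place of a real `llm.stream()` call:

```python
from typing import Iterator

def fake_stream(text: str) -> Iterator[str]:
    """Stand-in for llm.stream(): yields the reply in small chunks."""
    for word in text.split():
        yield word + " "

# Accumulate chunks as they arrive; a real UI would render each one.
reply = ""
for chunk in fake_stream("Streaming keeps UIs responsive"):
    reply += chunk  # in a real app: print(chunk, end="", flush=True)
print(reply.strip())
```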
🔄
Full Async Support
Every synchronous method has an async counterpart — agenerate(), achat(), astream(), aask(). Compatible with asyncio, FastAPI, and any async framework.
python · async
import asyncio
from plugllm import ChatGemini

async def main():
    llm = ChatGemini()
    result = await llm.agenerate("Hello!")

    async for chunk in llm.astream("Tell a story"):
        print(chunk, end="", flush=True)

asyncio.run(main())
๐Ÿญ
LLMFactory & Fluent API
Create any provider dynamically with LLMFactory.create("groq", ...). Chain with .with_system(), .with_temperature(0.7), .call().
🔒
Production Error Handling
Typed exception hierarchy: AuthenticationError, RateLimitError, HTTPStatusError. Built-in retries, configurable timeouts, structured ChatResponse.
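A typed hierarchy lets callers catch narrowly (one failure mode) or broadly (any library error). A minimal sketch of the pattern — the class names mirror those above, but this is an illustration, not PlugLLM's actual definitions:

```python
class PlugLLMError(Exception):
    """Base class: catching this handles any library error."""

class AuthenticationError(PlugLLMError):
    """Raised for invalid or missing API keys."""

class RateLimitError(PlugLLMError):
    """Raised when the provider throttles requests."""

class HTTPStatusError(PlugLLMError):
    """Carries the HTTP status code alongside the message."""
    def __init__(self, status_code: int, message: str):
        super().__init__(message)
        self.status_code = status_code

# Narrow handlers run first; the base class is the catch-all.
try:
    raise RateLimitError("429 Too Many Requests")
except AuthenticationError:
    handled = "auth"
except PlugLLMError as e:
    handled = f"generic: {e}"
print(handled)  # → "generic: 429 Too Many Requests"
```

Because every specific error subclasses the base, a single `except PlugLLMError` is always a safe fallback.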
✦ 13+ Providers

Every major LLM.
One import.

From cloud giants to local models — all accessible through a single unified interface.

Provider · Class · Links
🤖 OpenAI
ChatOpenAI · Get Key ↗
💎 Google Gemini
ChatGemini · Get Key ↗ · Free
⚡ Groq
ChatGroq · Get Key ↗ · Free
🧠 Anthropic Claude
ChatClaude · Get Key ↗
🌟 Mistral AI
ChatMistral · Get Key ↗ · Free
⚙️ xAI Grok
ChatGrok · Get Key ↗
🔵 DeepSeek
ChatDeepSeek · Get Key ↗
🏠 Ollama
ChatOllama · Install ↗ · Local
🌊 Cohere
ChatCohere · Get Key ↗
🐉 Alibaba Qwen
ChatQwen · Get Key ↗
🌙 Moonshot Kimi
ChatKimi · Get Key ↗
🦙 Meta Llama
ChatLlama · Flexible
🇮🇳 SarvamAI
ChatSarvam · Get Key ↗ · Free
✦ Comparison

PlugLLM vs the rest.

Focused simplicity beats heavyweight orchestration for most production AI apps.

Capability · PlugLLM
Switch provider in 1 line — ✓ Yes: change class or factory string
Built-in memory — ✓ Native deque, per-session
Streaming (sync + async) — ✓ Both sync & async, unified
Package footprint — ✓ <5K LoC, ultra-light
Learning curve — ✓ Minutes
✦ Tutorial

From zero to production
in 5 steps.

A complete, hands-on guide covering installation through advanced async patterns.

1
Install & Configure API Keys
Install the package and set your provider API keys as environment variables or pass them directly.
bash · installation
# Install PlugLLM from PyPI
pip install plugllm

# Optional: with dev dependencies
pip install "plugllm[dev]"

# Set your API keys
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="AIza..."
export GROQ_API_KEY="gsk_..."
export ANTHROPIC_API_KEY="sk-ant-..."
2
Basic Generation & Q&A
Four core methods work identically across all providers. Mix and match freely.
python · basic usage
from plugllm import ChatOpenAI, ChatGroq, LLMFactory

# .generate() — simple text completion
llm = ChatOpenAI(model="gpt-4o")
text = llm.generate("What is the difference between AI and ML?")

# .ask() — Q&A with optional system prompt
answer = llm.ask(
    "Explain transformers architecture",
    system_prompt="You are a senior ML engineer. Be concise.",
    max_tokens=300
)
print(answer.content)   # answer is a ChatResponse
print(answer.model)     # "gpt-4o"

# LLMFactory — dynamic provider switching
llm2 = LLMFactory.create("gemini", model="gemini-2.5-flash")
result = llm2.generate("Summarize the Transformer paper")

# Fluent chaining interface
response = (
    ChatGroq()
    .with_system("You are a Python expert.")
    .with_temperature(0.2)
    .call("Write a binary search function")
)
3
Session-based Conversation Memory
Built-in sliding-window memory per session ID. No external storage required.
python · session memory
from plugllm import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", max_history=20)

# Multi-user session isolation
llm.chat("My name is Alice, I live in Paris", session_id="alice")
llm.chat("My favourite language is Python",   session_id="alice")
llm.chat("I'm Bob, a data scientist in Tokyo", session_id="bob")

# Each session has completely isolated memory
print(llm.chat("Where do I live?",    session_id="alice"))
# → "You live in Paris."

print(llm.chat("What do I work with?", session_id="bob"))
# → "You work as a data scientist."

# Clear memory for a session
llm.clear_history(session_id="alice")
history = llm.get_history(session_id="bob")
4
Real-time Token Streaming
Stream responses token-by-token for live UIs, terminals, and chat applications.
python · streaming
from plugllm import ChatGroq

llm = ChatGroq(model="llama-3.3-70b-versatile")

# Synchronous streaming
for chunk in llm.stream("Write a short poem about Python"):
    print(chunk, end="", flush=True)

# Stream with system prompt
for chunk in llm.stream(
    "Explain async/await in Python",
    system_prompt="You are a concise technical writer."
):
    print(chunk, end="", flush=True)
5
Async & High-Concurrency
Run parallel LLM calls with asyncio โ€” ideal for batch processing and high-throughput APIs.
python · asyncio
import asyncio
from plugllm import ChatGemini, ChatGroq, ChatOpenAI

async def parallel_summarize(text: str) -> dict:
    providers = {
        "GPT-4o":  ChatOpenAI(model="gpt-4o-mini"),
        "Gemini":  ChatGemini(model="gemini-2.5-flash"),
        "Llama":   ChatGroq(model="llama-3.3-70b-versatile"),
    }
    prompt = f"Summarize in 2 sentences:\n{text}"
    tasks = {name: llm.agenerate(prompt) for name, llm in providers.items()}
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks.keys(), results))

async def main():
    summaries = await parallel_summarize(
        "Transformer architecture uses self-attention..."
    )
    for provider, summary in summaries.items():
        print(f"\n[{provider}]\n{summary}")

asyncio.run(main())
✦ Real-World Projects

Build something real.

Three complete production-ready projects you can run right now.

PROJECT 01
Multi-LLM Content Summarizer
Benchmark multiple providers simultaneously and compare their summaries side by side.
OpenAI Groq Gemini
python
from plugllm import ChatOpenAI, ChatGroq, ChatGemini

def multi_summarize(text: str, max_tokens: int = 120):
    providers = {
        "GPT-4o": ChatOpenAI(model="gpt-4o-mini"),
        "Groq":   ChatGroq(model="llama-3.3-70b-versatile"),
        "Gemini": ChatGemini(model="gemini-2.5-flash"),
    }
    results = {}
    prompt = f"Summarize this concisely:\n\n{text}"
    for name, llm in providers.items():
        r = llm.ask(prompt, max_tokens=max_tokens)
        results[name] = {"summary": r.content, "tokens": r.usage}
    return results
PROJECT 02
Customer Support Bot
A stateful support bot with per-user session memory and sentiment-aware escalation.
Mistral Session Memory Stateful
python
from plugllm import ChatMistral

SYSTEM = """You are a helpful support agent.
Be empathetic, solution-focused, and brief."""

class SupportBot:
    def __init__(self):
        self.llm = ChatMistral(
            model="mistral-large-latest",
            max_history=15,
        )

    def handle(self, user_id: str, message: str) -> str:
        return self.llm.chat(
            message, session_id=user_id,
            system_prompt=SYSTEM,
        )

    def reset(self, user_id: str):
        self.llm.clear_history(session_id=user_id)
PROJECT 03
Local Code Assistant
Fully offline code explainer and reviewer using Ollama — no API key, no cloud, no cost.
Ollama 100% Local Free
python
from plugllm import ChatOllama

# Requires: ollama pull deepseek-coder
assistant = ChatOllama(model="deepseek-coder:6.7b")

SYSTEM = """You are an expert code reviewer.
Explain clearly, identify bugs, and suggest
improvements with specific line references."""

def explain_code(code: str) -> str:
    return assistant.ask(
        f"Explain and review:\n\n```python\n{code}\n```",
        system_prompt=SYSTEM,
    ).content

def review_code(code: str) -> str:
    return assistant.ask(
        f"Find bugs and improvements:\n\n```python\n{code}\n```",
        system_prompt=SYSTEM,
    ).content
✦ API Reference

Complete API documentation.

Every method, parameter, and return type — all in one place.

⚡ Core Methods
generate() — Basic text completion. Returns str.
chat() — Memory-aware conversation. Returns ChatResponse.
ask() — Q&A with optional system prompt. Returns ChatResponse.
stream() — Token streaming iterator. Yields str chunks.
agenerate() — Async version of generate().
achat() — Async version of chat().
astream() — Async generator; yields str chunks.
aask() — Async version of ask().
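One common way a library pairs each sync method with an async counterpart is to delegate the blocking call to a worker thread. An illustrative sketch of that pattern (not necessarily how PlugLLM implements it):

```python
import asyncio

class Client:
    def generate(self, prompt: str) -> str:
        # Blocking network call in a real client; a stub here.
        return f"echo: {prompt}"

    async def agenerate(self, prompt: str) -> str:
        # Run the blocking sync method without stalling the event loop.
        return await asyncio.to_thread(self.generate, prompt)

print(asyncio.run(Client().agenerate("hi")))  # → "echo: hi"
```

This keeps the sync and async APIs behaviourally identical, since both ultimately execute the same code path.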
โš™๏ธ Configuration Parameters
api_keystrProvider API key (or env var)
modelstrModel name/version string
temperaturefloatCreativity 0.0โ€“2.0 (default: 0.7)
max_tokensintMax output tokens (default: 1024)
max_historyintMemory window size (default: 10)
timeoutintRequest timeout in seconds
max_retriesintRetry count on failure (default: 3)
top_pfloatNucleus sampling probability
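The max_retries setting typically means "retry up to N times after the first failure, then re-raise." A plain-Python sketch of that behaviour with exponential backoff — illustrative only, not PlugLLM's actual retry code:

```python
import time

def call_with_retries(fn, max_retries: int = 3, base_delay: float = 0.01):
    """Retry fn up to max_retries times after the initial attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

attempts = []
def flaky():
    """Fails twice, then succeeds — simulates a transient provider error."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "ok"

print(call_with_retries(flaky))  # → "ok" after 2 failed attempts
```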
📦 ChatResponse Object
python
response = llm.ask("What is Python?")

response.content       # str — the text response
response.model         # str — model used
response.usage         # dict — token counts
response.raw_response  # dict — full API response
response.finish_reason # str — "stop", "length"

# Message factory helpers
from plugllm.types import Message
msg = Message.user("Hello!")
msg = Message.assistant("Hi there!")
msg = Message.system("You are an expert.")
🔒 Error Handling
python
from plugllm import ChatOpenAI
from plugllm.types import (
    AuthenticationError,
    RateLimitError,
    HTTPStatusError,
    PlugLLMError,
)

llm = ChatOpenAI(max_retries=3)

try:
    response = llm.generate("Hello!")

except AuthenticationError as e:
    print(f"Invalid API key: {e}")

except RateLimitError as e:
    print(f"Rate limited: {e}")

except HTTPStatusError as e:
    print(f"HTTP {e.status_code}: {e}")

except PlugLLMError as e:
    print(f"General error: {e}")

Start building today.

One command to install. One import to access every major LLM. Zero vendor lock-in.

$ pip install plugllm