Deep Dive: Understanding Sim Studio's Database Architecture
A comprehensive exploration of the PostgreSQL database schema powering Sim Studio's AI workflow builder - from normalized workflow graphs to vector embeddings
Sim Studio is an open-source visual workflow builder that enables developers to design, deploy, and monitor AI agent workflows. Behind its intuitive drag-and-drop canvas lies a sophisticated PostgreSQL database schema that persists everything from workflow definitions to execution traces, user authentication, and semantic search embeddings.
In this deep dive, we’ll explore the database architecture that powers over 60,000 developers building AI workflows on Sim Studio.
Overview: A Schema Built for Scale
The Sim Studio database schema is implemented using Drizzle ORM and consists of 38 tables organized into eight functional domains:
- Workflow System - Normalized workflow graphs with blocks, edges, and subflows
- Authentication & Authorization - Multi-provider OAuth with Better Auth
- Organizations & Workspaces - Multi-tenant architecture
- Execution Logging - Comprehensive traces with cost tracking
- Knowledge Base - Document storage with pgvector embeddings
- Scheduling & Webhooks - Cron jobs and external triggers
- Rate Limiting - Per-user and per-organization counters
- Permissions - Fine-grained access control
The schema leverages PostgreSQL’s advanced features including:
- JSONB columns for flexible, semi-structured data
- pgvector extension for semantic search capabilities
- HNSW indexes for high-performance similarity queries
- Atomic SQL operations for race-condition-free rate limiting
Part 1: The Workflow Data Model
At the heart of Sim Studio is its workflow system. Workflows are visual DAGs (Directed Acyclic Graphs) where blocks represent operations and edges represent data flow between them.
The Dual-State Architecture
Sim Studio maintains workflow definitions in two complementary formats:
1. Denormalized State (JSONB) - The workflow.state column contains the complete workflow as edited in the UI:
// workflow.state structure
{
blocks: {
"block-uuid-1": { type: "agent", name: "GPT-4 Agent", ... },
"block-uuid-2": { type: "api", name: "Fetch Data", ... }
},
edges: [
{ source: "block-uuid-1", target: "block-uuid-2", ... }
],
loops: { ... },
parallels: { ... }
}
This denormalized format is the source of truth for workflow execution - it provides instant access to the complete workflow topology without joins.
2. Normalized Tables - The same data is decomposed into relational tables for efficient querying:
Core Workflow Tables
workflow - The Primary Table
| Column | Type | Description |
|---|---|---|
id | uuid | Primary key |
userId | uuid | Owner reference |
workspaceId | uuid | Workspace container |
folderId | uuid | Optional folder for organization |
name | text | Workflow name |
description | text | Workflow description |
color | text | UI color code (hex) |
state | jsonb | Complete workflow definition |
isDeployed | boolean | Deployment status |
runCount | integer | Execution counter |
lastRunAt | timestamp | Last execution time |
variables | json | Workflow-level variables |
The state column deserves special attention - it’s a JSONB blob containing the entire workflow graph as visualized on the canvas. This design choice prioritizes read performance during execution over storage normalization.
workflow_blocks - Normalized Block Storage
CREATE TABLE workflow_blocks (
id uuid PRIMARY KEY,
workflow_id uuid REFERENCES workflow(id) ON DELETE CASCADE,
block_id text NOT NULL, -- Internal block identifier
block_type text NOT NULL, -- 'agent', 'api', 'function', etc.
block_name text NOT NULL, -- User-defined name
position jsonb, -- {x, y} canvas coordinates
sub_blocks jsonb, -- Configuration and inputs
outputs jsonb, -- Output schema definition
enabled boolean DEFAULT true,
metadata jsonb
);
Block types in Sim Studio include:
- Trigger blocks:
start,webhook,schedule,api,chat - Processing blocks:
agent,api,function,guardrails - Logic blocks:
condition,router,loop,parallel - Output blocks:
response,evaluator,wait
workflow_edges - Connection Graph
CREATE TABLE workflow_edges (
id uuid PRIMARY KEY,
workflow_id uuid REFERENCES workflow(id) ON DELETE CASCADE,
edge_id text NOT NULL,
source text NOT NULL, -- Source block ID
target text NOT NULL, -- Target block ID
source_handle text, -- Output handle identifier
target_handle text -- Input handle identifier
);
Edges define the DAG structure that determines execution order. The execution engine uses these to:
- Build dependency graphs
- Determine parallel execution opportunities
- Route data between blocks
workflow_subflows - Loops and Parallels
CREATE TABLE workflow_subflows (
id uuid PRIMARY KEY,
workflow_id uuid REFERENCES workflow(id) ON DELETE CASCADE,
subflow_id text NOT NULL,
subflow_type text NOT NULL, -- 'loop' or 'parallel'
config jsonb, -- Iteration configuration
block_ids text[] -- Contained block IDs
);
Subflows are container structures that group blocks for:
- Loop execution: ForEach, While, Do-While patterns
- Parallel execution: Concurrent branch processing
The execution engine expands these into sentinel nodes during DAG compilation, preserving the acyclic property while enabling iteration.
Workflow Organization
workflow_folder - Hierarchical Organization
CREATE TABLE workflow_folder (
id uuid PRIMARY KEY,
user_id uuid REFERENCES "user"(id) ON DELETE CASCADE,
workspace_id uuid REFERENCES workspace(id) ON DELETE CASCADE,
parent_id uuid REFERENCES workflow_folder(id), -- Self-reference
name text NOT NULL,
color text DEFAULT '#6B7280',
is_expanded boolean DEFAULT true,
sort_order integer DEFAULT 0
);
Folders support arbitrary nesting depth through self-referential parent_id, enabling project-style organization.
Part 2: Authentication & Multi-Tenancy
Sim Studio uses Better Auth for authentication, supporting multiple OAuth providers and enterprise SSO.
User & Session Management
user - Core Identity
CREATE TABLE "user" (
id text PRIMARY KEY,
name text NOT NULL,
email text NOT NULL UNIQUE,
email_verified boolean NOT NULL,
image text,
stripe_customer_id text,
is_super_user boolean DEFAULT false
);
session - Active Sessions
CREATE TABLE session (
id text PRIMARY KEY,
user_id text REFERENCES "user"(id) ON DELETE CASCADE,
token text NOT NULL UNIQUE,
expires_at timestamp NOT NULL,
ip_address text,
user_agent text,
active_organization_id text REFERENCES organization(id)
);
Sessions track the active organization context, enabling seamless workspace switching without re-authentication.
account - OAuth Provider Links
CREATE TABLE account (
id text PRIMARY KEY,
user_id text REFERENCES "user"(id) ON DELETE CASCADE,
account_id text NOT NULL,
provider_id text NOT NULL, -- 'github', 'google', etc.
access_token text, -- Encrypted
refresh_token text, -- Encrypted
access_token_expires_at timestamp,
scope text,
password text, -- Hashed (email/password auth)
UNIQUE(user_id, provider_id, account_id)
);
Multi-Tenant Architecture
organization - Team Containers
CREATE TABLE organization (
id uuid PRIMARY KEY,
name text NOT NULL,
slug text UNIQUE, -- URL-safe identifier
logo text,
metadata text -- JSON string
);
member - Organization Membership
CREATE TABLE member (
id uuid PRIMARY KEY,
organization_id uuid REFERENCES organization(id),
user_id uuid REFERENCES "user"(id),
role text NOT NULL -- 'owner', 'admin', 'member'
);
workspace - Workflow Containers
Workspaces provide isolation within organizations:
CREATE TABLE workspace (
id uuid PRIMARY KEY,
organization_id uuid REFERENCES organization(id),
name text NOT NULL,
slug text,
-- Additional workspace settings
);
Subscription & Billing
CREATE TABLE subscription (
id uuid PRIMARY KEY,
plan text NOT NULL, -- 'free', 'pro', 'team', 'enterprise'
status text NOT NULL, -- 'active', 'canceled', 'past_due'
seats integer,
reference_id uuid NOT NULL, -- User ID or Organization ID
metadata jsonb -- Stripe data and allowances
);
The reference_id polymorphically points to either a user (personal subscription) or organization (team subscription).
Part 3: Execution Logging & Observability
Every workflow execution generates comprehensive logs for debugging, cost tracking, and compliance.
workflow_execution_logs - The Execution Record
CREATE TABLE workflow_execution_logs (
id uuid PRIMARY KEY,
workflow_id uuid REFERENCES workflow(id),
execution_id text UNIQUE NOT NULL,
state_snapshot_id uuid, -- Workflow state at execution time
level text, -- 'info', 'error', 'success'
trigger text, -- 'manual', 'api', 'webhook', 'schedule', 'chat'
started_at timestamp,
ended_at timestamp,
total_duration_ms integer,
execution_data jsonb, -- Trace spans and block outputs
cost jsonb, -- Detailed cost breakdown
files jsonb -- Associated file uploads
);
The execution_data Structure
{
"traceSpans": [
{
"blockId": "agent-1",
"blockType": "agent",
"startTime": "2025-01-15T10:00:00Z",
"endTime": "2025-01-15T10:00:02Z",
"input": { ... },
"output": { ... },
"tokens": { "prompt": 150, "completion": 200 }
}
],
"finalOutput": { ... },
"enhanced": true
}
The cost Structure
{
"total": 0.0045,
"input": 0.0015,
"output": 0.0030,
"tokens": {
"total": 350,
"prompt": 150,
"completion": 200
},
"models": {
"gpt-4": { "calls": 1, "cost": 0.0045 }
}
}
paused_executions - Human-in-the-Loop
CREATE TABLE paused_executions (
id uuid PRIMARY KEY,
execution_id text NOT NULL,
workflow_id uuid REFERENCES workflow(id),
status text, -- 'pending', 'fully_resumed'
total_pause_count integer,
resumed_count integer,
pause_data jsonb -- Pause states and user inputs
);
This table enables workflows to pause for human review/input and resume exactly where they left off.
user_stats - Aggregated Metrics
CREATE TABLE user_stats (
user_id uuid PRIMARY KEY REFERENCES "user"(id),
total_executions integer DEFAULT 0,
total_scheduled_executions integer DEFAULT 0,
total_webhook_executions integer DEFAULT 0,
total_api_executions integer DEFAULT 0,
last_active timestamp,
metadata jsonb
);
Part 4: Knowledge Base & Vector Search
Sim Studio’s knowledge base system enables RAG (Retrieval-Augmented Generation) workflows through semantic search.
knowledge_base - Container Configuration
CREATE TABLE knowledge_base (
id uuid PRIMARY KEY,
user_id uuid REFERENCES "user"(id),
workspace_id uuid REFERENCES workspace(id),
name text NOT NULL,
description text,
token_count integer DEFAULT 0,
embedding_model text, -- 'text-embedding-3-small'
embedding_dimension integer, -- 1536
chunking_config jsonb,
deleted_at timestamp -- Soft delete
);
document - File Records
CREATE TABLE document (
id uuid PRIMARY KEY,
knowledge_base_id uuid REFERENCES knowledge_base(id),
filename text NOT NULL,
file_url text, -- S3/Blob storage URL
file_size integer,
mime_type text,
chunk_count integer,
token_count integer,
processing_status text, -- 'pending', 'processing', 'completed', 'failed'
processing_error text,
enabled boolean DEFAULT true,
tag1 through tag7 text, -- Custom metadata tags
deleted_at timestamp
);
The seven tag columns enable flexible metadata categorization without schema changes.
embedding - Vector Storage
CREATE TABLE embedding (
id uuid PRIMARY KEY,
document_id uuid REFERENCES document(id),
knowledge_base_id uuid REFERENCES knowledge_base(id),
chunk_index integer,
content text NOT NULL,
embedding vector(1536) NOT NULL, -- pgvector type
token_count integer,
tag1 through tag7 text -- Inherited from document
);
-- HNSW index for fast similarity search
CREATE INDEX embedding_hnsw_idx ON embedding
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
The HNSW index parameters are tuned for 1536-dimensional embeddings (OpenAI’s default), providing sub-millisecond similarity queries.
Part 5: Scheduling & Webhooks
workflow_schedule - Cron Jobs
CREATE TABLE workflow_schedule (
id uuid PRIMARY KEY,
workflow_id uuid REFERENCES workflow(id),
block_id uuid, -- Schedule trigger block
cron_expression text,
status text, -- 'active', 'disabled', 'paused'
next_run_at timestamp,
last_ran_at timestamp,
last_queued_at timestamp,
last_failed_at timestamp,
failed_count integer DEFAULT 0
);
-- Index for due schedule queries
CREATE INDEX schedule_next_run_idx ON workflow_schedule(next_run_at)
WHERE status = 'active';
webhook - External Triggers
CREATE TABLE webhook (
id uuid PRIMARY KEY,
workflow_id uuid REFERENCES workflow(id),
path text UNIQUE, -- Webhook URL path
is_active boolean DEFAULT true,
provider text, -- 'slack', 'github', 'twilio', 'generic'
provider_config jsonb, -- Provider-specific auth/routing
filters jsonb -- Event filtering rules
);
Part 6: Rate Limiting
user_rate_limits - Atomic Counters
CREATE TABLE user_rate_limits (
reference_id text PRIMARY KEY, -- User ID or Org ID
sync_api_requests integer DEFAULT 0,
async_api_requests integer DEFAULT 0,
api_endpoint_requests integer DEFAULT 0,
window_start timestamp,
last_request_at timestamp,
is_rate_limited boolean DEFAULT false,
rate_limit_reset_at timestamp
);
Rate limits are enforced using atomic SQL increments:
UPDATE user_rate_limits
SET sync_api_requests = sync_api_requests + 1,
last_request_at = NOW()
WHERE reference_id = $1
RETURNING sync_api_requests;
Rate Limit Tiers
| Plan | Sync API/min | Async API/min | External API/min |
|---|---|---|---|
| free | 10 | 50 | 10 |
| pro | 25 | 200 | 30 |
| team | 75 | 500 | 60 |
| enterprise | 150 | 1000 | 120 |
Part 7: Permissions System
permissions - Fine-Grained Access Control
CREATE TABLE permissions (
id uuid PRIMARY KEY,
user_id uuid REFERENCES "user"(id),
entity_type text NOT NULL, -- 'workspace', 'workflow', 'knowledge_base'
entity_id uuid NOT NULL,
permission_type text NOT NULL, -- 'read', 'write', 'admin'
UNIQUE(user_id, entity_type, entity_id)
);
Permissions are checked via efficient joins:
SELECT w.* FROM workflow w
LEFT JOIN permissions p ON p.entity_id = w.id
AND p.entity_type = 'workflow'
AND p.user_id = $1
WHERE w.user_id = $1
OR p.permission_type IN ('read', 'write', 'admin');
Entity Relationship Diagram
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐
│ user │────<│ session │ │ organization │
└─────────────┘ └──────────────┘ └─────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐
│ account │ │ member │────>│ workspace │
└─────────────┘ └──────────────┘ └─────────────────┘
│
┌──────────────────────────────────────────┤
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────┐
│ workflow_folder │ │ workflow │
└─────────────────┘ └─────────────────────┘
│
┌──────────────┬──────────────┬───────────┼───────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌──────────────┐ ┌──────────┐ ┌───────────┐ ┌─────────┐ ┌─────────┐
│workflow_blocks│ │ edges │ │ subflows │ │ webhook │ │schedule │
└──────────────┘ └──────────┘ └───────────┘ └─────────┘ └─────────┘
┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ knowledge_base │────<│ document │────<│ embedding │
└─────────────────┘ └──────────────┘ └─────────────────┘
Design Patterns & Best Practices
1. Dual-State Pattern
The workflow system maintains both denormalized (JSONB) and normalized (relational) representations. This provides:
- Fast reads during execution (single JSONB fetch)
- Efficient queries for search and filtering (indexed columns)
- Atomic updates via JSONB operations
2. Soft Deletes
Tables like document and knowledge_base use deleted_at timestamps instead of hard deletes, enabling:
- Audit trails
- Undo functionality
- Compliance requirements
3. Polymorphic References
The subscription.reference_id and user_rate_limits.reference_id columns can point to either users or organizations, avoiding table duplication.
4. JSONB for Flexibility
Complex, evolving structures like block configurations and execution traces use JSONB columns, enabling schema evolution without migrations.
5. Strategic Indexing
- Foreign key indexes for join performance
- Timestamp indexes (DESC) for “recent items” queries
- Partial indexes for filtered queries (e.g., active schedules)
- HNSW vector indexes for similarity search
Conclusion
Sim Studio’s database schema represents a thoughtful balance between relational integrity and document flexibility. By combining:
- Normalized tables for efficient querying and referential integrity
- JSONB columns for complex, evolving structures
- pgvector for AI-powered semantic search
- Better Auth for enterprise-grade authentication
The architecture supports the demanding requirements of a modern AI workflow platform: real-time collaboration, comprehensive observability, multi-tenant isolation, and sub-second vector search across millions of embeddings.
For developers looking to contribute or deploy self-hosted instances, understanding this schema provides the foundation for extending Sim Studio’s capabilities while maintaining data integrity and performance.
Want to explore more? Check out the full schema on GitHub or dive into the official documentation.