AI-Assisted Synthetic Time Series Generation: Architecture and Technical Design

This document describes the architecture, data flow, and technical design of Phoenix's AI-assisted synthetic time series generation system. It focuses on how the LLM integration works, the role of Python in the generation pipeline, and the security model that makes this approach safe.

Design Philosophy

The AI layer in Phoenix follows a parameterized generation architecture. The LLM does not generate Python code that gets executed. Instead, it translates natural language descriptions into a structured JSON parameter schema, which is then fed into a deterministic Python engine (TimeSeriesGenerator) that produces the data using NumPy and Pandas.

This separation has three key benefits:

  1. Safety: No arbitrary code execution. The attack surface is limited to validated numeric parameters.
  2. Determinism: The same parameters always produce the same class of output (modulo random noise seeds).
  3. Auditability: Every generated time series can be fully described by its parameter set, which is stored alongside the data.

Architecture Overview

                        +-----------------------+
                        |     User (Browser)    |
                        +----------+------------+
                                   |
                    +--------------+--------------+
                    |                             |
           Natural Language               Manual Form Input
           via Chat (WebSocket)           via HTTP POST
                    |                             |
                    v                             |
          +-------------------+                   |
          |   pydantic-ai     |                   |
          |   Agent (LLM)     |                   |
          +--------+----------+                   |
                   |                              |
              JSON Parameters                     |
              (structured schema)                 |
                   |                              |
                   v                              |
          +-------------------+                   |
          |  Frontend         |                   |
          |  (Alpine.js)      |<------------------+
          |  Form Population  |
          +--------+----------+
                   |
              HTTP POST (FormData)
                   |
                   v
           +--------------------+
           |  Django View       |
           |  + Form Validation |
           +--------+-----------+
                   |
                   v
           +-----------------------+
          |  TimeSeriesGenerator  |
          |  (NumPy / Pandas)     |
          +-----------+-----------+
                      |
                 pd.DataFrame
                      |
              +-------+-------+
              |               |
           Preview         Save to DB
           (JSON chart     (TimeSeries +
            response)       Metadata)

System Components

1. AI Agent (apps/ai/)

The AI layer is responsible for interpreting natural language and producing structured generation parameters.

  • apps/ai/prompts/tsgen.py -- System prompt with schema, domain knowledge, and generation rules
  • apps/ai/agents.py -- Agent factory using pydantic-ai; defines the tsgen agent type
  • apps/ai/types.py -- UserDependencies dataclass for injecting user context
  • apps/ai/handlers.py -- Event stream handler for logging and monitoring

2. Chat Layer (apps/chat/)

Handles real-time WebSocket communication between the browser and the AI agent. Beyond text input, the chat interface supports image uploads and speech-to-text recognition, both of which significantly improve the user experience for time series generation. Users can photograph or screenshot an existing signal plot and send it directly to the agent, which interprets visual characteristics -- shape, frequency content, amplitude, noise levels, trends -- and produces matching generation parameters. Speech-to-text allows users to verbally describe complex multi-channel setups hands-free, which is particularly useful for engineers working in industrial environments where typing may be impractical. These multimodal inputs feed into the same AI pipeline: the LLM interprets the content and responds with the standard JSON parameter block, keeping the architecture and validation layers unchanged.

  • apps/chat/consumers.py -- TsgenAgentChatConsumer, the WebSocket consumer with JSON extraction
  • apps/chat/routing.py -- WebSocket route: ws/tsgenagent/
  • apps/chat/sessions.py -- AgentSession, which manages agent lifecycle and streaming

3. Generation Engine (apps/phoenix/)

The generation engine is the deterministic core of Phoenix. It receives validated parameters and produces synthetic time series data using pure Python numerical computation. The engine is built around the TimeSeriesGenerator class, which exposes a fluent builder API for constructing signals as a superposition of a baseline mean, linear trends, sinusoidal oscillations, and Gaussian noise. For multi-channel scenarios, cross-channel correlations are applied via Cholesky decomposition on the correlation matrix. All heavy computation is delegated to NumPy vectorized operations, and the output is returned as a Pandas DataFrame with a DatetimeIndex. Because the engine operates exclusively on bounded, validated numeric inputs -- never on generated code -- it is both fast (sub-second for the 10,000-point limit) and inherently safe.

  • apps/phoenix/generators.py -- TimeSeriesGenerator class and Pydantic config models
  • apps/phoenix/views.py -- Django views for preview and save endpoints
  • apps/phoenix/forms.py -- GenerateTimeSeriesForm with server-side validation
  • apps/phoenix/constants.py -- System limits (MAX_POINTS_PER_SERIES, MAX_CHANNELS, etc.)

4. Data Storage (apps/sentinel/)

Generated time series are persisted in PostgreSQL through Django's ORM. Each saved series is stored as a TimeSeries record linked to the authenticated user, with the actual data (timestamps and values) held in a JSONField: a single {timestamps, values} pair for single-channel data, or {timestamps, channels: [{name, values}]} for multi-channel data. An associated TimeSeriesMetadata record stores the complete generation parameters as JSON, enabling exact regeneration of an existing series. For datasets that exceed the 10,000-point limit imposed on basic accounts (superusers are exempt from the point cap and can generate larger series), and for data uploaded or processed elsewhere in the application, a TimeSeriesDataPoint model backed by a TimescaleDB hypertable provides efficient time-partitioned storage. Phoenix-generated data stays on the JSON storage path, keeping writes simple and reads fast for the visualization layer.

  • apps/sentinel/models.py -- TimeSeries, TimeSeriesMetadata, TimeSeriesDataPoint models
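
As an illustration of the two JSONField shapes described above (the values are made up, and the single_channel / multi_channel wrapper keys exist only to show both shapes side by side):

```json
{
  "single_channel": {
    "timestamps": ["2024-01-01T00:00:00", "2024-01-01T00:01:00"],
    "values": [21.4, 21.6]
  },
  "multi_channel": {
    "timestamps": ["2024-01-01T00:00:00", "2024-01-01T00:01:00"],
    "channels": [
      {"name": "temperature", "values": [21.4, 21.6]},
      {"name": "pressure", "values": [101.2, 101.1]}
    ]
  }
}
```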

5. Frontend (assets/javascript/phoenix/)

The frontend is built with Django templates styled using Tailwind CSS and DaisyUI, with Alpine.js providing client-side reactivity. The main generation UI is a single Alpine.js component that manages multi-channel configuration (add/remove channels, per-channel oscillations, noise, trends, degradation), correlation editing, client-side Nyquist aliasing checks, and preview rendering via Plotly.js. The AI chat panel communicates over a WebSocket and dispatches a custom DOM event (tsgen:apply-params) when the LLM produces generation parameters, which the Alpine component listens for and uses to populate the form. JavaScript is bundled by Vite and served through django-vite.

  • assets/javascript/phoenix/generate.js -- Alpine.js component for the generation UI

Data Flow

Path A: AI-Assisted Generation

1. User types: "Generate a 24-hour temperature signal with daily oscillation"
       |
2. WebSocket message → TsgenAgentChatConsumer
       |
3. AgentSession.get_response_streaming() → pydantic-ai Agent
       |
4. Agent calls OpenAI API with tsgen system prompt + conversation history
       |
5. LLM streams response: explanation text + ```json { ... } ``` block
       |
6. Tokens streamed back to browser via WebSocket (real-time rendering)
       |
7. on_response_complete() fires → _extract_generation_params() extracts JSON
       |
8. WebSocket sends: { type: "generation_params", params: { ... } }
       |
9. Frontend dispatches 'tsgen:apply-params' DOM event
       |
10. applyAgentParams() populates form fields from JSON
       |
11. User reviews parameters, clicks "Preview"
       |
12. [Continues as Path B from step 1]

Path B: Manual / Preview Generation

1. HTTP POST /phoenix/generate/preview/ with FormData
       |
2. GenerateTimeSeriesForm validates all parameters
       |
3. _build_generator_from_form() constructs TimeSeriesGenerator
       |
4. generator.generate() → pd.DataFrame
       |
5. DataFrame serialized to Plotly.js chart traces + statistics
       |
6. JSON response → Plotly.js renders interactive chart
       |
7. User clicks "Save" → POST /phoenix/generate/save/
       |
8. Re-generates data, creates TimeSeries + TimeSeriesMetadata in DB

The critical design point is that the AI never bypasses validation. Its output populates the same form that manual users fill in, passing through identical server-side validation.


The Prompt Engineering Layer

The system prompt (apps/ai/prompts/tsgen.py) is a ~400-line structured document that gives the LLM everything it needs to produce valid generation parameters. It is divided into several sections:

Schema Definition

The prompt defines the exact JSON schema the LLM must produce, including every field, its type, and valid ranges. The schema covers:

  • Time configuration: Duration (days/hours/minutes/seconds), sampling frequency or time step
  • Channels: Name, unit, mean, noise amplitude, trend slope, oscillations list
  • Per-channel degradation: Removal and outlier settings per channel
  • Global degradation: Fallback removal and outlier settings
  • Channel correlations: Pairwise correlation coefficients
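
To make the schema concrete, a parameter block in its spirit might look like the following. The exact field names here are illustrative assumptions; the authoritative schema lives in apps/ai/prompts/tsgen.py:

```json
{
  "duration": {"hours": 24},
  "sampling_frequency_hz": 0.0167,
  "channels": [
    {
      "name": "temperature",
      "unit": "degC",
      "mean": 22.0,
      "noise_amplitude": 0.3,
      "trend_slope": 0.0,
      "oscillations": [
        {"period_seconds": 86400, "amplitude": 5.0, "phase": 0}
      ]
    }
  ],
  "correlations": []
}
```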

Nyquist-Shannon Compliance

The prompt contains explicit instructions for aliasing prevention:

Given f_s:
- The Nyquist frequency is f_nyquist = f_s / 2
- Hard limit: Every oscillation frequency must be BELOW f_nyquist
- Safe zone: Every oscillation frequency should ideally be below f_nyquist / 2

The LLM is instructed to verify every oscillation against the Nyquist limit before producing its JSON, and to either increase the sampling frequency or lower the oscillation frequency if a violation would occur.
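
The verification the prompt asks the LLM to perform can be written out as a small check (a sketch; the actual server-side check lives in the form validation layer):

```python
def check_nyquist(sampling_frequency_hz: float,
                  oscillation_frequencies_hz: list[float]) -> list[str]:
    """Return one warning per oscillation that violates or crowds the Nyquist limit."""
    f_nyquist = sampling_frequency_hz / 2
    warnings = []
    for f in oscillation_frequencies_hz:
        if f >= f_nyquist:
            # Hard limit: at or above Nyquist the oscillation aliases.
            warnings.append(f"{f} Hz is at/above the Nyquist frequency ({f_nyquist} Hz): aliasing")
        elif f > f_nyquist / 2:
            # Soft limit: above the safe zone, resolution degrades.
            warnings.append(f"{f} Hz is above the safe zone ({f_nyquist / 2} Hz)")
    return warnings

# With 1 Hz sampling, Nyquist is 0.5 Hz and the safe zone ends at 0.25 Hz.
warnings = check_nyquist(1.0, [0.1, 0.3, 0.6])
assert len(warnings) == 2  # 0.3 Hz breaches the safe zone, 0.6 Hz aliases
```

When a violation is found, the remedy mirrors the prompt's instructions: raise the sampling frequency or lower the oscillation frequency.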

Nonlinear Trend Approximation

Since the generator only supports linear trends natively, the prompt teaches the LLM a technique for approximating nonlinear curves using long-period oscillations:

  • Exponential growth: period = 4-8x duration, phase = 3pi/2
  • Logarithmic saturation: period = 4-8x duration, phase = 0
  • S-curve: Two overlapping long-period oscillations with specific phases

This is a creative use of the superposition model -- when period >> duration, only a small arc of the sine wave is visible, and different phase offsets produce different curve shapes.
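
The exponential-growth recipe can be verified numerically: over a short arc of a sine with period = 6x duration and phase = 3pi/2, the curve is monotonically increasing and curving upward, which reads as accelerating, exponential-like growth (a sketch with normalized units):

```python
import numpy as np

duration = 1.0               # normalized duration
period = 6.0 * duration      # period = 6x duration, inside the 4-8x guideline
phase = 3 * np.pi / 2        # phase offset from the exponential-growth recipe

t = np.linspace(0, duration, 500)
arc = np.sin(2 * np.pi * t / period + phase)  # equals -cos(2*pi*t/period)

# Only a sliver of the sine is visible, and on that sliver the signal is
# increasing and convex -- i.e. it looks like accelerating growth.
assert np.all(np.diff(arc) > 0)       # monotonically increasing
assert np.all(np.diff(arc, n=2) > 0)  # curving upward (convex)
```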

Domain Knowledge

The prompt encodes typical defaults for industrial sensor types (vibration, temperature, pressure, flow, acoustic, electrical) so the LLM can make informed choices when users describe signals by domain rather than by mathematical properties.

Image Interpretation

The agent accepts chart images and can interpret visual characteristics (signal shape, frequency content, amplitude ranges, noise levels) to infer generation parameters. This is enabled by the multimodal capabilities of the underlying LLM.


LLM Integration with pydantic-ai

The AI agent is built using the pydantic-ai framework, which provides a structured interface for LLM interactions.

Agent Construction

from django.conf import settings
from pydantic_ai import Agent

def get_tsgen_agent():
    tsgen_instructions = [get_tsgen_system_prompt(), add_user_name, current_datetime]
    return _get_agent([], instructions=tsgen_instructions)

def _get_agent(toolsets, instructions=None):
    return Agent(
        settings.DEFAULT_AGENT_MODEL,   # e.g. "openai:gpt-4o"
        toolsets=toolsets,
        instructions=instructions,
        retries=2,
        deps_type=UserDependencies,
    )

Key characteristics of the tsgen agent:

  • No tools: The agent has an empty toolset ([]). It is a pure text-in/text-out agent with no ability to call functions, query databases, or execute code. Its only output is text containing a JSON block.
  • Dynamic instructions: add_user_name and current_datetime are async functions that inject runtime context (the user's display name and current timestamp) into the system prompt at execution time.
  • Retry policy: 2 retries on failure, handled by pydantic-ai.
  • Dependency injection: UserDependencies provides the authenticated user object to instruction functions via RunContext.

Streaming Execution

The agent is invoked through an async streaming interface:

async def run_agent_streaming(agent, user, message, message_history=None, ...):
    deps = UserDependencies(user=user)
    pydantic_messages = convert_openai_to_pydantic_messages(message_history) if message_history else None
    async with agent.run_stream(message, message_history=pydantic_messages, deps=deps, ...) as result:
        async for text in result.stream_text():
            yield text

Tokens are yielded one at a time and forwarded to the browser via WebSocket, providing real-time response rendering.

Conversation History

The system maintains conversation history in OpenAI message format ([{role, content}]) and converts it to pydantic-ai's ModelMessage format before each request. This allows multi-turn conversations where the user can iteratively refine their time series (e.g., "add more noise", "add a third channel").


WebSocket Communication

The chat interface uses Django Channels with an AsyncWebsocketConsumer.

TsgenAgentChatConsumer

class TsgenAgentChatConsumer(AgentChatConsumerBase):
    agent_type = AgentTypes.TSGEN

    async def on_response_complete(self, response: str):
        params = _extract_generation_params(response)
        if params:
            await self.send(text_data=json.dumps({
                "type": "generation_params",
                "params": params
            }))

The consumer extends a base class that handles message receipt, agent invocation, and token streaming. The tsgen-specific behavior is in on_response_complete(), which fires after the full response has been assembled.

JSON Extraction

import json
import re

def _extract_generation_params(response: str) -> dict | None:
    match = re.search(r"```json\s*(.*?)\s*```", response, re.DOTALL)
    if not match:
        return None
    try:
        params = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(params, dict) or "channels" not in params:
        return None
    return params

This function uses a regex to find the first ```json ``` fenced code block in the LLM's response, parses it with json.loads(), and validates that it is a dictionary containing a "channels" key. If any step fails, it returns None and the response is treated as text-only (no form population).


The Python Generation Engine

The TimeSeriesGenerator class is the core of the system. It is a pure Python engine built on NumPy and Pandas that transforms validated parameters into time series data.

Method Chaining API

The generator uses a fluent builder pattern:

ts = (TimeSeriesGenerator()
      .with_duration(days=7, time_step_seconds=60)
      .with_base_signal(mean=100, noise_amplitude=5.0)
      .with_trend(slope=0.5)
      .with_oscillation(frequency_hz=0.001, amplitude=10.0)
      .generate())

Signal Superposition Model

The generated signal is a superposition of independent components:

signal[i] = mean + trend[i] + sum(oscillations[i]) + noise[i]

Each component is computed as a NumPy array operation:

  • Mean: np.full(n_points, mean)
  • Trend: slope * np.arange(n_points)
  • Oscillation: amplitude * np.sin(2*pi*freq * t + phase), where t = np.arange(n_points) * time_step
  • Noise: np.random.normal(0, amplitude, n_points)
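
The component list above translates directly into vectorized NumPy. The parameter values below are assumptions chosen for illustration, not defaults from the engine:

```python
import numpy as np

n_points, time_step = 1000, 60.0   # 1000 samples at 60 s spacing (assumed)
mean, slope = 100.0, 0.01
freq_hz, amplitude, phase = 0.0001, 10.0, 0.0
noise_amplitude = 2.0

t = np.arange(n_points) * time_step
signal = (
    np.full(n_points, mean)                                 # mean
    + slope * np.arange(n_points)                           # linear trend
    + amplitude * np.sin(2 * np.pi * freq_hz * t + phase)   # oscillation
    + np.random.normal(0, noise_amplitude, n_points)        # Gaussian noise
)

assert signal.shape == (n_points,)
```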

Multi-Channel Correlation via Cholesky Decomposition

When channels have specified correlations, the generator applies them using Cholesky decomposition. This is implemented entirely in NumPy:

def _apply_correlation(self, signals, correlations):
    # 1. Build correlation matrix (identity + user-specified off-diagonal values)
    C = np.eye(n_channels)
    for corr in correlations:
        C[i, j] = C[j, i] = corr.correlation

    # 2. Validate positive semi-definiteness
    eigenvalues = np.linalg.eigvals(C)
    if not np.all(eigenvalues >= -1e-10):
        raise ValueError("Correlation matrix is not positive semi-definite")

    # 3. Normalize to zero mean, unit variance
    normalized = (signals - means) / stds

    # 4. Apply Cholesky factor
    L = np.linalg.cholesky(C)
    correlated = L @ normalized

    # 5. Denormalize back to original scale
    return correlated * stds + means

This ensures the output channels have the desired pairwise correlations while preserving their individual statistical properties (mean, standard deviation, oscillation patterns).
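
A runnable sketch of the mixing step (not the engine's actual code): start from two independent unit-variance channels, build the target correlation matrix, and verify that the Cholesky factor imposes the requested pairwise correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
n_channels, n_points = 2, 50_000

# Two independent, zero-mean, unit-variance signals (rows = channels).
signals = rng.standard_normal((n_channels, n_points))

# Target correlation matrix: identity plus the requested off-diagonal value.
C = np.eye(n_channels)
C[0, 1] = C[1, 0] = 0.8

# Positive semi-definiteness check, then the Cholesky mixing step.
assert np.all(np.linalg.eigvalsh(C) >= -1e-10)
L = np.linalg.cholesky(C)
correlated = L @ signals

# The empirical correlation should land very close to the requested 0.8.
empirical = np.corrcoef(correlated)[0, 1]
assert abs(empirical - 0.8) < 0.02
```

Because the mixing happens on normalized signals, denormalizing afterwards restores each channel's own mean and standard deviation, which is why the individual statistical properties survive.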

Data Degradation

After generation, the engine can apply two forms of degradation to simulate real-world data quality issues:

Data point removal (simulating gaps/missing data):

indices = np.random.choice(n_points, size=n_remove, replace=False)
df = df.drop(df.index[indices])  # Single-channel: drop rows entirely
df.iloc[indices, col] = np.nan   # Multi-channel: set NaN per column

Outlier injection (simulating sensor anomalies) supports three modes:

  • Constant value mode: replace selected points with a fixed value
  • Random range mode: np.random.uniform(min, max, n_outliers)
  • Factor mode: multiply existing values by a scaling factor

Degradation is applied per-channel in multi-channel mode, allowing different channels to have different quality characteristics (e.g., a raw sensor channel with outliers alongside a clean smoothed channel).
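
Both degradation steps can be sketched on a toy DataFrame (illustrative only; the engine's real implementation differs in structure):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_points = 100
df = pd.DataFrame(
    {"sensor_a": rng.normal(100, 5, n_points)},
    index=pd.date_range("2024-01-01", periods=n_points, freq="min"),
)

# Data point removal: mark 10% of the points as missing (multi-channel style).
n_remove = int(0.10 * n_points)
remove_idx = rng.choice(n_points, size=n_remove, replace=False)
df.iloc[remove_idx, 0] = np.nan

# Outlier injection, "random range" mode: overwrite 5 surviving points.
candidates = np.flatnonzero(df["sensor_a"].notna())
outlier_idx = rng.choice(candidates, size=5, replace=False)
df.iloc[outlier_idx, 0] = rng.uniform(200, 300, size=5)

assert df["sensor_a"].isna().sum() == n_remove   # gaps present
assert (df["sensor_a"] > 150).sum() == 5         # outliers present
```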

Pydantic Configuration Models

Every configuration aspect is modeled as a Pydantic BaseModel with validation:

  • TimeConfig -- Time axis specification. Validates mutually exclusive duration vs explicit times.
  • NoiseConfig -- Noise specification. Validates mutually exclusive amplitude vs SNR.
  • RemovalConfig -- Data removal specification. Validates mutually exclusive number vs percentage.
  • OutlierConfig -- Outlier specification. Validates mutually exclusive quantity modes and value modes.
  • OscillationConfig -- Single oscillation. Computes frequency/period reciprocally; requires non-negative amplitude.
  • ChannelConfig -- Per-channel settings. Requires non-negative noise amplitude.
  • ChannelCorrelation -- Pairwise correlation. Bounded to [-1, 1]; forbids self-correlation.
  • MultiChannelConfig -- Multi-channel container. Allows 1-10 channels; validates correlation indices.

These models use @model_validator and @field_validator decorators for constraint enforcement at construction time.

Output Format

The generator returns a Pandas DataFrame with a DatetimeIndex:

  • Single-channel: One column named value
  • Multi-channel: One column per channel, named after the channel

This DataFrame is then serialized into Plotly.js trace format for preview, or stored as JSON in the database for persistence.


Validation Pipeline

Parameters pass through four layers of validation and authorization before reaching the generator:

Layer 1: LLM Prompt Constraints

The system prompt instructs the LLM to respect all constraints (max 10,000 points, max 10 channels, Nyquist compliance). This is a soft boundary -- the LLM usually complies, but its output is not guaranteed to be valid.

Layer 2: Django Form Validation (GenerateTimeSeriesForm)

Server-side validation that catches anything the LLM got wrong:

  • Sampling frequency vs time step consistency
  • Duration requirements
  • Data point limit enforcement (MAX_POINTS_PER_SERIES = 10,000)
  • Channel structure validation (name required, max 10 channels)
  • Oscillation parameter validation
  • Correlation matrix positive semi-definite check (eigenvalue verification)
  • Removal/outlier mode consistency
  • Aliasing detection (non-blocking warnings)

Layer 3: Pydantic Model Validation

The Pydantic models in generators.py enforce type and constraint validation at object construction time. This is the final guard before mathematical operations begin.

Layer 4: View-Level Authorization

Django views enforce:

  • @login_required on all endpoints
  • @require_POST on mutating endpoints
  • User ownership checks for existing time series
  • Per-user series count limits (MAX_SERIES_PER_USER = 3)
  • CSRF protection

Security Model

No Code Execution

The most important security property of this architecture is that the LLM never generates executable code. The entire pipeline operates on validated numeric parameters:

LLM output (text) → JSON extraction (regex + json.loads) → Form validation → Pydantic models → NumPy operations

There is no exec(), eval(), subprocess, or dynamic code execution anywhere in the pipeline. The TimeSeriesGenerator only performs deterministic mathematical operations on validated inputs.

JSON Extraction Safety

The _extract_generation_params() function applies defensive parsing:

  1. Regex finds the first ```json ``` block (no arbitrary code execution)
  2. json.loads() parses the content (rejects non-JSON)
  3. Type check: isinstance(params, dict) (rejects arrays, primitives)
  4. Schema check: "channels" in params (rejects arbitrary JSON objects)
  5. Returns None on any failure (graceful degradation)

Input Boundaries

All numeric inputs are bounded by Pydantic validation before reaching NumPy:

  • Array sizes are capped at 10,000 points
  • Channel count is capped at 10
  • Correlation values are bounded to [-1, 1]
  • Amplitudes and frequencies must be non-negative
  • Duration must be positive
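
A plain-Python sketch of these bounds as a pre-flight check (the constant values match the document; the function itself is illustrative, since the real enforcement happens in the form and Pydantic layers):

```python
MAX_POINTS_PER_SERIES = 10_000
MAX_CHANNELS = 10

def check_bounds(n_points: int, n_channels: int, correlation: float,
                 amplitude: float, duration_seconds: float) -> list[str]:
    """Collect every bound violation rather than failing on the first."""
    errors = []
    if not 0 < n_points <= MAX_POINTS_PER_SERIES:
        errors.append("point count out of range")
    if not 0 < n_channels <= MAX_CHANNELS:
        errors.append("channel count out of range")
    if not -1.0 <= correlation <= 1.0:
        errors.append("correlation outside [-1, 1]")
    if amplitude < 0:
        errors.append("amplitude must be non-negative")
    if duration_seconds <= 0:
        errors.append("duration must be positive")
    return errors

assert check_bounds(5_000, 3, 0.5, 2.0, 3600.0) == []
assert check_bounds(50_000, 3, 1.5, 2.0, 3600.0) == [
    "point count out of range",
    "correlation outside [-1, 1]",
]
```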

No Persistent State in the Agent

The tsgen agent has no tools, no database access, and no file system access. It can only produce text. The conversation history is managed server-side and is scoped to the WebSocket session.


Why Parameterized Generation, Not Code Generation

An alternative architecture would have the LLM generate Python code (e.g., NumPy scripts) that gets executed directly. Phoenix deliberately avoids this for several reasons:

  • Security -- Parameterized: no code execution, bounded numeric inputs. Code generation: requires sandboxing, code review, or restricted execution environments.
  • Validation -- Parameterized: full server-side validation of every parameter. Code generation: difficult to validate arbitrary code for correctness and safety.
  • Reproducibility -- Parameterized: parameters stored in the DB enable identical regeneration. Code generation: code must be stored and re-executed; environment-dependent.
  • User control -- Parameterized: users see and can modify every parameter in the form. Code generation: users must understand generated code to modify it.
  • Error handling -- Parameterized: validation errors map to specific form fields. Code generation: runtime errors in generated code are opaque.
  • Flexibility tradeoff -- Parameterized: limited to the superposition model. Code generation: can express any computable time series.

The flexibility tradeoff is intentional. The superposition model (mean + trend + oscillations + noise) covers the vast majority of synthetic time series use cases in industrial settings. The nonlinear trend approximation technique (long-period oscillations) extends coverage to exponential, logarithmic, and sigmoid shapes without requiring arbitrary code.
