AI-Assisted Synthetic Time Series Generation: Architecture and Technical Design
This document describes the architecture, data flow, and technical design of Phoenix's AI-assisted synthetic time series generation system. It focuses on how the LLM integration works, the role of Python in the generation pipeline, and the security model that makes this approach safe.
Table of Contents
- Design Philosophy
- Architecture Overview
- System Components
- Data Flow
- The Prompt Engineering Layer
- LLM Integration with pydantic-ai
- WebSocket Communication
- The Python Generation Engine
- Validation Pipeline
- Security Model
- Why Parameterized Generation, Not Code Generation
Design Philosophy
The AI layer in Phoenix follows a parameterized generation architecture. The LLM does not generate Python code that gets executed. Instead, it translates natural language descriptions into a structured JSON parameter schema, which is then fed into a deterministic Python engine (TimeSeriesGenerator) that produces the data using NumPy and Pandas.
This separation has three key benefits:
- Safety: No arbitrary code execution. The attack surface is limited to validated numeric parameters.
- Determinism: The same parameters always produce the same class of output (modulo random noise seeds).
- Auditability: Every generated time series can be fully described by its parameter set, which is stored alongside the data.
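The determinism and auditability points can be seen in a tiny sketch; `generate` and its parameter names here are hypothetical, not Phoenix's API:

```python
import numpy as np

# Illustrative sketch only -- generate() and its parameter names are
# hypothetical, not Phoenix's actual API.
def generate(params: dict, seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n = params["n_points"]
    # baseline mean plus Gaussian noise, driven entirely by the stored parameters
    return params["mean"] + rng.normal(0, params["noise"], n)

params = {"n_points": 5, "mean": 100.0, "noise": 2.0}
a = generate(params, seed=7)
b = generate(params, seed=7)
assert np.array_equal(a, b)  # same parameters + same seed -> identical data
```

Because the stored parameter set (plus a seed) fully determines the output, any saved series can be audited or regenerated from its metadata alone.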
Architecture Overview
```
        +-----------------------+
        |     User (Browser)    |
        +-----------+-----------+
                    |
      +-------------+-------------+
      |                           |
Natural Language           Manual Form Input
via Chat (WebSocket)       via HTTP POST
      |                           |
      v                           |
+-------------------+             |
|    pydantic-ai    |             |
|    Agent (LLM)    |             |
+---------+---------+             |
          |                       |
   JSON Parameters                |
 (structured schema)              |
          |                       |
          v                       |
+-------------------+             |
|     Frontend      |             |
|    (Alpine.js)    |<------------+
|  Form Population  |
+---------+---------+
          |
 HTTP POST (FormData)
          |
          v
+-------------------+
|    Django View    |
| + Form Validation |
+---------+---------+
          |
          v
+-----------------------+
|  TimeSeriesGenerator  |
|   (NumPy / Pandas)    |
+-----------+-----------+
            |
      pd.DataFrame
            |
    +-------+--------+
    |                |
 Preview         Save to DB
(JSON chart     (TimeSeries +
 response)       Metadata)
```
System Components
1. AI Agent (apps/ai/)
The AI layer is responsible for interpreting natural language and producing structured generation parameters.
| File | Purpose |
|---|---|
| `apps/ai/prompts/tsgen.py` | System prompt with schema, domain knowledge, and generation rules |
| `apps/ai/agents.py` | Agent factory using pydantic-ai; defines the tsgen agent type |
| `apps/ai/types.py` | `UserDependencies` dataclass for injecting user context |
| `apps/ai/handlers.py` | Event stream handler for logging and monitoring |
2. Chat Layer (apps/chat/)
Handles real-time WebSocket communication between the browser and the AI agent. Beyond text input, the chat interface supports image uploads and speech-to-text recognition, both of which significantly improve the user experience for time series generation. Users can photograph or screenshot an existing signal plot and send it directly to the agent, which interprets visual characteristics -- shape, frequency content, amplitude, noise levels, trends -- and produces matching generation parameters. Speech-to-text allows users to verbally describe complex multi-channel setups hands-free, which is particularly useful for engineers working in industrial environments where typing may be impractical. These multimodal inputs feed into the same AI pipeline: the LLM interprets the content and responds with the standard JSON parameter block, keeping the architecture and validation layers unchanged.
| File | Purpose |
|---|---|
| `apps/chat/consumers.py` | `TsgenAgentChatConsumer` -- WebSocket consumer with JSON extraction |
| `apps/chat/routing.py` | WebSocket route: `ws/tsgenagent/` |
| `apps/chat/sessions.py` | `AgentSession` -- manages agent lifecycle and streaming |
3. Generation Engine (apps/phoenix/)
The generation engine is the deterministic core of Phoenix. It receives validated parameters and produces synthetic time series data using pure Python numerical computation. The engine is built around the TimeSeriesGenerator class, which exposes a fluent builder API for constructing signals as a superposition of a baseline mean, linear trends, sinusoidal oscillations, and Gaussian noise. For multi-channel scenarios, cross-channel correlations are applied via Cholesky decomposition on the correlation matrix. All heavy computation is delegated to NumPy vectorized operations, and the output is returned as a Pandas DataFrame with a DatetimeIndex. Because the engine operates exclusively on bounded, validated numeric inputs -- never on generated code -- it is both fast (sub-second for the 10,000-point limit) and inherently safe.
| File | Purpose |
|---|---|
| `apps/phoenix/generators.py` | `TimeSeriesGenerator` class and Pydantic config models |
| `apps/phoenix/views.py` | Django views for preview and save endpoints |
| `apps/phoenix/forms.py` | `GenerateTimeSeriesForm` with server-side validation |
| `apps/phoenix/constants.py` | System limits (`MAX_POINTS_PER_SERIES`, `MAX_CHANNELS`, etc.) |
4. Data Storage (apps/sentinel/)
Generated time series are persisted in PostgreSQL through Django's ORM. Each saved series is stored as a TimeSeries record linked to the authenticated user, with the actual data (timestamps and values) held in a JSONField -- either as a single {timestamps, values} pair for single-channel data or as {timestamps, channels: [{name, values}]} for multi-channel data. An associated TimeSeriesMetadata record stores the complete generation parameters as JSON, enabling exact regeneration from an existing series. For datasets that grow beyond the 10,000-point limit imposed on basic accounts (superusers can generate larger series, as the point cap is skipped for them), or for uploaded/processed data elsewhere in the application, a TimeSeriesDataPoint model backed by a TimescaleDB hypertable provides efficient time-partitioned storage. Phoenix-generated data stays within the JSON storage path, keeping writes simple and reads fast for the visualization layer.
| File | Purpose |
|---|---|
| `apps/sentinel/models.py` | `TimeSeries`, `TimeSeriesMetadata`, `TimeSeriesDataPoint` models |
5. Frontend (assets/javascript/phoenix/)
The frontend is built with Django templates styled using Tailwind CSS and DaisyUI, with Alpine.js providing client-side reactivity. The main generation UI is a single Alpine.js component that manages multi-channel configuration (add/remove channels, per-channel oscillations, noise, trends, degradation), correlation editing, client-side Nyquist aliasing checks, and preview rendering via Plotly.js. The AI chat panel communicates over a WebSocket and dispatches a custom DOM event (tsgen:apply-params) when the LLM produces generation parameters, which the Alpine component listens for and uses to populate the form. JavaScript is bundled by Vite and served through django-vite.
| File | Purpose |
|---|---|
| `assets/javascript/phoenix/generate.js` | Alpine.js component for the generation UI |
Data Flow
Path A: AI-Assisted Generation
```
1.  User types: "Generate a 24-hour temperature signal with daily oscillation"
        |
2.  WebSocket message → TsgenAgentChatConsumer
        |
3.  AgentSession.get_response_streaming() → pydantic-ai Agent
        |
4.  Agent calls OpenAI API with tsgen system prompt + conversation history
        |
5.  LLM streams response: explanation text + ```json { ... } ``` block
        |
6.  Tokens streamed back to browser via WebSocket (real-time rendering)
        |
7.  on_response_complete() fires → _extract_generation_params() extracts JSON
        |
8.  WebSocket sends: { type: "generation_params", params: { ... } }
        |
9.  Frontend dispatches 'tsgen:apply-params' DOM event
        |
10. applyAgentParams() populates form fields from JSON
        |
11. User reviews parameters, clicks "Preview"
        |
12. [Continues as Path B from step 1]
```
Path B: Manual / Preview Generation
```
1. HTTP POST /phoenix/generate/preview/ with FormData
       |
2. GenerateTimeSeriesForm validates all parameters
       |
3. _build_generator_from_form() constructs TimeSeriesGenerator
       |
4. generator.generate() → pd.DataFrame
       |
5. DataFrame serialized to Plotly.js chart traces + statistics
       |
6. JSON response → Plotly.js renders interactive chart
       |
7. User clicks "Save" → POST /phoenix/generate/save/
       |
8. Re-generates data, creates TimeSeries + TimeSeriesMetadata in DB
```
The critical design point is that the AI never bypasses validation. Its output populates the same form that manual users fill in, passing through identical server-side validation.
The Prompt Engineering Layer
The system prompt (apps/ai/prompts/tsgen.py) is a ~400-line structured document that gives the LLM everything it needs to produce valid generation parameters. It is divided into several sections:
Schema Definition
The prompt defines the exact JSON schema the LLM must produce, including every field, its type, and valid ranges. The schema covers:
- Time configuration: Duration (days/hours/minutes/seconds), sampling frequency or time step
- Channels: Name, unit, mean, noise amplitude, trend slope, oscillations list
- Per-channel degradation: Removal and outlier settings per channel
- Global degradation: Fallback removal and outlier settings
- Channel correlations: Pairwise correlation coefficients
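For illustration, a parameter block in the spirit of this schema might look as follows; the exact field names are assumptions for illustration, not the authoritative schema from the prompt:

```python
import json

# Illustrative parameter block in the spirit of the schema described above.
# Field names here are assumptions, not the authoritative schema.
params = {
    "duration": {"hours": 24},
    "sampling_frequency_hz": 1 / 60,  # one sample per minute
    "channels": [
        {
            "name": "temperature",
            "unit": "degC",
            "mean": 21.5,
            "noise_amplitude": 0.3,
            "trend_slope": 0.0,
            "oscillations": [
                {"period_seconds": 86400, "amplitude": 4.0}  # daily cycle
            ],
        }
    ],
    "correlations": [],
}

# The LLM emits such a block inside a ```json fence, so it must survive a JSON round-trip.
payload = json.dumps(params)
assert json.loads(payload)["channels"][0]["name"] == "temperature"
```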
Nyquist-Shannon Compliance
The prompt contains explicit instructions for aliasing prevention:
Given f_s:
- The Nyquist frequency is f_nyquist = f_s / 2
- Hard limit: Every oscillation frequency must be BELOW f_nyquist
- Safe zone: Every oscillation frequency should ideally be below f_nyquist / 2
The LLM is instructed to verify every oscillation against the Nyquist limit before producing its JSON, and to either increase the sampling frequency or lower the oscillation frequency if a violation would occur.
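That verification can be sketched in a few lines; the helper name is ours, not Phoenix's:

```python
# Sketch of the aliasing check the prompt asks the LLM to perform;
# the helper name is illustrative.
def check_nyquist(sampling_hz: float, oscillation_freqs_hz: list[float]):
    nyquist = sampling_hz / 2
    violations = [f for f in oscillation_freqs_hz if f >= nyquist]  # hard limit
    marginal = [f for f in oscillation_freqs_hz
                if nyquist / 2 <= f < nyquist]                      # outside the safe zone
    return violations, marginal

# 1 Hz sampling -> Nyquist 0.5 Hz: 0.6 Hz aliases, 0.3 Hz is marginal, 0.1 Hz is safe
violations, marginal = check_nyquist(1.0, [0.1, 0.3, 0.6])
assert violations == [0.6]
assert marginal == [0.3]
```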
Nonlinear Trend Approximation
Since the generator only supports linear trends natively, the prompt teaches the LLM a technique for approximating nonlinear curves using long-period oscillations:
- Exponential growth: period = 4-8x duration, phase = 3pi/2
- Logarithmic saturation: period = 4-8x duration, phase = 0
- S-curve: two overlapping long-period oscillations with specific phases
This is a creative use of the superposition model -- when period >> duration, only a small arc of the sine wave is visible, and different phase offsets produce different curve shapes.
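A quick numerical check of the exponential-growth recipe (illustrative constants) confirms the geometry: with period = 6x duration and phase 3pi/2, the visible arc is strictly increasing with an increasing slope, i.e. convex and exponential-like:

```python
import math

# Long-period-oscillation trick: with period >> duration and phase 3pi/2,
# only a rising, convex arc of the sine wave is visible.
duration = 1.0
period = 6 * duration  # within the 4-8x guideline
n = 200
arc = [math.sin(2 * math.pi * (i / (n - 1)) * duration / period + 3 * math.pi / 2)
       for i in range(n)]

diffs = [b - a for a, b in zip(arc, arc[1:])]
assert all(d > 0 for d in diffs)                         # monotonically increasing
assert all(d2 > d1 for d1, d2 in zip(diffs, diffs[1:]))  # accelerating slope: convex
```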
Domain Knowledge
The prompt encodes typical defaults for industrial sensor types (vibration, temperature, pressure, flow, acoustic, electrical) so the LLM can make informed choices when users describe signals by domain rather than by mathematical properties.
Image Interpretation
The agent accepts chart images and can interpret visual characteristics (signal shape, frequency content, amplitude ranges, noise levels) to infer generation parameters. This is enabled by the multimodal capabilities of the underlying LLM.
LLM Integration with pydantic-ai
The AI agent is built using the pydantic-ai framework, which provides a structured interface for LLM interactions.
Agent Construction
```python
def get_tsgen_agent():
    tsgen_instructions = [get_tsgen_system_prompt(), add_user_name, current_datetime]
    return _get_agent([], instructions=tsgen_instructions)


def _get_agent(toolsets, instructions=None):
    return Agent(
        settings.DEFAULT_AGENT_MODEL,  # e.g. "openai:gpt-4o"
        toolsets=toolsets,
        instructions=instructions,
        retries=2,
        deps_type=UserDependencies,
    )
```
Key characteristics of the tsgen agent:
- No tools: The agent has an empty toolset (`[]`). It is a pure text-in/text-out agent with no ability to call functions, query databases, or execute code. Its only output is text containing a JSON block.
- Dynamic instructions: `add_user_name` and `current_datetime` are async functions that inject runtime context (the user's display name and current timestamp) into the system prompt at execution time.
- Retry policy: 2 retries on failure, handled by pydantic-ai.
- Dependency injection: `UserDependencies` provides the authenticated user object to instruction functions via `RunContext`.
Streaming Execution
The agent is invoked through an async streaming interface:
```python
async def run_agent_streaming(agent, user, message, message_history=None, ...):
    deps = UserDependencies(user=user)
    pydantic_messages = convert_openai_to_pydantic_messages(message_history) if message_history else None
    async with agent.run_stream(message, message_history=pydantic_messages, deps=deps, ...) as result:
        async for text in result.stream_text():
            yield text
```
Tokens are yielded one at a time and forwarded to the browser via WebSocket, providing real-time response rendering.
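The streaming pattern can be sketched with a plain async generator; `fake_stream` and `consume` are illustrative stand-ins, not Phoenix code:

```python
import asyncio

# Stand-in for run_agent_streaming(): an async generator yielding tokens one at a time.
async def fake_stream():
    for token in ["Here", " are", " your", " parameters."]:
        yield token

async def consume() -> str:
    chunks = []
    async for text in fake_stream():
        chunks.append(text)  # in Phoenix, each chunk is forwarded over the WebSocket
    return "".join(chunks)

result = asyncio.run(consume())
assert result == "Here are your parameters."
```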
Conversation History
The system maintains conversation history in OpenAI message format ([{role, content}]) and converts it to pydantic-ai's ModelMessage format before each request. This allows multi-turn conversations where the user can iteratively refine their time series (e.g., "add more noise", "add a third channel").
WebSocket Communication
The chat interface uses Django Channels with an AsyncWebsocketConsumer.
TsgenAgentChatConsumer
```python
class TsgenAgentChatConsumer(AgentChatConsumerBase):
    agent_type = AgentTypes.TSGEN

    async def on_response_complete(self, response: str):
        params = _extract_generation_params(response)
        if params:
            await self.send(text_data=json.dumps({
                "type": "generation_params",
                "params": params,
            }))
```
The consumer extends a base class that handles message receipt, agent invocation, and token streaming. The tsgen-specific behavior is in on_response_complete(), which fires after the full response has been assembled.
JSON Extraction
```python
def _extract_generation_params(response: str) -> dict | None:
    match = re.search(r"```json\s*(.*?)\s*```", response, re.DOTALL)
    if not match:
        return None
    try:
        params = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(params, dict) or "channels" not in params:
        return None
    return params
```
This function uses a regex to find the first ```json ``` fenced code block in the LLM's response, parses it with json.loads(), and validates that it is a dictionary containing a "channels" key. If any step fails, it returns None and the response is treated as text-only (no form population).
The Python Generation Engine
The TimeSeriesGenerator class is the core of the system. It is a pure Python engine built on NumPy and Pandas that transforms validated parameters into time series data.
Method Chaining API
The generator uses a fluent builder pattern:
```python
ts = (
    TimeSeriesGenerator()
    .with_duration(days=7, time_step_seconds=60)
    .with_base_signal(mean=100, noise_amplitude=5.0)
    .with_trend(slope=0.5)
    .with_oscillation(frequency_hz=0.001, amplitude=10.0)
    .generate()
)
```
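The pattern itself is easy to see in miniature; this toy class (not the real generator) shows why each builder method returns `self`:

```python
# Minimal illustration of the fluent builder pattern -- a toy class,
# not the real TimeSeriesGenerator.
class TinyGenerator:
    def __init__(self):
        self._mean = 0.0
        self._slope = 0.0

    def with_base_signal(self, mean: float) -> "TinyGenerator":
        self._mean = mean
        return self  # returning self is what enables method chaining

    def with_trend(self, slope: float) -> "TinyGenerator":
        self._slope = slope
        return self

    def generate(self, n_points: int) -> list[float]:
        return [self._mean + self._slope * i for i in range(n_points)]

values = TinyGenerator().with_base_signal(100.0).with_trend(0.5).generate(3)
assert values == [100.0, 100.5, 101.0]
```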
Signal Superposition Model
The generated signal is a superposition of independent components:
```
signal[i] = mean + trend[i] + sum(oscillations[i]) + noise[i]
```
Each component is computed as a NumPy array operation:
| Component | Python Implementation |
|---|---|
| Mean | np.full(n_points, mean) |
| Trend | slope * np.arange(n_points) |
| Oscillation | amplitude * np.sin(2*pi*freq * t + phase) where t = np.arange(n_points) * time_step |
| Noise | np.random.normal(0, amplitude, n_points) |
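Combining the components in the table gives a minimal end-to-end sketch (arbitrary illustrative constants):

```python
import numpy as np

# Arbitrary illustrative constants; the component shapes mirror the table above.
n_points, time_step = 1_000, 60.0        # 1000 samples, one per minute
mean_value, slope = 100.0, 0.01
freq_hz, osc_amp, phase = 0.001, 10.0, 0.0
noise_amp = 5.0

t = np.arange(n_points) * time_step
mean = np.full(n_points, mean_value)
trend = slope * np.arange(n_points)
oscillation = osc_amp * np.sin(2 * np.pi * freq_hz * t + phase)
rng = np.random.default_rng(0)
noise = rng.normal(0, noise_amp, n_points)

# Superposition of all components, fully vectorized
signal = mean + trend + oscillation + noise
assert signal.shape == (n_points,)
```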
Multi-Channel Correlation via Cholesky Decomposition
When channels have specified correlations, the generator applies them using Cholesky decomposition. This is implemented entirely in NumPy:
```python
def _apply_correlation(self, signals, correlations):
    # 1. Build correlation matrix (identity + user-specified off-diagonal values)
    C = np.eye(n_channels)
    for corr in correlations:
        C[i, j] = C[j, i] = corr.correlation  # (i, j) identify the channel pair

    # 2. Validate positive semi-definiteness
    eigenvalues = np.linalg.eigvals(C)
    if not np.all(eigenvalues >= -1e-10):
        raise ValueError("Correlation matrix is not positive semi-definite")

    # 3. Normalize to zero mean, unit variance
    normalized = (signals - means) / stds

    # 4. Apply Cholesky factor
    L = np.linalg.cholesky(C)
    correlated = L @ normalized

    # 5. Denormalize back to original scale
    return correlated * stds + means
```
This ensures the output channels have the desired pairwise correlations while preserving their individual statistical properties (mean, standard deviation, oscillation patterns).
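The effect of the Cholesky mixing is easy to verify empirically; this standalone sketch (not the Phoenix code path) imposes a 0.8 correlation on two independent unit-variance channels:

```python
import numpy as np

# Verify that the Cholesky factor imposes the requested pairwise correlation
# on two initially independent unit-variance channels.
rng = np.random.default_rng(0)
n_points = 20_000
uncorrelated = rng.normal(0.0, 1.0, size=(2, n_points))  # shape: (channels, samples)

target = 0.8
C = np.array([[1.0, target],
              [target, 1.0]])
L = np.linalg.cholesky(C)
correlated = L @ uncorrelated  # mix channels so the rows become correlated

empirical = np.corrcoef(correlated)[0, 1]
assert abs(empirical - target) < 0.02  # sampling error shrinks as n_points grows
```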
Data Degradation
After generation, the engine can apply two forms of degradation to simulate real-world data quality issues:
Data point removal (simulating gaps/missing data):
```python
indices = np.random.choice(n_points, size=n_remove, replace=False)
df = df.drop(df.index[indices])   # Single-channel: drop rows (drop returns a new DataFrame)
df.iloc[indices, col] = np.nan    # Multi-channel: NaN out points per column
```
Outlier injection (simulating sensor anomalies):
- Constant value mode: replace selected points with a fixed value
- Random range mode: np.random.uniform(min, max, n_outliers)
- Factor mode: multiply existing values by a scaling factor
Degradation is applied per-channel in multi-channel mode, allowing different channels to have different quality characteristics (e.g., a raw sensor channel with outliers alongside a clean smoothed channel).
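Both degradation modes can be sketched standalone; the counts and value ranges below are arbitrary illustrations, not Phoenix defaults:

```python
import numpy as np

# Standalone sketch of both degradation modes on a synthetic channel.
rng = np.random.default_rng(0)
values = np.full(100, 50.0)

# Gaps: NaN out randomly chosen points (the multi-channel removal mode)
gap_idx = rng.choice(values.size, size=10, replace=False)
degraded = values.copy()
degraded[gap_idx] = np.nan

# Outliers, random-range mode: overwrite surviving points with uniform draws
candidates = np.flatnonzero(~np.isnan(degraded))
out_idx = rng.choice(candidates, size=5, replace=False)
degraded[out_idx] = rng.uniform(200.0, 300.0, size=5)

assert np.isnan(degraded).sum() == 10
assert (degraded >= 200).sum() == 5  # NaN comparisons are False, so only outliers count
```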
Pydantic Configuration Models
Every configuration aspect is modeled as a Pydantic BaseModel with validation:
| Model | Purpose | Key Validations |
|---|---|---|
| `TimeConfig` | Time axis specification | Mutually exclusive duration vs explicit times |
| `NoiseConfig` | Noise specification | Mutually exclusive amplitude vs SNR |
| `RemovalConfig` | Data removal specification | Mutually exclusive number vs percentage |
| `OutlierConfig` | Outlier specification | Mutually exclusive quantity modes and value modes |
| `OscillationConfig` | Single oscillation | Frequency/period reciprocal computation, non-negative amplitude |
| `ChannelConfig` | Per-channel settings | Non-negative noise amplitude |
| `ChannelCorrelation` | Pairwise correlation | Range -1 to 1, no self-correlation |
| `MultiChannelConfig` | Multi-channel container | 1-10 channels, valid correlation indices |
These models use @model_validator and @field_validator decorators for constraint enforcement at construction time.
Output Format
The generator returns a Pandas DataFrame with a DatetimeIndex:
- Single-channel: one column named `value`
- Multi-channel: one column per channel, named after the channel
This DataFrame is then serialized into Plotly.js trace format for preview, or stored as JSON in the database for persistence.
Validation Pipeline
Parameters pass through four layers of validation and authorization before reaching the generator:
Layer 1: LLM Prompt Constraints
The system prompt instructs the LLM to respect all constraints (max 10,000 points, max 10 channels, Nyquist compliance). This is a soft boundary -- the LLM usually complies, but its output is not guaranteed to be valid.
Layer 2: Django Form Validation (GenerateTimeSeriesForm)
Server-side validation that catches anything the LLM got wrong:
- Sampling frequency vs time step consistency
- Duration requirements
- Data point limit enforcement (`MAX_POINTS_PER_SERIES` = 10,000)
- Channel structure validation (name required, max 10 channels)
- Oscillation parameter validation
- Correlation matrix positive semi-definite check (eigenvalue verification)
- Removal/outlier mode consistency
- Aliasing detection (non-blocking warnings)
Layer 3: Pydantic Model Validation
The Pydantic models in generators.py enforce type and constraint validation at object construction time. This is the final guard before mathematical operations begin.
Layer 4: View-Level Authorization
Django views enforce:
- `@login_required` on all endpoints
- `@require_POST` on mutating endpoints
- User ownership checks for existing time series
- Per-user series count limits (`MAX_SERIES_PER_USER` = 3)
- CSRF protection
Security Model
No Code Execution
The most important security property of this architecture is that the LLM never generates executable code. The entire pipeline operates on validated numeric parameters:
LLM output (text) → JSON extraction (regex + json.loads) → Form validation → Pydantic models → NumPy operations
There is no exec(), eval(), subprocess, or dynamic code execution anywhere in the pipeline. The TimeSeriesGenerator only performs deterministic mathematical operations on validated inputs.
JSON Extraction Safety
The _extract_generation_params() function applies defensive parsing:
- Regex finds the first ```json fenced block (no arbitrary code execution)
- `json.loads()` parses the content (rejects non-JSON)
- Type check: `isinstance(params, dict)` (rejects arrays and primitives)
- Schema check: `"channels" in params` (rejects arbitrary JSON objects)
- Returns `None` on any failure (graceful degradation)
Input Boundaries
All numeric inputs are bounded by Pydantic validation before reaching NumPy:
- Array sizes are capped at 10,000 points
- Channel count is capped at 10
- Correlation values are bounded to [-1, 1]
- Amplitudes and frequencies must be non-negative
- Duration must be positive
No Persistent State in the Agent
The tsgen agent has no tools, no database access, and no file system access. It can only produce text. The conversation history is managed server-side and is scoped to the WebSocket session.
Why Parameterized Generation, Not Code Generation
An alternative architecture would have the LLM generate Python code (e.g., NumPy scripts) that gets executed directly. Phoenix deliberately avoids this for several reasons:
| Concern | Parameterized (Phoenix) | Code Generation |
|---|---|---|
| Security | No code execution; bounded numeric inputs | Requires sandboxing, code review, or restricted execution environments |
| Validation | Full server-side validation of every parameter | Difficult to validate arbitrary code for correctness and safety |
| Reproducibility | Parameters stored in DB; identical regeneration | Code must be stored and re-executed; environment-dependent |
| User Control | Users see and can modify every parameter in the form | Users must understand generated code to modify it |
| Error Handling | Validation errors map to specific form fields | Runtime errors in generated code are opaque |
| Flexibility Tradeoff | Limited to the superposition model | Can express any computable time series |
The flexibility tradeoff is intentional. The superposition model (mean + trend + oscillations + noise) covers the vast majority of synthetic time series use cases in industrial settings. The nonlinear trend approximation technique (long-period oscillations) extends coverage to exponential, logarithmic, and sigmoid shapes without requiring arbitrary code.