Data Degradation: Simulating Data Quality Issues
Phoenix lets you intentionally introduce data quality problems into your time series. This is essential for testing data cleaning pipelines, validating quality assurance algorithms, and preparing for real-world data challenges.
Why Simulate Bad Data?
Testing Data Cleaning Algorithms
Real sensor data often contains quality issues. Test your cleaning logic with controlled degradation:
- Missing data: Dropped measurements, transmission failures
- Outliers: Sensor malfunctions, interference, saturation
Validating FORGE Workflows
Before deploying data cleaning workflows in production:
1. Generate clean synthetic data
2. Add controlled degradation
3. Run through FORGE cleaning pipeline
4. Verify that cleaned data matches original
Training and Demonstration
Show stakeholders the importance of data quality:
- Before/after comparisons
- Impact of quality issues on analysis
- Value of cleaning algorithms
Algorithm Benchmarking
Compare data cleaning algorithms on standardized test cases:
- Same base signal
- Different degradation levels
- Measure cleaning effectiveness
Two Types of Data Degradation
Phoenix offers two independent degradation features:
- Data Point Removal - Creates gaps (missing data)
- Outlier Insertion - Adds corrupted values
You can use one, both, or neither.
Data Point Removal (Missing Data)
What It Does
Removes data points from the time series, creating gaps where measurements are missing. You control both how many points to remove and how those removals are distributed — scattered randomly or grouped into contiguous blocks.
Effect:
Before: [100, 101, 99, 102, 98, 100, 101]
After: [100, ---, 99, ---, 98, 100, ---] (random distribution)
[100, 101, ---, ---, ---, 100, 101] (gap distribution, 1 gap)
Timestamps of removed points are deleted — they don't appear as NaN or null.
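The removal behavior can be sketched in a few lines of Python. This is an illustrative sketch, not Phoenix's actual implementation — the function name and signature are hypothetical:

```python
import random

def remove_points(timestamps, values, fraction, seed=None):
    """Drop a random fraction of points. Removed timestamps vanish
    entirely -- they are not kept as NaN/null placeholders."""
    rng = random.Random(seed)
    n_remove = round(len(values) * fraction)
    drop = set(rng.sample(range(len(values)), n_remove))
    kept = [i for i in range(len(values)) if i not in drop]
    return [timestamps[i] for i in kept], [values[i] for i in kept]

# 5% of a 1000-point series -> 50 points removed, 950 remain
ts, vs = remove_points(list(range(1000)), [100.0] * 1000, 0.05, seed=1)
```

Note that the surviving timestamps stay in order; the gaps only show up as jumps in the time axis.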
Configuration Options
Quantity Mode: Number vs. Percentage
Controls the total number of points to remove.
Number - Remove exact count of points
Remove: 50 points
From 1000-point series → 50 points removed, 950 remain
From 500-point series → 50 points removed, 450 remain
Percentage - Remove proportion of points
Remove: 5%
From 1000-point series → 50 points removed, 950 remain
From 500-point series → 25 points removed, 475 remain
Choose Number When:
- Testing specific gap sizes
- Consistent absolute quantity needed
- Series length varies
Choose Percentage When:
- Testing proportional data loss
- Comparing across series of different durations
- Simulating percentage-based reliability specs
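The arithmetic behind the two modes is simple; a minimal sketch (the function name is illustrative):

```python
def points_to_remove(n_points, mode, amount):
    """Number mode removes a fixed count; percentage mode scales with length."""
    if mode == "number":
        return min(amount, n_points)        # cannot remove more than exist
    return round(n_points * amount / 100)   # percentage of series length

# mirrors the examples above
assert points_to_remove(1000, "percentage", 5) == 50
assert points_to_remove(500, "percentage", 5) == 25
assert points_to_remove(500, "number", 50) == 50
```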
Distribution Strategy: Random vs. Gaps
Controls how the removed points are placed within the series.
Random (default) - Removed points are selected individually at random:
- Each point has equal probability of removal
- No clustering or patterns
- Uniform distribution over time
- Simulates independent random dropouts (e.g., packet loss)
Gaps - Removed points are grouped into contiguous blocks:
- Specify number of gaps (e.g., 3)
- Total points to remove are randomly distributed across the gaps
- Each gap is placed at a random position in the series
- Simulates sustained outages, communication blackouts, or planned downtime windows
Gaps mode example:
Total removal: 10% of 1000 points = 100 points
Number of gaps: 3
Result:
Gap 1: 35 points at position 120
Gap 2: 41 points at position 450
Gap 3: 24 points at position 780
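The gaps-mode behavior in the example above can be sketched as follows. This is a simplified illustration (not Phoenix's code): it splits the removal budget into random block sizes and places each block at a random position; overlapping blocks merge, so slightly fewer than the requested total may be removed.

```python
import random

def gap_indices(n_points, total_remove, n_gaps, seed=None):
    """Pick indices to remove as n_gaps contiguous blocks at random positions."""
    rng = random.Random(seed)
    # split total_remove into n_gaps random positive block sizes
    cuts = sorted(rng.sample(range(1, total_remove), n_gaps - 1))
    sizes = [b - a for a, b in zip([0] + cuts, cuts + [total_remove])]
    drop = set()
    for size in sizes:
        start = rng.randrange(0, n_points - size)  # random block position
        drop.update(range(start, start + size))
    return drop

drop = gap_indices(1000, 100, 3, seed=42)
```

The number of contiguous runs in `drop` is at most `n_gaps` (fewer if blocks happen to overlap).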
Choose Random When
- Simulating independent packet-by-packet loss
- Testing algorithms that handle sparse missing data
- Representing random sensor dropouts
Choose Gaps When
- Simulating power outages, network blackouts, or planned maintenance windows
- Testing algorithms designed to detect or impute contiguous missing sections
- Representing failures with a recovery time (e.g., a sensor reboots, a link restores)
Use Cases
Transmission Failures (Random)
Remove: 2-5%, Distribution: Random
Simulates: Wireless sensor network packet loss
Sensor Dropouts (Random)
Remove: 1-3%, Distribution: Random
Simulates: Intermittent sensor connection issues
Network Blackout (Gaps)
Remove: 10-20%, Distribution: Gaps, Num Gaps: 1-3
Simulates: Communication link failure and recovery
Scheduled Maintenance Windows (Gaps)
Remove: 5-15%, Distribution: Gaps, Num Gaps: 2-4
Simulates: Known maintenance periods with no data collection
Data Logger Failures (Gaps)
Remove: 10-20%, Distribution: Gaps, Num Gaps: 3-5
Simulates: Logger crashes with extended downtime before restart
Severe Communication Issues (Random or Gaps)
Remove: 30-50%
Simulates: Heavy RF interference (Random) or failing hardware (Gaps)
Step-by-Step Configuration
- Scroll to "Data Degradation" section in sidebar
- Find "Data Point Removal" subsection
- Choose quantity mode:
- Select "Number" and enter count (e.g., 100)
- OR select "Percentage" and enter % (e.g., 5)
- Choose distribution strategy:
- Select "Random" for scattered individual removals (default)
- Select "Gaps" for contiguous blocks, then enter number of gaps
- Preview to see effect
[Screenshot Required: Data Point Removal Configuration]
1. Configure:
   - Base signal: Mean=100, Noise=3
   - Duration: 10 minutes, Sampling: 1 Hz
   - Data Removal: 5% (percentage mode)
2. Capture: Data Degradation section showing removal config
3. Purpose: Show removal configuration interface
[Screenshot Required: Before/After Removal]
1. Generate clean signal (no removal)
2. Screenshot 1: Clean signal
3. Add 10% removal
4. Screenshot 2: Signal with gaps
5. Purpose: Visually demonstrate removal effect
Outlier Insertion (Corrupted Values)
What It Does
Replaces existing data point values with anomalous values, simulating sensor malfunctions, saturation, or interference.
Effect:
Before: [100, 101, 99, 102, 98, 100, 101]
After: [100, 9999, 99, 102, 9999, 100, 101]
Timestamps remain unchanged - only values are replaced.
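The replacement mechanism can be sketched generically — the function takes a value-generating callable, which the three value modes below plug into. Names are illustrative, not Phoenix's API:

```python
import random

def insert_outliers(values, n_outliers, make_value, seed=None):
    """Replace randomly chosen values; timestamps are untouched."""
    rng = random.Random(seed)
    out = list(values)
    for i in rng.sample(range(len(out)), n_outliers):
        out[i] = make_value(out[i])  # value mode decides the new value
    return out

# constant-value mode: every outlier becomes 9999
corrupted = insert_outliers([100, 101, 99, 102, 98, 100, 101], 2,
                            lambda v: 9999, seed=0)
```

The series length never changes; only the values at the selected positions do.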
Configuration Options
Quantity Mode: Number vs. Percentage
Same as data removal:
Number - Insert exact count of outliers
Insert: 20 outliers
Percentage - Insert proportion of total points as outliers
Insert: 2% outliers
Value Mode: Three Methods
Phoenix offers three ways to generate outlier values:
1. Constant Value
Replace with a fixed value (same for all outliers).
Configuration:
Value Mode: Constant Value
Constant Value: 9999
Effect:
Original: [100, 101, 99, 102, 98]
Outliers: [100, 9999, 99, 9999, 98]
Use Cases:
- Sensor saturation: Maximum reading (e.g., 9999, -9999)
- Error codes: Specific values indicating failure
- Out-of-range: Physically impossible values
- Sentinel values: Traditional missing data markers
Examples:
Temperature sensor saturates at max: 9999
Pressure sensor error code: -1
Voltage sensor open circuit: 0
Flow meter stopped: 0
[Screenshot Instructions: Constant Outliers]
1. Configure:
   - Mean: 50, Noise: 2
   - Duration: 5 minutes, Sampling: 1 Hz
   - Outliers: 3% (percentage), Constant Value: 999
2. Preview
3. Capture: Chart showing signal with obvious 999 spikes
4. Purpose: Demonstrate constant value outliers
2. Random Range
Replace with random values between min and max.
Configuration:
Value Mode: Random Range
Range Min: -50
Range Max: 500
Effect:
Original: [100, 101, 99, 102, 98]
Outliers: [100, 237, 99, -18, 98]
(random between -50 and 500)
Use Cases:
- Electrical interference: Random noise spikes
- Cross-talk: Values from other sensor ranges
- A/D converter errors: Random bit flips
- Testing robustness: Wide variety of outlier values
Examples:
EMI spikes on 0-100 signal:
Range: -200 to 300
Cross-talk from 0-1000 sensor to 0-100 sensor:
Range: 200 to 800
Random bit flips:
Range: 0 to 65535 (16-bit full scale)
[Screenshot Instructions: Random Range Outliers]
1. Configure:
   - Mean: 100, Noise: 5
   - Duration: 5 minutes, Sampling: 1 Hz
   - Outliers: 5%, Random Range: 300 to 500
2. Preview
3. Capture: Chart showing signal with random spikes in the 300-500 range
4. Purpose: Demonstrate random range outliers
3. Factor Multiplication
Multiply existing values by a factor.
Configuration:
Value Mode: Factor Multiplication
Multiplication Factor: 10
Effect:
Original: [100, 101, 99, 102, 98]
Outliers: [100, 1010, 99, 1020, 98]
(selected values × 10)
Use Cases:
- Gain errors: Amplifier malfunction (×10, ×100)
- Unit errors: Wrong calibration (×1000 for mm→m)
- Scale errors: Decimal point shifts
- Proportional drift: Multiplicative sensor drift
Examples:
Amplifier gain error (10× instead of 1×):
Factor: 10
Unit conversion error (kPa entered as Pa):
Factor: 0.001
Decimal point error:
Factor: 100
Special Cases:
Factor < 1: Attenuates values (gain loss)
Factor: 0.1 → values become 10% of original
Factor < 0: Inverts values (rare in practice)
Factor: -1 → sign flip
Factor = 1: No change (identity, not useful)
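The three value modes reduce to three small value-generating functions. These are illustrative sketches using the example parameters from this section, not Phoenix's internal code:

```python
import random

rng = random.Random(0)

def constant_mode(v):
    """Sensor saturation / error code: every outlier is the same value."""
    return 9999

def range_mode(v, lo=-50, hi=500):
    """EMI-style spike: a uniform random value in [lo, hi]."""
    return rng.uniform(lo, hi)

def factor_mode(v, factor=10):
    """Gain error: the original value scaled by a fixed factor."""
    return v * factor
```

Only factor mode depends on the original value `v`, which is why it preserves the signal's shape while exaggerating it.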
[Screenshot Instructions: Factor Multiplication Outliers]
1. Configure:
   - Mean: 50, Noise: 3
   - Duration: 5 minutes, Sampling: 1 Hz
   - Outliers: 2%, Factor: 5
2. Preview
3. Capture: Chart showing signal with occasional 5× spikes
4. Purpose: Demonstrate factor multiplication outliers
How Outliers Are Selected
Like data removal, outliers are inserted at random timestamps:
- Uniform distribution over time
- Each point has equal probability
- No clustering patterns
Combining Removal and Outliers
You can use both degradation types simultaneously:
Example Configuration:
Data Removal: 5% (percentage)
Outliers: 2% (percentage), Constant: 9999
From 1000 points:
- 50 points removed (gaps)
- 20 points replaced with 9999 (outliers)
- 930 points remain clean
Application Order:
1. Generate clean signal
2. Remove data points (creates gaps)
3. Insert outliers into remaining points
Note: Outliers are only inserted into points that weren't removed.
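The application order can be sketched end to end — removal first, then outliers drawn only from the surviving points, with both quantities computed from the original series length (matching the 1000-point example above). Names are illustrative:

```python
import random

def degrade(timestamps, values, remove_frac, outlier_frac, outlier_value,
            seed=None):
    """Remove points first, then insert outliers into the surviving points."""
    rng = random.Random(seed)
    n = len(values)
    # step 1: removal (quantity based on original length)
    drop = set(rng.sample(range(n), round(n * remove_frac)))
    ts = [t for i, t in enumerate(timestamps) if i not in drop]
    vs = [v for i, v in enumerate(values) if i not in drop]
    # step 2: outliers, drawn only from the points that survived
    for i in rng.sample(range(len(vs)), round(n * outlier_frac)):
        vs[i] = outlier_value
    return ts, vs

# 5% removal + 2% outliers on 1000 points: 950 remain, 20 of them are 9999
ts, vs = degrade(list(range(1000)), [100.0] * 1000, 0.05, 0.02, 9999, seed=7)
```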
[Screenshot Instructions: Combined Degradation]
1. Configure:
   - Mean: 100, Noise: 5
   - Oscillation: 0.1 Hz, Amplitude: 20
   - Duration: 10 minutes, Sampling: 1 Hz
   - Data Removal: 8%
   - Outliers: 3%, Constant: 999
2. Preview
3. Capture: Chart showing both gaps and outliers
4. Purpose: Demonstrate realistic badly-degraded data
Multi-Channel Considerations
For multi-channel time series:
Data Removal
- Applied once to all channels
- Same timestamps removed from all channels
- Maintains synchronization across channels
Effect:
Before:
Time Ch1 Ch2 Ch3
0.0 10 20 30
1.0 11 21 31
2.0 12 22 32
After (remove 1.0):
Time Ch1 Ch2 Ch3
0.0 10 20 30
2.0 12 22 32
Outlier Insertion
- Applied independently to each channel
- Different points affected in each channel
- Outlier value method applies to all channels
Effect (constant value 999):
Before:
Time Ch1 Ch2 Ch3
0.0 10 20 30
1.0 11 21 31
2.0 12 22 32
After (2 outliers):
Time Ch1 Ch2 Ch3
0.0 10 999 30
1.0 999 21 31
2.0 12 22 999
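The multi-channel rules — one shared removal draw, independent outlier draws — can be sketched as follows (hypothetical function, not Phoenix's API):

```python
import random

def degrade_channels(timestamps, channels, remove_frac, outlier_frac,
                     outlier_value, seed=0):
    """Removal uses one shared draw so channels stay synchronized;
    outliers use a fresh draw per channel."""
    rng = random.Random(seed)
    n = len(timestamps)
    # one removal draw applied to every channel
    drop = set(rng.sample(range(n), round(n * remove_frac)))
    kept = [i for i in range(n) if i not in drop]
    ts = [timestamps[i] for i in kept]
    out = []
    for ch in channels:
        vs = [ch[i] for i in kept]
        # independent outlier positions for each channel
        for j in rng.sample(range(len(vs)), round(n * outlier_frac)):
            vs[j] = outlier_value
        out.append(vs)
    return ts, out
```

Because every channel is indexed by the same `kept` list, the output channels all share one timestamp axis, as in the tables above.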
Testing Scenarios
Scenario 1: Light Degradation
Goal: Test algorithms on realistic sensor data
Configuration:
Data Removal: 1-2%
Outliers: 0.5-1%, Constant: 9999
Simulates: High-quality sensors with occasional issues
Scenario 2: Moderate Degradation
Goal: Test robustness of cleaning algorithms
Configuration:
Data Removal: 5-10%
Outliers: 2-5%, Random Range: 0 to 10× mean
Simulates: Typical industrial sensor network
Scenario 3: Heavy Degradation
Goal: Stress test data pipelines
Configuration:
Data Removal: 20-30%
Outliers: 10-15%, Constant: 9999
Simulates: Failing sensors, severe environmental interference
Scenario 4: Specific Fault Types
Sensor Saturation:
Outliers: 5%, Constant: 9999
Removal: 0%
Communication Failures:
Removal: 15%
Outliers: 0%
EMI/RFI Interference:
Outliers: 10%, Random Range: -1000 to 1000
Removal: 3%
Calibration Drift:
Outliers: 20%, Factor: 1.5 (50% gain error)
Removal: 0%
Best Practices
Start Conservative
Begin with low degradation levels:
1. Generate clean signal
2. Add 1% degradation
3. Verify cleaning algorithms work
4. Gradually increase degradation
Match Reality
Base degradation levels on actual sensor specs:
- Sensor datasheets (reliability %)
- Historical data quality metrics
- Field experience
Document Degradation Parameters
When saving, include degradation info in description:
Good: "Motor vibration, 5% missing, 2% outliers at 9999"
Poor: "Test data"
Test Multiple Levels
Create multiple versions with increasing degradation:
Series 1: Clean (0% degradation)
Series 2: Light (2% missing, 1% outliers)
Series 3: Moderate (10% missing, 5% outliers)
Series 4: Heavy (30% missing, 15% outliers)
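The four-level scheme above can be captured as a small parameter table that turns fractions into point counts for a given series length (the dict layout is illustrative; fractions are taken from the list):

```python
n = 1000  # points per series

levels = {
    "clean":    (0.00, 0.00),  # (missing fraction, outlier fraction)
    "light":    (0.02, 0.01),
    "moderate": (0.10, 0.05),
    "heavy":    (0.30, 0.15),
}

plan = {name: {"removed": round(n * r), "outliers": round(n * o)}
        for name, (r, o) in levels.items()}
# e.g. plan["moderate"] == {"removed": 100, "outliers": 50}
```

Keeping the levels in one table makes it easy to regenerate the whole test suite when the base signal changes.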
Use Realistic Outlier Values
Good Outlier Values:
- Sensor maximum: 9999, -9999
- Physical impossibility: -273°C (below absolute zero)
- Out of range: 150°C for a 0-100°C sensor
Poor Outlier Values:
- Too close to range: 105 for a 0-100°C sensor (hard to distinguish from drift)
- Too subtle: 101 for a mean=100, noise=5 signal
Consider Your Use Case
- Algorithm Development: Use consistent degradation
- Robustness Testing: Use variable, random degradation
- Benchmarking: Use standardized degradation levels
Troubleshooting
Can't see removed points in chart
Expected Behavior: Removed points disappear completely (gaps in data)
How to Verify:
- Check statistics: point count should be reduced
- Look for gaps in the time axis (timestamps skip)
- Export CSV: missing rows for removed timestamps
If not seeing the effect:
- Check the removal percentage isn't too low (< 1%)
- Verify you clicked "Preview" after configuring
- Zoom in on the chart to see gaps
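Removal can also be verified programmatically by scanning exported timestamps for jumps larger than the sampling interval. A minimal sketch (column extraction from your CSV is left out; the function name is illustrative):

```python
def find_gaps(timestamps, expected_dt, tol=1e-9):
    """Return (start, end) pairs where consecutive timestamps jump by more
    than the expected sampling interval, i.e. where points were removed."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:])
            if b - a > expected_dt + tol]

# a 1 Hz series with the points at t=3.0 and t=4.0 removed
assert find_gaps([0.0, 1.0, 2.0, 5.0, 6.0], 1.0) == [(2.0, 5.0)]
```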
Can't see outliers in chart
Possible Causes:
1. Outlier value too close to signal
Problem: Mean=100, Noise=10, Outlier=120
Fix: Use 999 or -999 (clearly distinct)
2. Too few outliers
Problem: 0.1% of 100 points = 0 outliers
Fix: Use at least 1% or absolute number
3. Y-axis auto-scale obscures outliers
Problem: Rare outliers compress main signal
Fix: Zoom in or check statistics (min/max will show outliers)
Outliers look wrong
Check Configuration:
- Verify value mode (constant/range/factor)
- Check the constant value is as expected
- For range: verify min < max
- For factor: verify factor ≠ 1
Error: "Cannot remove more points than exist"
Cause: Trying to remove more points than available
Examples:
Problem: Remove 150 points from 100-point series
Fix: Reduce the count below 100 (e.g., 50) or use percentage mode
Problem: Remove 110% of points
Fix: Use ≤ 100%
Degradation appears to be applied multiple times
Concern: Clicking "Preview" repeatedly might stack degradation
Reality:
- Each preview generates fresh data
- Degradation is applied to the clean signal each time
- Effects are not cumulative
Advanced Techniques
Creating Realistic Gap Patterns
Use the Gaps distribution strategy to generate contiguous missing-data blocks in a single step. Configure the number of gaps to match your target failure mode — a single large gap for a prolonged outage, multiple smaller gaps for intermittent blackouts. For strictly periodic patterns (e.g., every 6 hours exactly), generate first and post-process externally.
Combining with FORGE Testing
Workflow:
1. Generate clean data in Phoenix
2. Save with an identifiable name
3. Generate a degraded version with the same parameters plus degradation
4. Save the degraded version
5. Run the degraded series through FORGE cleaning
6. Compare the FORGE output with the clean original
7. Measure cleaning effectiveness
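Step 7 needs a concrete effectiveness metric. Root-mean-square error between the clean original and the cleaned output is a common choice — a sketch under the assumption that the cleaning pipeline restores all timestamps (FORGE itself may report different metrics):

```python
import math

def rmse(clean, cleaned):
    """Root-mean-square error between the clean original and pipeline output."""
    assert len(clean) == len(cleaned)  # cleaning must restore all timestamps
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(clean, cleaned)) / len(clean))

# perfect cleaning recovers the original exactly
assert rmse([100.0, 101.0, 99.0], [100.0, 101.0, 99.0]) == 0.0
```

Lower RMSE across the degradation levels indicates a more effective cleaning pipeline.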
Multiple Outlier Types
Phoenix supports one outlier type per generation. For multiple types:
Option 1: Generate multiple series
Series A: 2% constant outliers (9999)
Series B: 3% random range outliers
Series C: 1% factor outliers (×10)
Option 2: External post-processing
- Generate with one type
- Export
- Add additional outlier types externally
- Reimport to FORGE/CEREBRO
Controlled Degradation Placement
Phoenix uses random selection. For specific timestamp targeting:
1. Generate clean data
2. Export to CSV
3. Manually edit specific rows
4. Reimport
Real-World Examples
Example 1: Temperature Sensor Network
Scenario: 4 temperature sensors, wireless network with 98% reliability
Configuration:
Channels: 4 (correlated 0.85-0.9)
Duration: 24 hours
Sampling: 0.01 Hz (1 sample per 100 seconds)
Data Removal: 2% (simulates 98% reliability)
Outliers: 0.5%, Constant: -999 (sensor fault code)
[Screenshot Instructions: Sensor Network]
1. Configure 4 channels as above
2. Add degradation
3. Duration: 2 hours, Sampling: 0.01 Hz
4. Preview
5. Capture: Multi-channel chart with gaps and outliers
6. Purpose: Realistic sensor network scenario
Example 2: Vibration Monitor with EMI
Scenario: Accelerometer near high-voltage equipment
Configuration:
Signal: 30 Hz vibration, Amplitude: 2, Noise: 0.2
Duration: 10 seconds
Sampling: 200 Hz
Outliers: 5%, Random Range: -50 to 50 (EMI spikes)
Data Removal: 1% (rare sensor dropouts)
Example 3: Pressure Sensor with Saturation
Scenario: Pressure sensor periodically hits maximum reading
Configuration:
Signal: Mean=150 kPa, Noise=5, Oscillation: 0.1 Hz/Amp=30
Duration: 30 minutes
Sampling: 1 Hz
Outliers: 3%, Constant: 999 (sensor max)
Data Removal: 0%
Summary
Phoenix data degradation features allow comprehensive testing of data quality issues:
Data Point Removal:
- Creates gaps (missing timestamps)
- Number or percentage mode controls total points removed
- Two distribution strategies: Random (scattered) or Gaps (contiguous blocks)
- Gaps mode: specify the number of blocks; sizes and positions are randomised
- Applied once to all channels
Outlier Insertion:
- Corrupts values (preserves timestamps)
- Three value modes: constant, random range, factor
- Number or percentage mode
- Independent per channel (multi-channel)
Best Practices:
- Start with low degradation levels
- Match real-world sensor specifications
- Document degradation parameters
- Test cleaning algorithms incrementally
Next Steps
- Export and Save - Save degraded data for testing
- Multi-Channel - Apply degradation to multi-sensor data
- Basic Usage - Review complete signal generation workflow
- Technical Reference - Degradation algorithms and formulas
Data degradation is essential for building robust data processing pipelines that handle real-world data quality challenges.