Data Degradation: Simulating Data Quality Issues

Phoenix allows you to intentionally introduce data quality problems into your time series, essential for testing data cleaning pipelines, validating quality assurance algorithms, and preparing for real-world data challenges.

Why Simulate Bad Data?

Testing Data Cleaning Algorithms

Real sensor data often contains quality issues. Test your cleaning logic with controlled degradation:

  • Missing data: Dropped measurements, transmission failures
  • Outliers: Sensor malfunctions, interference, saturation

Validating FORGE Workflows

Before deploying data cleaning workflows in production:

  1. Generate clean synthetic data
  2. Add controlled degradation
  3. Run through FORGE cleaning pipeline
  4. Verify that cleaned data matches the original

Training and Demonstration

Show stakeholders the importance of data quality:

  • Before/after comparisons
  • Impact of quality issues on analysis
  • Value of cleaning algorithms

Algorithm Benchmarking

Compare data cleaning algorithms on standardized test cases:

  • Same base signal
  • Different degradation levels
  • Measure cleaning effectiveness

Two Types of Data Degradation

Phoenix offers two independent degradation features:

  1. Data Point Removal - Creates gaps (missing data)
  2. Outlier Insertion - Adds corrupted values

You can use one, both, or neither.

Data Point Removal (Missing Data)

What It Does

Removes data points from the time series, creating gaps where measurements are missing. You control both how many points to remove and how those removals are distributed — scattered randomly or grouped into contiguous blocks.

Effect:

Before:  [100, 101, 99, 102, 98, 100, 101]
After:   [100, ---, 99, ---, 98, 100, ---]  (random distribution)
         [100, 101, ---, ---, ---, 100, 101] (gap distribution, 1 gap)

Timestamps of removed points are deleted — they don't appear as NaN or null.
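The removal behavior above can be sketched in a few lines of Python. The helper name `remove_points_random` and its signature are illustrative assumptions, not Phoenix's actual implementation:

```python
import random

def remove_points_random(timestamps, values, fraction, seed=None):
    """Drop a fraction of points at random. Removed timestamps vanish
    entirely rather than becoming NaN placeholders."""
    rng = random.Random(seed)
    n = len(values)
    drop = set(rng.sample(range(n), round(n * fraction)))
    kept = [(t, v) for i, (t, v) in enumerate(zip(timestamps, values))
            if i not in drop]
    return [t for t, _ in kept], [v for _, v in kept]

ts, vs = remove_points_random(list(range(10)), [100.0] * 10, 0.3, seed=1)
# 3 of 10 points dropped; survivors keep their original order and timestamps
```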

Configuration Options

Quantity Mode: Number vs. Percentage

Controls the total number of points to remove.

Number - Remove exact count of points

Remove: 50 points
From 1000-point series → 50 points removed, 950 remain
From 500-point series → 50 points removed, 450 remain

Percentage - Remove proportion of points

Remove: 5%
From 1000-point series → 50 points removed, 950 remain
From 500-point series → 25 points removed, 475 remain

Choose Number When:

  • Testing specific gap sizes
  • A consistent absolute quantity is needed
  • Series length varies but the removal count must not

Choose Percentage When:

  • Testing proportional data loss
  • Comparing across series of different durations
  • Simulating percentage-based reliability specs
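Resolving the two quantity modes to an absolute count might look like the sketch below. The helper name and rounding rule are assumptions, not the Phoenix implementation:

```python
def points_to_remove(series_len, mode, amount):
    """Resolve a 'number' or 'percentage' setting to an absolute count."""
    if mode == "number":
        count = int(amount)
    elif mode == "percentage":
        count = round(series_len * amount / 100)
    else:
        raise ValueError(f"unknown quantity mode: {mode}")
    if count > series_len:
        raise ValueError("cannot remove more points than exist")
    return count

points_to_remove(1000, "percentage", 5)   # 50
points_to_remove(500, "percentage", 5)    # 25
points_to_remove(500, "number", 50)       # 50 regardless of series length
```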

Distribution Strategy: Random vs. Gaps

Controls how the removed points are placed within the series.

Random (default) - Removed points are selected individually at random:

  • Each point has equal probability of removal
  • No clustering or patterns
  • Uniform distribution over time
  • Simulates independent random dropouts (e.g., packet loss)

Gaps - Removed points are grouped into contiguous blocks:

  • Specify number of gaps (e.g., 3)
  • Total points to remove are randomly distributed across the gaps
  • Each gap is placed at a random position in the series
  • Simulates sustained outages, communication blackouts, or planned downtime windows

Gaps mode example:

Total removal: 10% of 1000 points = 100 points
Number of gaps: 3

Result:
  Gap 1: 35 points at position 120
  Gap 2: 41 points at position 450
  Gap 3: 24 points at position 780
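The gaps example above can be sketched as follows. This simplified version randomly partitions the removal budget and does not prevent blocks from overlapping; Phoenix's real placement logic may differ:

```python
import random

def make_gaps(series_len, total_remove, num_gaps, seed=None):
    """Split total_remove into num_gaps contiguous blocks at random
    positions, returned as half-open (start, stop) index ranges."""
    rng = random.Random(seed)
    # random partition of total_remove into num_gaps positive sizes
    cuts = sorted(rng.sample(range(1, total_remove), num_gaps - 1))
    sizes = [b - a for a, b in zip([0] + cuts, cuts + [total_remove])]
    return [(start, start + size)
            for size in sizes
            for start in [rng.randrange(0, series_len - size)]]

gaps = make_gaps(1000, 100, 3, seed=7)
# three blocks whose lengths sum to 100, each inside the series bounds
```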

Choose Random When

  • Simulating independent packet-by-packet loss
  • Testing algorithms that handle sparse missing data
  • Representing random sensor dropouts

Choose Gaps When

  • Simulating power outages, network blackouts, or planned maintenance windows
  • Testing algorithms designed to detect or impute contiguous missing sections
  • Representing failures with a recovery time (e.g., a sensor reboots, a link restores)

Use Cases

Transmission Failures (Random)

Remove: 2-5%, Distribution: Random
Simulates: Wireless sensor network packet loss

Sensor Dropouts (Random)

Remove: 1-3%, Distribution: Random
Simulates: Intermittent sensor connection issues

Network Blackout (Gaps)

Remove: 10-20%, Distribution: Gaps, Num Gaps: 1-3
Simulates: Communication link failure and recovery

Scheduled Maintenance Windows (Gaps)

Remove: 5-15%, Distribution: Gaps, Num Gaps: 2-4
Simulates: Known maintenance periods with no data collection

Data Logger Failures (Gaps)

Remove: 10-20%, Distribution: Gaps, Num Gaps: 3-5
Simulates: Logger crashes with extended downtime before restart

Severe Communication Issues (Random or Gaps)

Remove: 30-50%
Simulates: Heavy RF interference (Random) or failing hardware (Gaps)

Step-by-Step Configuration

  1. Scroll to the "Data Degradation" section in the sidebar
  2. Find the "Data Point Removal" subsection
  3. Choose quantity mode:
       • Select "Number" and enter a count (e.g., 100), OR
       • Select "Percentage" and enter a % (e.g., 5)
  4. Choose distribution strategy:
       • Select "Random" for scattered individual removals (default), OR
       • Select "Gaps" for contiguous blocks, then enter the number of gaps
  5. Preview to see the effect

[Screenshot Required: Data Point Removal Configuration]

  1. Configure:
       • Base signal: Mean=100, Noise=3
       • Duration: 10 minutes, Sampling: 1 Hz
       • Data Removal: 5% (percentage mode)
  2. Capture: Data Degradation section showing removal config
  3. Purpose: Show removal configuration interface

[Screenshot Required: Before/After Removal]

  1. Generate clean signal (no removal)
  2. Screenshot 1: Clean signal
  3. Add 10% removal
  4. Screenshot 2: Signal with gaps
  5. Purpose: Visually demonstrate removal effect

Outlier Insertion (Corrupted Values)

What It Does

Replaces existing data point values with anomalous values, simulating sensor malfunctions, saturation, or interference.

Effect:

Before:  [100, 101, 99, 102, 98, 100, 101]
After:   [100, 9999, 99, 102, 9999, 100, 101]

Timestamps remain unchanged - only values are replaced.

Configuration Options

Quantity Mode: Number vs. Percentage

Same as data removal:

Number - Insert exact count of outliers

Insert: 20 outliers

Percentage - Insert proportion of total points as outliers

Insert: 2% outliers

Value Mode: Three Methods

Phoenix offers three ways to generate outlier values:

1. Constant Value

Replace with a fixed value (same for all outliers).

Configuration:

Value Mode: Constant Value
Constant Value: 9999

Effect:

Original: [100, 101, 99, 102, 98]
Outliers: [100, 9999, 99, 9999, 98]

Use Cases:

  • Sensor saturation: Maximum reading (e.g., 9999, -9999)
  • Error codes: Specific values indicating failure
  • Out-of-range: Physically impossible values
  • Sentinel values: Traditional missing data markers

Examples:

Temperature sensor saturates at max: 9999
Pressure sensor error code: -1
Voltage sensor open circuit: 0
Flow meter stopped: 0

[Screenshot Instructions: Constant Outliers]

  1. Configure:
       • Mean: 50, Noise: 2
       • Duration: 5 minutes, Sampling: 1 Hz
       • Outliers: 3% (percentage), Constant Value: 999
  2. Preview
  3. Capture: Chart showing signal with obvious 999 spikes
  4. Purpose: Demonstrate constant value outliers

2. Random Range

Replace with random values between min and max.

Configuration:

Value Mode: Random Range
Range Min: -50
Range Max: 500

Effect:

Original: [100, 101, 99, 102, 98]
Outliers: [100, 237, 99, -18, 98]
         (random between -50 and 500)

Use Cases:

  • Electrical interference: Random noise spikes
  • Cross-talk: Values from other sensor ranges
  • A/D converter errors: Random bit flips
  • Testing robustness: Wide variety of outlier values

Examples:

EMI spikes on 0-100 signal:
  Range: -200 to 300

Cross-talk from 0-1000 sensor to 0-100 sensor:
  Range: 200 to 800

Random bit flips:
  Range: 0 to 65535 (16-bit full scale)

[Screenshot Instructions: Random Range Outliers]

  1. Configure:
       • Mean: 100, Noise: 5
       • Duration: 5 minutes, Sampling: 1 Hz
       • Outliers: 5%, Random Range: 300 to 500
  2. Preview
  3. Capture: Chart showing signal with random spikes in 300-500 range
  4. Purpose: Demonstrate random range outliers

3. Factor Multiplication

Multiply existing values by a factor.

Configuration:

Value Mode: Factor Multiplication
Multiplication Factor: 10

Effect:

Original: [100, 101, 99, 102, 98]
Outliers: [100, 1010, 99, 1020, 98]
         (selected values × 10)

Use Cases:

  • Gain errors: Amplifier malfunction (×10, ×100)
  • Unit errors: Wrong calibration (×1000 for mm→m)
  • Scale errors: Decimal point shifts
  • Proportional drift: Multiplicative sensor drift

Examples:

Amplifier gain error (10× instead of 1×):
  Factor: 10

Unit conversion error (kPa entered as Pa):
  Factor: 0.001

Decimal point error:
  Factor: 100

Special Cases:

Factor < 1: Attenuates values (gain loss)
  Factor: 0.1 → values become 10% of original

Factor < 0: Inverts values (rare in practice)
  Factor: -1 → sign flip

Factor = 1: No change (identity, not useful)
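The three value modes can be summarized in one hypothetical helper. The function name, signature, and mode strings are illustrative assumptions, not the Phoenix API:

```python
import random

def outlier_value(original, mode, *, constant=None, lo=None, hi=None,
                  factor=None, rng=random):
    """Compute a replacement value under one of the three value modes."""
    if mode == "constant":
        return constant                 # fixed sentinel, e.g. 9999
    if mode == "range":
        return rng.uniform(lo, hi)      # uniform draw between min and max
    if mode == "factor":
        return original * factor        # proportional corruption, e.g. ×10
    raise ValueError(f"unknown value mode: {mode}")

outlier_value(100, "constant", constant=9999)   # 9999
outlier_value(100, "factor", factor=10)         # 1000
outlier_value(100, "range", lo=-50, hi=500)     # somewhere in [-50, 500]
```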

[Screenshot Instructions: Factor Multiplication Outliers] 1. Configure: - Mean: 50, Noise: 3 - Duration: 5 minutes, Sampling: 1 Hz - Outliers: 2%, Factor: 5 2. Preview 3. Capture: Chart showing signal with occasional 5× spikes 4. Purpose: Demonstrate factor multiplication outliers

How Outliers Are Selected

Like data removal, outliers are inserted at random timestamps:

  • Uniform distribution over time
  • Each point has equal probability
  • No clustering patterns

Combining Removal and Outliers

You can use both degradation types simultaneously:

Example Configuration:

Data Removal: 5% (percentage)
Outliers: 2% (percentage), Constant: 9999

From 1000 points:
- 50 points removed (gaps)
- 20 points replaced with 9999 (outliers)
- 930 points remain clean

Application Order:

  1. Generate clean signal
  2. Remove data points (creates gaps)
  3. Insert outliers into remaining points

Note: Outliers are only inserted into points that weren't removed.
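The application order can be sketched end to end: removal first, then outliers into the surviving points only. This is an illustrative model, not Phoenix internals:

```python
import random

def degrade(values, remove_frac, outlier_frac, sentinel, seed=0):
    """Remove points, then replace a fraction of the survivors with a
    sentinel outlier value."""
    rng = random.Random(seed)
    n = len(values)
    removed = set(rng.sample(range(n), round(n * remove_frac)))
    out = [v for i, v in enumerate(values) if i not in removed]
    for i in rng.sample(range(len(out)), round(len(out) * outlier_frac)):
        out[i] = sentinel
    return out

bad = degrade([100.0] * 1000, 0.05, 0.02, 9999.0)
# 950 points survive removal; 19 of those (2% of 950) become 9999
```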

[Screenshot Instructions: Combined Degradation]

  1. Configure:
       • Mean: 100, Noise: 5
       • Oscillation: 0.1 Hz, Amplitude: 20
       • Duration: 10 minutes, Sampling: 1 Hz
       • Data Removal: 8%
       • Outliers: 3%, Constant: 999
  2. Preview
  3. Capture: Chart showing both gaps and outliers
  4. Purpose: Demonstrate realistic badly-degraded data

Multi-Channel Considerations

For multi-channel time series:

Data Removal

  • Applied once to all channels
  • Same timestamps removed from all channels
  • Maintains synchronization across channels

Effect:

Before:
Time    Ch1   Ch2   Ch3
0.0     10    20    30
1.0     11    21    31
2.0     12    22    32

After (remove 1.0):
Time    Ch1   Ch2   Ch3
0.0     10    20    30
2.0     12    22    32
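The synchronized removal shown in the table above can be sketched with a hypothetical helper (not Phoenix code): one list of drop indices is applied to every channel at once.

```python
def remove_rows(times, channels, drop_indices):
    """Drop the same rows (timestamps) from every channel, keeping the
    channels aligned."""
    keep = [i for i in range(len(times)) if i not in set(drop_indices)]
    return ([times[i] for i in keep],
            {name: [col[i] for i in keep] for name, col in channels.items()})

times, chans = remove_rows(
    [0.0, 1.0, 2.0],
    {"Ch1": [10, 11, 12], "Ch2": [20, 21, 22], "Ch3": [30, 31, 32]},
    [1],  # drop the t=1.0 row everywhere
)
# times == [0.0, 2.0]; every channel loses row 1 and stays in sync
```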

Outlier Insertion

  • Applied independently to each channel
  • Different points affected in each channel
  • Outlier value method applies to all channels

Effect (constant value 999):

Before:
Time    Ch1   Ch2   Ch3
0.0     10    20    30
1.0     11    21    31
2.0     12    22    32

After (2 outliers):
Time    Ch1   Ch2   Ch3
0.0     10    999   30
1.0     999   21    31
2.0     12    22    999

Testing Scenarios

Scenario 1: Light Degradation

Goal: Test algorithms on realistic sensor data

Configuration:

Data Removal: 1-2%
Outliers: 0.5-1%, Constant: 9999

Simulates: High-quality sensors with occasional issues

Scenario 2: Moderate Degradation

Goal: Test robustness of cleaning algorithms

Configuration:

Data Removal: 5-10%
Outliers: 2-5%, Random Range: 0 to 10× mean

Simulates: Typical industrial sensor network

Scenario 3: Heavy Degradation

Goal: Stress test data pipelines

Configuration:

Data Removal: 20-30%
Outliers: 10-15%, Constant: 9999

Simulates: Failing sensors, severe environmental interference

Scenario 4: Specific Fault Types

Sensor Saturation:

Outliers: 5%, Constant: 9999
Removal: 0%

Communication Failures:

Removal: 15%
Outliers: 0%

EMI/RFI Interference:

Outliers: 10%, Random Range: -1000 to 1000
Removal: 3%

Calibration Drift:

Outliers: 20%, Factor: 1.5 (50% gain error)
Removal: 0%

Best Practices

Start Conservative

Begin with low degradation levels:

  1. Generate clean signal
  2. Add 1% degradation
  3. Verify cleaning algorithms work
  4. Gradually increase degradation

Match Reality

Base degradation levels on actual sensor specs:

  • Sensor datasheets (reliability %)
  • Historical data quality metrics
  • Field experience

Document Degradation Parameters

When saving, include degradation info in description:

Good: "Motor vibration, 5% missing, 2% outliers at 9999"
Poor: "Test data"

Test Multiple Levels

Create multiple versions with increasing degradation:

Series 1: Clean (0% degradation)
Series 2: Light (2% missing, 1% outliers)
Series 3: Moderate (10% missing, 5% outliers)
Series 4: Heavy (30% missing, 15% outliers)

Use Realistic Outlier Values

Good Outlier Values:

  • Sensor maximum: 9999, -9999
  • Physical impossibility: -273°C (below absolute zero)
  • Out of range: 150°C for a 0-100°C sensor

Poor Outlier Values:

  • Barely out of range: 105 for a 0-100°C sensor (too close to valid readings)
  • Too subtle: 101 for a mean=100, noise=5 signal

Consider Your Use Case

  • Algorithm Development: Use consistent degradation
  • Robustness Testing: Use variable, random degradation
  • Benchmarking: Use standardized degradation levels

Troubleshooting

Can't see removed points in chart

Expected Behavior: Removed points disappear completely (gaps in data)

How to Verify:

  • Check statistics: point count should be reduced
  • Look for gaps in the time axis (timestamps skip)
  • Export CSV: missing rows for removed timestamps

If you don't see the effect:

  • Check the removal percentage isn't too low (< 1%)
  • Verify you clicked "Preview" after configuring
  • Zoom in on the chart to see gaps
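After exporting, removal can also be verified programmatically by scanning for timestamp jumps larger than the sampling interval. This sketch assumes a nominally uniform series:

```python
def find_gaps(timestamps, expected_dt, tol=1e-9):
    """Return (before, after) timestamp pairs where consecutive points
    skip more than one sampling interval."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:])
            if (b - a) > expected_dt + tol]

# 1 Hz series with the points at t=2 and t=3 removed
find_gaps([0.0, 1.0, 4.0, 5.0], 1.0)   # [(1.0, 4.0)]
```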

Can't see outliers in chart

Possible Causes:

1. Outlier value too close to signal

Problem: Mean=100, Noise=10, Outlier=120
Fix: Use 999 or -999 (clearly distinct)

2. Too few outliers

Problem: 0.1% of 100 points = 0 outliers
Fix: Use at least 1% or absolute number

3. Y-axis auto-scale obscures outliers

Problem: Rare outliers compress main signal
Fix: Zoom in or check statistics (min/max will show outliers)

Outliers look wrong

Check Configuration:

  • Verify the value mode (constant/range/factor)
  • Check the constant value is as expected
  • For range: verify min < max
  • For factor: verify factor != 1

Error: "Cannot remove more points than exist"

Cause: Trying to remove more points than available

Examples:

Problem: Remove 150 points from 100-point series
Fix: Reduce to 50 points or use percentage mode

Problem: Remove 110% of points
Fix: Use ≤ 100%

Degradation applied multiple times

Concern: Clicking "Preview" multiple times might stack degradation on top of degradation

Reality: Degradation is not cumulative:

  • Each preview regenerates fresh data
  • Degradation is applied to the clean signal each time
  • Repeated previews never stack

Advanced Techniques

Creating Realistic Gap Patterns

Use the Gaps distribution strategy to generate contiguous missing-data blocks in a single step. Configure the number of gaps to match your target failure mode — a single large gap for a prolonged outage, multiple smaller gaps for intermittent blackouts. For strictly periodic patterns (e.g., every 6 hours exactly), generate first and post-process externally.
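Such external post-processing might look like the following for a strictly periodic outage. This is a sketch, not a Phoenix feature; the function name and parameters are assumptions:

```python
def remove_periodic_windows(timestamps, values, period_s, window_s,
                            offset_s=0.0):
    """Drop every point whose time falls inside a repeating outage window,
    e.g. period_s=6*3600 for an outage at the start of every 6 hours."""
    kept = [(t, v) for t, v in zip(timestamps, values)
            if not ((t - offset_s) % period_s < window_s)]
    return [t for t, _ in kept], [v for _, v in kept]

# 1 Hz data with a 2-second blackout at the start of every 10-second period
ts, vs = remove_periodic_windows(list(range(20)), [1.0] * 20, 10, 2)
# t in {0, 1, 10, 11} removed -> 16 points remain
```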

Combining with FORGE Testing

Workflow:

  1. Generate clean data in Phoenix
  2. Save with an identifiable name
  3. Generate a degraded version with the same parameters + degradation
  4. Save the degraded version
  5. Run the degraded series through FORGE cleaning
  6. Compare FORGE output with the clean original
  7. Measure cleaning effectiveness
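The final comparison step needs a quantitative metric. RMSE against the clean original is one simple choice, assuming both series are aligned on the same timestamps (e.g., after the cleaner has imputed removed points):

```python
import math

def cleaning_rmse(original, cleaned):
    """Root-mean-square error between the clean original and the cleaned
    output; a perfect cleaner scores 0.0."""
    if len(original) != len(cleaned):
        raise ValueError("align the series on timestamps before comparing")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, cleaned))
                     / len(original))

cleaning_rmse([100.0, 100.0, 100.0], [100.0, 101.0, 100.0])
# ~0.577: one residual error of 1 unit across three points
```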

Multiple Outlier Types

Phoenix supports one outlier type per generation. For multiple types:

Option 1: Generate multiple series

Series A: 2% constant outliers (9999)
Series B: 3% random range outliers
Series C: 1% factor outliers (×10)

Option 2: External post-processing

  • Generate with one type
  • Export
  • Add additional outlier types externally
  • Reimport to FORGE/CEREBRO

Controlled Degradation Placement

Phoenix uses random selection. For specific timestamp targeting:

  1. Generate clean data
  2. Export to CSV
  3. Manually edit specific rows
  4. Reimport
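The edit-and-reimport steps can be done with a short CSV round-trip. The column names `time` and `value` here are assumptions about the export format, not a documented schema:

```python
import csv
import io

# Stand-in for an exported Phoenix CSV
raw = "time,value\n0.0,100\n1.0,101\n2.0,99\n"

rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    if row["time"] == "1.0":       # target one specific timestamp
        row["value"] = "9999"      # inject the outlier by hand

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["time", "value"])
writer.writeheader()
writer.writerows(rows)
edited = out.getvalue()            # reimport this file afterwards
```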

Real-World Examples

Example 1: Temperature Sensor Network

Scenario: 4 temperature sensors, wireless network with 98% reliability

Configuration:

Channels: 4 (correlated 0.85-0.9)
Duration: 24 hours
Sampling: 0.01 Hz (1 sample per 100 seconds)
Data Removal: 2% (simulates 98% reliability)
Outliers: 0.5%, Constant: -999 (sensor fault code)

[Screenshot Instructions: Sensor Network]

  1. Configure 4 channels as above
  2. Add degradation
  3. Duration: 2 hours, Sampling: 0.01 Hz
  4. Preview
  5. Capture: Multi-channel chart with gaps and outliers
  6. Purpose: Realistic sensor network scenario

Example 2: Vibration Monitor with EMI

Scenario: Accelerometer near high-voltage equipment

Configuration:

Signal: 30 Hz vibration, Amplitude: 2, Noise: 0.2
Duration: 10 seconds
Sampling: 200 Hz
Outliers: 5%, Random Range: -50 to 50 (EMI spikes)
Data Removal: 1% (rare sensor dropouts)

Example 3: Pressure Sensor with Saturation

Scenario: Pressure sensor periodically hits maximum reading

Configuration:

Signal: Mean=150 kPa, Noise=5, Oscillation: 0.1 Hz/Amp=30
Duration: 30 minutes
Sampling: 1 Hz
Outliers: 3%, Constant: 999 (sensor max)
Data Removal: 0%

Summary

Phoenix data degradation features allow comprehensive testing of data quality issues:

Data Point Removal:

  • Creates gaps (missing timestamps)
  • Number or percentage mode controls total points removed
  • Two distribution strategies: Random (scattered) or Gaps (contiguous blocks)
  • Gaps mode: specify the number of blocks; sizes and positions are randomized
  • Applied once to all channels

Outlier Insertion:

  • Corrupts values (preserves timestamps)
  • Three value modes: constant, random range, factor
  • Number or percentage mode
  • Independent per channel (multi-channel)

Best Practices:

  • Start with low degradation levels
  • Match real-world sensor specifications
  • Document degradation parameters
  • Test cleaning algorithms incrementally

Next Steps

Data degradation is essential for building robust data processing pipelines that handle real-world data quality challenges.