VoiceGuard Analytics — Fake Voice Detection Service

Detect Fake Voices with Scientific Precision.

Powered by Deep Learning & Acoustic Analysis

BR SYSTEMS' VoiceGuard Analytics combines machine learning and deep learning in a multi-stage approach to detect AI-synthesized voices with high accuracy. Simply send us your audio files by email to receive a detailed analysis report.

  • AUC (Japanese): 1.000
  • EER (Japanese): 0.3%
  • EER (English/RawNet2): 4.5%
  • Feature Dimensions: 197
  • Validated Speakers: 7+

Background & Risk

Threats from AI Voice Synthesis

Advanced TTS systems such as XTTS v2 and VALL-E enable anyone to create highly realistic synthetic voices that convincingly imitate real humans.

01. Voice Impersonation

Attacks using synthesized voices of specific individuals to bypass identity verification and authentication systems are becoming a real threat.

02. Audio Deepfakes

Fake audio of politicians and public figures is spreading, seriously damaging public trust and personal reputations.

03. Phone & Business Fraud

Fraudulent calls that impersonate family members or supervisors are increasing, and the synthesized voices are difficult to distinguish from genuine ones.

04. Content Authenticity

Verifying the authenticity of interviews, testimonies, and recordings has become increasingly difficult, undermining legal and journalistic trust.

Services

Two Analysis Services

Choose between a universal model and a speaker-specific model based on your use case and accuracy requirements.

Universal FVD

Universal Fake Voice Detection

No speaker registration required

No speaker information is required, so analysis can begin immediately. A general-purpose model that supports multiple speakers and TTS engines determines whether the submitted audio is synthesized.

  • No speaker registration — same-day analysis available
  • 197-dimensional acoustic features + GradientBoosting
  • Dual judgment with RawNet2 deep learning model
  • Detailed report including ROC-AUC, EER, and confidence scores
  • Batch analysis (multiple files at once) supported
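
One way to read "dual judgment" is as a rule that compares two per-file scores. The sketch below is illustrative only: the page does not state how the GradientBoosting and RawNet2 outputs are combined, so the function name, the 0.5 threshold, and the "inconclusive" outcome are all assumptions.

```python
def dual_judgment(gb_prob: float, rawnet2_prob: float, threshold: float = 0.5) -> str:
    """Combine two synthetic-probability scores into one verdict.

    Hypothetical rule: report a class only when both models agree;
    otherwise flag the file as inconclusive for manual review.
    """
    for p in (gb_prob, rawnet2_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")
    gb_fake = gb_prob >= threshold
    rawnet2_fake = rawnet2_prob >= threshold
    if gb_fake and rawnet2_fake:
        return "synthetic"
    if not gb_fake and not rawnet2_fake:
        return "real"
    return "inconclusive"
```

Requiring agreement trades coverage for caution: files on which the two models disagree are surfaced rather than silently labeled.
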
Personalized FVD

Personalized Fake Voice Detection

Speaker-specific high-accuracy model

Pre-register voice samples of the target speaker to build a speaker-specific high-accuracy model. Particularly effective for impersonation detection.

  • Pre-register authentic voice samples (Real)
  • Achieve extremely high accuracy with a speaker-specific model
  • Speaker verification via ECAPA-TDNN embeddings
  • Includes a detailed Threshold Analysis report
  • Continuous model update option available
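
Speaker verification with embeddings usually comes down to comparing an enrolled vector with a probe vector. The sketch below assumes the embeddings have already been extracted (for example by an ECAPA-TDNN model) and uses cosine similarity with a hypothetical threshold of 0.5; the page does not say which similarity measure or threshold the service uses.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enrolled: Sequence[float], probe: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Accept the probe if it is close enough to the enrolled embedding.

    The threshold is a placeholder; in practice it would be tuned on
    held-out recordings of the registered speaker.
    """
    return cosine_similarity(enrolled, probe) >= threshold
```
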
Process

Completed in 4 Steps

Simply send your audio files by email to receive a comprehensive analysis report.

1. Inquiry

Tell us about the audio to be analyzed, the number of files, and the purpose of the analysis. We will provide a quote the same day.

2. Send Audio Files

Send the target audio files (WAV recommended) by email.

3. Multi-Stage Analysis

Precision analysis using 197-dimensional acoustic features and RawNet2 deep learning.

4. Report Delivery

Comprehensive report including ROC curves, AUC, feature importance, and judgment rationale.

Technology

Analysis Technology Overview

A multi-stage approach combining machine learning and deep learning achieves high-accuracy judgments.

Feature-Based Analysis

Extracts 197-dimensional acoustic features and classifies them with GradientBoosting. The feature set combines MFCC, LFCC, CQCC, Group Delay, Mel statistics, and Jitter/Shimmer. Because each dimension is a named acoustic measure, the basis for a judgment can be traced and explained.

MFCC 39-dim | LFCC 60-dim | CQCC 60-dim | Group Delay | Mel Stats | Jitter / Shimmer | GradientBoosting
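
The stated group sizes account for 39 + 60 + 60 = 159 of the 197 dimensions, which leaves 38 for Group Delay, Mel statistics, and Jitter/Shimmer. The sketch below only checks that arithmetic while assembling one vector. How the remaining 38 dimensions are split is not given on this page, so they are treated as a single `other` block, and the concatenation order is an assumption.

```python
from typing import List, Sequence

# Dimensions stated on the page.
STATED_DIMS = {"mfcc": 39, "lfcc": 60, "cqcc": 60}
TOTAL_DIMS = 197
# Group Delay + Mel Stats + Jitter/Shimmer; the exact split is not stated.
REMAINING_DIMS = TOTAL_DIMS - sum(STATED_DIMS.values())  # 38


def assemble_feature_vector(mfcc: Sequence[float], lfcc: Sequence[float],
                            cqcc: Sequence[float],
                            other: Sequence[float]) -> List[float]:
    """Concatenate per-group features into one 197-dimensional vector."""
    groups = {"mfcc": mfcc, "lfcc": lfcc, "cqcc": cqcc}
    for name, values in groups.items():
        if len(values) != STATED_DIMS[name]:
            raise ValueError(f"{name} must have {STATED_DIMS[name]} values")
    if len(other) != REMAINING_DIMS:
        raise ValueError(f"other must have {REMAINING_DIMS} values")
    return [*mfcc, *lfcc, *cqcc, *other]
```
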

Deep Learning Model (RawNet2 Official)

An end-to-end neural network that takes raw waveforms directly as input. The SincConv + Channel Attention + ResBlocks + GRU architecture learns subtle voice characteristics that feature-based approaches cannot capture. Occlusion Sensitivity visualization shows which time-frequency regions contributed to the judgment.

RawNet2 Official | SincConv + Attention | ResBlocks x6 | GRU x3 | SWA | Occlusion Sensitivity | CUDA / RTX 5070
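
Occlusion sensitivity can be illustrated without a neural network: mask one region of the input at a time and record how much the model score drops. The sketch below is a simplified, time-only version that works on a plain sequence and any scoring callable. The service's visualization covers time-frequency regions of a Mel spectrogram, so `score_fn`, the window size, and the zero fill value here are assumptions for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def occlusion_sensitivity(
    signal: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    window: int,
    hop: Optional[int] = None,
    fill: float = 0.0,
) -> List[Tuple[int, int, float]]:
    """Return (start, end, score_drop) for each occluded window.

    A large positive drop means the occluded region contributed
    strongly to the original score.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    hop = hop or window
    baseline = score_fn(signal)
    results = []
    for start in range(0, len(signal), hop):
        end = min(start + window, len(signal))
        occluded = list(signal)
        occluded[start:end] = [fill] * (end - start)  # mask this region
        results.append((start, end, baseline - score_fn(occluded)))
    return results
```
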

ASVspoof 2019 LA Benchmark: EER Comparison

Method                     | Description                                       | EER   | AUC   | Lang
LFCC + GMM                 | ASVspoof 2019 Official Baseline                   | 8.0%  | -     | EN
GradientBoosting + 197-dim | BR-FVD Feature Engineering                        | 13.4% | 0.944 | EN
RawNet2 Official + SWA     | BR-FVD Deep Learning (English / ASVspoof 2019 LA) | 4.5%  | -     | EN
GradientBoosting + 197-dim | BR-FVD (Japanese / 7 speakers)                    | 0.3%  | 1.000 | JA
AASIST                     | World State-of-the-Art (reference)                | 0.8%  | -     | EN

Verified Performance

Technology Validated by World-Standard Benchmarks

Detection Accuracy — Proven by the Numbers

Our system has been rigorously validated against ASVspoof 2019 LA (71,237 samples), the world-standard evaluation framework for fake voice detection. Using the official RawNet2 implementation (Tak et al., ICASSP 2021), we achieved EER = 4.487% and min t-DCF = 0.12352. This error rate is well below that of the official baseline (LFCC+GMM, EER≈8%) and delivers production-ready detection accuracy.

For Japanese speech, we have achieved ROC-AUC = 1.000 and EER = 0.3%, fully meeting commercial service standards. To further strengthen confidence in our technology, we are also pursuing independent third-party validation through submission to an international peer-reviewed journal.

  • ROC-AUC for Japanese speech (7-speaker validation): 1.000
  • EER (Equal Error Rate) for Japanese speech: 0.3%
  • EER for English (ASVspoof 2019 LA): 4.49% (exceeds world-standard baseline)

Coverage

Types of Fake Voice Covered

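Audio deepfakes are broadly classified into five categories: TTS, VC, Emotion Fake, Scene Fake, and Partially Fake. Below is an honest overview of what BR-FVD currently covers in each category, plus telephone-channel audio as a separate transmission condition, and our roadmap for future development.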

TTS (Text-to-Speech)

Supported

Synthesized voice generated by neural TTS systems such as XTTS v2, VALL-E, and StyleBERT. Validated on ASVspoof 2019 LA (attacks A01–A19), with EER = 4.487%.

VC (Voice Conversion)

Supported

Voice converted from speaker A to speaker B. Trained on ASVspoof VC attacks (A05, A06, A17–A19). VAE-based VC is the most challenging attack type in the benchmark.

Emotion Fake

Partial

Voice with artificially altered emotion or tone from the same speaker. Detectable via Jitter, Shimmer, and related features. Model enhancement with dedicated training data is under consideration.
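
Jitter and shimmer are commonly defined as the cycle-to-cycle variation in pitch period and peak amplitude, normalized by the mean. The sketch below computes these "local" variants. It assumes the per-cycle periods and amplitudes have already been measured by a pitch tracker, and the function names are illustrative rather than taken from the service.

```python
from typing import Sequence


def local_perturbation(values: Sequence[float]) -> float:
    """Mean absolute difference of consecutive values, divided by the mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    mean_value = sum(values) / len(values)
    if mean_value == 0:
        raise ValueError("mean must be non-zero")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / mean_value


def jitter_local(periods_s: Sequence[float]) -> float:
    """Local jitter from consecutive glottal cycle durations (seconds)."""
    return local_perturbation(periods_s)


def shimmer_local(peak_amplitudes: Sequence[float]) -> float:
    """Local shimmer from consecutive per-cycle peak amplitudes."""
    return local_perturbation(peak_amplitudes)
```

A perfectly periodic signal gives a jitter of zero. Natural voices show small non-zero values, which is why unusually regular or irregular cycles can serve as a cue.
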

Scene Fake

Partial

Voice with manipulated background noise, reverberation, or environmental audio. Detectable via Spectral Flatness and related features. Validation with dedicated datasets is a future goal.
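
Spectral flatness is the ratio of the geometric mean to the arithmetic mean of a power spectrum: close to 1 for noise-like frames and close to 0 for tonal ones. The sketch below assumes the power spectrum of a frame has already been computed. The small `eps` guard against log(0) is an implementation choice, not something specified on this page.

```python
import math
from typing import Sequence


def spectral_flatness(power_spectrum: Sequence[float], eps: float = 1e-12) -> float:
    """Geometric mean over arithmetic mean of non-negative power bins."""
    if not power_spectrum:
        raise ValueError("power spectrum must not be empty")
    if any(p < 0 for p in power_spectrum):
        raise ValueError("power values must be non-negative")
    n = len(power_spectrum)
    # Average the logs, then exponentiate, to get the geometric mean stably.
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    geometric_mean = math.exp(log_mean)
    arithmetic_mean = sum(power_spectrum) / n + eps
    return geometric_mean / arithmetic_mean
```
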

Partially Fake

Planned

Voice where only a portion of an utterance is replaced. Difficult to detect with an overall score — segment-level judgment is required. Implementation including Occlusion Sensitivity is under consideration.
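
Segment-level judgment can be sketched as a sliding window that scores each short span separately and reports the spans that exceed a threshold. Everything below is an assumption for illustration: `score_fn` stands in for whatever detector is used, the 1.5-second window borrows the minimum clip length mentioned in the FAQ, and the 0.5-second hop and 0.5 threshold are placeholders.

```python
from typing import Callable, List, Sequence, Tuple


def flag_segments(
    samples: Sequence[float],
    sample_rate: int,
    score_fn: Callable[[Sequence[float]], float],
    window_s: float = 1.5,
    hop_s: float = 0.5,
    threshold: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Return (start_s, end_s, score) for windows scoring at or above threshold."""
    window = round(window_s * sample_rate)
    hop = round(hop_s * sample_rate)
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be at least one sample")
    flagged = []
    last_start = max(len(samples) - window, 0)
    for start in range(0, last_start + 1, hop):
        end = min(start + window, len(samples))
        score = score_fn(samples[start:end])
        if score >= threshold:
            flagged.append((start / sample_rate, end / sample_rate, score))
    return flagged
```

A single overall score averages over the whole utterance, so a short replaced portion can be diluted. Per-window scores keep it visible.
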

Telephone Channel Audio

Planned

Voice transmitted over PSTN or VoIP with bandwidth limiting and codec compression applied. Implementation is under consideration, including dedicated data augmentation and pre-processing.
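
One ingredient of telephone-channel augmentation is codec-style companding. The sketch below applies the continuous mu-law curve (mu = 255) with uniform 8-bit quantization to samples in [-1, 1]. It approximates the effect of a G.711-style codec but is not a bit-exact implementation, and it does not model band limiting or resampling to 8 kHz. The function names are illustrative.

```python
import math
from typing import List, Sequence

MU = 255.0


def mu_law_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the mu-law companded scale."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def simulate_mu_law_8bit(samples: Sequence[float]) -> List[float]:
    """Compress, quantize to 256 levels, and expand each sample."""
    degraded = []
    for x in samples:
        y = mu_law_compress(x)
        level = round((y + 1.0) / 2.0 * 255)   # integer level 0..255
        y_hat = level / 255 * 2.0 - 1.0        # back to [-1, 1]
        degraded.append(mu_law_expand(y_hat))
    return degraded
```
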

Deliverables

Analysis Report Contents

We provide detailed reports that scientifically visualize the basis for judgments, not just a simple yes/no answer.

ROC Curve & AUC

Receiver Operating Characteristic curve. Quantifies model discrimination performance using AUC and EER.
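
For readers who want to see how these two figures are obtained from per-file scores, here is a minimal sketch. It assumes labels of 1 for synthetic and 0 for real, and scores where higher means more likely synthetic. It uses a simple pairwise AUC and a threshold sweep for EER, which is fine for small sets but is not the evaluation code behind the numbers reported on this page.

```python
from typing import Sequence, Tuple


def _split(labels: Sequence[int], scores: Sequence[float]):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    return pos, neg


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a synthetic file outscores a real one (ties count half)."""
    pos, neg = _split(labels, scores)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def equal_error_rate(labels: Sequence[int],
                     scores: Sequence[float]) -> Tuple[float, float]:
    """Return (eer, threshold) where false-alarm and miss rates are closest."""
    pos, neg = _split(labels, scores)
    best = None
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(n >= t for n in neg) / len(neg)  # real flagged as synthetic
        frr = sum(p < t for p in pos) / len(pos)   # synthetic missed
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```
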

Score Distribution

Visualization of Real/Synthetic score distributions. Intuitively shows the degree of separation between the two classes.

Feature Importance

Importance ranking of 197-dimensional features. Shows which acoustic features served as the basis for the judgment.

Threshold Analysis

Detailed threshold-based classification. Includes individual indicator evaluations for Jitter, Shimmer, Spectral features, etc.

RawNet2 Deep Score + Occlusion

Deep learning model synthetic probability score with Mel Spectrogram x Occlusion Sensitivity visualization in 4 panels, showing which time-frequency regions contributed to the judgment.

Summary CSV

A CSV report listing synthetic probability scores, predicted labels, and confidence levels for each file.
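
As a rough picture of such a file, the sketch below writes one row per audio file with Python's `csv` module. The column names, the 0.5 decision threshold, and the definition of confidence as the larger of p and 1 - p are assumptions for illustration; the actual report layout may differ.

```python
import csv
from typing import Iterable, TextIO, Tuple


def write_summary_csv(results: Iterable[Tuple[str, float]], stream: TextIO,
                      threshold: float = 0.5) -> None:
    """Write filename, synthetic probability, label, and confidence rows."""
    writer = csv.writer(stream)
    writer.writerow(["file", "synthetic_probability", "predicted_label", "confidence"])
    for filename, prob in results:
        label = "synthetic" if prob >= threshold else "real"
        confidence = max(prob, 1.0 - prob)  # distance from a coin flip, in [0.5, 1]
        writer.writerow([filename, f"{prob:.3f}", label, f"{confidence:.3f}"])
```

Passing an open file (opened with `newline=""`) or an `io.StringIO` buffer as `stream` both work.
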

Pricing

Simple Pricing

We offer flexible plans tailored to the number of files and use case. Please feel free to contact us for a consultation.

Spot Analysis
Price: on inquiry
For one-time audio file verification. Ideal for a small number of files.
  • Universal FVD analysis
  • ROC & Score Distribution report
  • Summary CSV
  • Delivery: within 3 business days
Personalized
Price: on inquiry
Speaker-specific analysis. From voice sample submission to individual model construction.
  • Speaker-specific model construction
  • Personalized FVD analysis
  • Full report set (6 types)
  • Continuous model update option
  • Delivery: upon consultation
FAQ

Frequently Asked Questions

What kind of audio files should I send?
We recommend WAV format (44,100 Hz, mono, 16-bit). MP3 and m4a are also supported, but conversion processing is required. Audio length should be at least 1.5 seconds, with 3–8 seconds of natural speech recommended.
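
If you would like to check a WAV file against these recommendations before sending it, a sketch using Python's standard `wave` module follows. The function name and messages are illustrative. It only reads uncompressed PCM WAV files, and it reports mismatches rather than converting anything.

```python
import wave
from typing import List


def check_wav(path: str, min_seconds: float = 1.5) -> List[str]:
    """Return a list of mismatches against the recommended format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        duration = wav.getnframes() / rate
    problems = []
    if rate != 44100:
        problems.append(f"sample rate is {rate} Hz, recommended 44,100 Hz")
    if channels != 1:
        problems.append(f"{channels} channels, recommended mono")
    if sample_width != 2:
        problems.append(f"{sample_width * 8}-bit samples, recommended 16-bit")
    if duration < min_seconds:
        problems.append(f"duration {duration:.2f} s is under {min_seconds} s")
    return problems
```

An empty list means the file matches the recommended format and minimum length.
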
Which service should I choose — Universal FVD or Personalized FVD?
We recommend Universal FVD for verifying audio from unknown or anonymous speakers, and Personalized FVD for impersonation detection or voice protection of specific individuals. Please feel free to contact us if you are unsure.
How accurate is the detection?
For Japanese speech (7-speaker validation), we have achieved ROC-AUC = 1.000 and EER = 0.3%. For English (ASVspoof 2019 LA evaluation set, 71,237 samples), the official RawNet2 implementation achieved EER = 4.487% and min t-DCF = 0.12352, an error rate well below that of the world-standard baseline (EER≈8%). Please note that accuracy may vary for unknown TTS engines or post-processed audio.
How is the confidentiality of submitted audio data protected?
Audio data submitted for analysis is securely deleted once the analysis is complete. For highly confidential data, we can also sign a non-disclosure agreement (NDA).
When will the online service launch?
We are currently building a web application server. We plan to launch an online file upload and instant analysis service in the near future. We will announce the launch on this page when it is ready.

Contact Us

For service inquiries or quote requests, please feel free to reach out. We typically respond within one business day.

info@brsystems.jp