Technical Specification & Legal Notice

BR SYSTEMS FVD Analysis Tool
Technical Specifications & Terms of Use

This page describes the system configuration and input specifications of our proprietary Fake Voice Detection (FVD) system, along with important terms of service.
On the ASVspoof 2019 LA benchmark, the RawNet2 Official model achieved EER = 4.487%.

System Overview

System Configuration

The FVD system extracts 197-dimensional acoustic features from audio and uses a machine learning model to classify each file as Real or Synthetic. It consists of two GUI tools and a deep learning inference engine.
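To give a feel for the kind of acoustic feature involved, the sketch below computes two of the listed features, Jitter and Shimmer, using the common "local" definition (mean absolute difference between consecutive cycles, divided by the mean). It assumes the pitch periods and peak amplitudes have already been measured per glottal cycle. The function names are illustrative and are not taken from the FVD code, which may define these features differently.

```python
from statistics import mean


def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def local_jitter(periods_sec):
    """Cycle-to-cycle variation of the pitch period (assumed input: seconds per cycle)."""
    return relative_perturbation(periods_sec)


def local_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (assumed input: linear amplitude)."""
    return relative_perturbation(peak_amplitudes)


if __name__ == "__main__":
    # Hypothetical measurements for four consecutive cycles.
    print(local_jitter([0.010, 0.011, 0.010, 0.011]))
    print(local_shimmer([0.80, 0.82, 0.79, 0.81]))
```

A perfectly periodic signal gives 0 for both values, and less regular phonation gives larger values.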

fvd_train_gui.py

FVD Training Tool

Specify the directories containing Real and Synthetic audio to train a model, which is saved as a pkl file. Feature groups and the model type (RF / SVM / GB) are selected in the GUI.

  • Individual ON/OFF selection of feature groups
  • Model types: Random Forest / SVM / Gradient Boosting
  • Score methods: predict_proba / Platt Scaling / Ensemble
  • Training results: AUC, EER, Feature Importance displayed in real time
  • Settings auto-saved as JSON (reproducibility)
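Saving settings as JSON for reproducibility can be sketched as follows. The key names (`feature_groups`, `model_type`, `score_method`, `random_seed`) are assumptions made for illustration. The actual schema written by fvd_train_gui.py is not documented here.

```python
import json
from pathlib import Path

# Hypothetical settings layout; the real tool's keys may differ.
EXAMPLE_SETTINGS = {
    "feature_groups": {"mfcc": True, "lfcc": True, "cqcc": False},
    "model_type": "random_forest",
    "score_method": "predict_proba",
    "random_seed": 42,
}


def save_settings(settings, path):
    """Write settings as sorted, indented JSON so that diffs stay readable."""
    text = json.dumps(settings, indent=2, sort_keys=True)
    Path(path).write_text(text, encoding="utf-8")


def load_settings(path):
    """Read settings back as a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Storing the settings file next to the trained model makes it possible to rerun training later with the same feature groups and model type.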
fvd_gui.py

FVD Detection Tool

Load a trained model (pkl or pth) and perform authenticity judgment, detailed analysis, and visualization of audio files in one place.

  • Compare Both: judges two files individually and shows the results side by side
  • Batch ROC: calculates ROC-AUC, EER, and the optimal threshold
  • Feature Importance: ranked visualization of the features behind a judgment
  • Statistical Analysis: statistical difference tests and spectral comparison
  • Threshold Analysis: detailed classification results per threshold
  • RawNet2 Official: Deep Learning inference + Occlusion Sensitivity
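The Batch ROC quantities can be illustrated with a small, dependency-free sketch. It assumes that a higher score means "more likely Synthetic" and that labels are 1 for Synthetic and 0 for Real. The EER is approximated at the threshold where the false positive and false negative rates are closest, and the optimal threshold is chosen by Youden's J (TPR minus FPR). This is not the tool's own implementation, and a production version would normally use an optimized ROC routine.

```python
def roc_points(labels, scores):
    """Return (threshold, fpr, tpr) per distinct score; positive when score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        points.append((thr, fp / neg, tp / pos))
    return points


def equal_error_rate(labels, scores):
    """Approximate EER: average of FPR and FNR where the two are closest."""
    _, fpr, tpr = min(roc_points(labels, scores), key=lambda p: abs(p[1] - (1 - p[2])))
    return (fpr + (1 - tpr)) / 2


def youden_threshold(labels, scores):
    """Threshold maximizing Youden's J = TPR - FPR."""
    thr, _, _ = max(roc_points(labels, scores), key=lambda p: p[2] - p[1])
    return thr


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
    print(equal_error_rate(labels, scores), youden_threshold(labels, scores))
```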
Technical Specification

Technical Specifications

Key parameters of the analysis engine.

| Item | Specification / Description |
| --- | --- |
| Sampling Rate | 44,100 Hz (unified across all modules) |
| Feature Dimensions | 197 dimensions (adjustable via settings) |
| Key Features | MFCC (39-dim) / LFCC (60-dim) / CQCC (60-dim) / Group Delay / Mel Statistics / Jitter / Shimmer / ZCR / Pitch |
| Classification Model | Random Forest / Gradient Boosting / SVM (selectable via GUI) |
| Score Method | predict_proba / Platt Scaling / Cosine+Euclidean / Ensemble |
| Evaluation Metrics | ROC-AUC / EER / Optimal Threshold (Youden's J) / Accuracy / Confusion Matrix / min t-DCF |
| Supported Formats | WAV (recommended: 44,100 Hz / mono / 16-bit) / MP3 / FLAC, etc. |
| Validated Performance (Japanese) | 7 speakers: ROC-AUC = 1.000, EER = 0.3% (same-speaker condition) |
| Validated Performance (English) | ASVspoof 2019 LA evaluation set (71,237 samples). RawNet2 Official: EER = 4.487%, min t-DCF = 0.12352. Lower EER than the LFCC+GMM baseline (EER≈8%) |
| Deep Learning Model | RawNet2 Official (Tak et al., ICASSP 2021). SincConv + Channel Attention + ResBlocks×6 + GRU×3 + SWA. Visualization: Mel Spectrogram × Occlusion Sensitivity (4 panels) |
| Runtime Environment | Python 3.10 / Windows 10 or later / Anaconda environment. GPU: NVIDIA RTX 5070 recommended (CUDA 12.8) |
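The idea behind Occlusion Sensitivity can be sketched in a few lines: mask one time window at a time, rescore the audio, and record how much the score drops. The example below is a simplified one-dimensional version with a stand-in scoring function (`mean_abs_amplitude`). In the actual tool the score would come from the RawNet2 model and the result is drawn over a Mel spectrogram. All names here are illustrative assumptions.

```python
def mean_abs_amplitude(samples):
    """Stand-in scorer for the demo only; a real detector would return a spoof score."""
    return sum(abs(x) for x in samples) / len(samples)


def occlusion_sensitivity(samples, score_fn, window, hop=None, fill=0.0):
    """Return (start, end, score_drop) for each occluded window.

    A large score_drop means the model relied heavily on that region.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    hop = window if hop is None else hop
    samples = list(samples)
    baseline = score_fn(samples)
    result = []
    for start in range(0, len(samples), hop):
        end = min(start + window, len(samples))
        # Replace the window with a constant fill value and rescore.
        occluded = samples[:start] + [fill] * (end - start) + samples[end:]
        result.append((start, end, baseline - score_fn(occluded)))
    return result


if __name__ == "__main__":
    signal = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
    for start, end, drop in occlusion_sensitivity(signal, mean_abs_amplitude, window=4):
        print(start, end, drop)
```

Smaller windows give a finer map at the cost of more model evaluations, one per window.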
Input Data Specification

Training Data Input Specifications

Required data and recommended conditions for the FVD Training Tool.

  1. Real Audio Folder: genuine voice WAV files
  2. Synthetic Audio Folder: AI-synthesized WAV files
  3. Output Folder: save destination for pkl, CSV, and reports
  4. Model Settings Selection: specify features and model via the GUI

Recommended Recording & File Conditions

  • Format: WAV / 44,100 Hz / mono / 16-bit (automatic conversion is available via audio_preprocessor.py)
  • Audio Length: At least 1.5 seconds. Recommended: 3–8 seconds (1–2 sentences of natural speech)
  • Content: Natural speech such as daily conversation or reading text. Avoid extremely biased phonemic content (e.g., repeated vowels)
  • Speakers: Training with multiple speakers (a mix of male and female voices) improves how well the model generalizes
  • File Count: Minimum 50 files per class, recommended 150+ files per class
  • Synthetic Audio: Synthesize using the target speaker's Real audio as speaker_wav in BR-TTS NNW (XTTS v2)
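The format and length conditions above can be checked before training with a short script. The sketch below uses only Python's standard `wave` module, so it handles PCM WAV files only. The function name `check_wav` and the wording of its messages are assumptions for illustration and are not part of audio_preprocessor.py.

```python
import wave


def check_wav(path, rate=44100, channels=1, sample_width_bytes=2, min_seconds=1.5):
    """Return a list of problems; an empty list means the file meets the conditions."""
    problems = []
    with wave.open(str(path), "rb") as wf:
        if wf.getframerate() != rate:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected {rate} Hz")
        if wf.getnchannels() != channels:
            problems.append(f"{wf.getnchannels()} channels, expected {channels}")
        if wf.getsampwidth() != sample_width_bytes:
            problems.append(f"{8 * wf.getsampwidth()}-bit samples, expected {8 * sample_width_bytes}-bit")
        duration = wf.getnframes() / wf.getframerate()
        if duration < min_seconds:
            problems.append(f"duration {duration:.2f} s is shorter than {min_seconds} s")
    return problems
```

Running this over each class folder and counting the files that pass also gives a quick check against the 50-file minimum per class.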
Note on Saving Recording Transcripts

Text transcripts of recordings are not required for FVD training purposes. This system analyzes "how the voice sounds" — acoustic features such as Jitter, Shimmer, and Spectral Flatness — rather than "what is being said."
However, if you plan to integrate speaker verification or ASR (speech recognition) in the future, it is recommended to create a management table (CSV) linking audio files to their corresponding transcripts.
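One possible shape for such a management table is sketched below, using Python's standard `csv` module. The column names (`file`, `label`, `speaker_id`, `transcript`) are assumptions chosen for illustration. Any layout that links each audio file to its transcript would serve the same purpose.

```python
import csv

# Hypothetical column layout for the management table.
FIELDNAMES = ["file", "label", "speaker_id", "transcript"]


def write_manifest(rows, path):
    """Write one row per audio file; newline='' avoids blank lines on Windows."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_manifest(path):
    """Read the table back as a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    rows = [
        {"file": "real/spk01_001.wav", "label": "real", "speaker_id": "spk01",
         "transcript": "Good morning, how are you?"},
        {"file": "synthetic/spk01_001.wav", "label": "synthetic", "speaker_id": "spk01",
         "transcript": "Good morning, how are you?"},
    ]
    write_manifest(rows, "manifest.csv")
    print(read_manifest("manifest.csv"))
```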

Legal & Policy

Important Terms of Use

Please review the following before using this service.

Accuracy & Disclaimer
  • Analysis results are reference information based on acoustic statistical models
  • Results do not constitute legal evidence
  • Accuracy may vary for unknown TTS engines or post-processed audio
  • Final judgments are the sole responsibility of the client
Handling of Audio Data
  • Submitted data will be used solely for analysis purposes
  • Audio data will be securely deleted once the analysis is complete
  • Personal information contained in audio will not be shared with third parties
  • NDA signing is available for confidential data
Restrictions on Use
  • Use is permitted only for legitimate purposes of verifying audio authenticity
  • Use that infringes on the rights of third parties is strictly prohibited
  • Using results for unfounded defamation or reputational damage is prohibited
Citation & Attribution
  • When publishing results in media or reports, please cite as: "Acoustic analysis results by BR SYSTEMS VoiceGuard Analytics"
  • Please refrain from exaggerating accuracy (e.g., "100% detection")
  • Unfounded comparative claims against competitors are not permitted
Open Source Licenses

Libraries & Licenses

This system uses the following open-source libraries. All of their licenses permit commercial use.

| Library | License | Purpose |
| --- | --- | --- |
| librosa | ISC License | Acoustic feature extraction |
| scikit-learn | BSD License | Machine learning models |
| numpy | BSD License | Numerical computation |
| PyTorch | BSD License | Deep learning framework |
| torchaudio | BSD License | Audio deep learning |
| torchcontrib | BSD License | SWA optimization |
| sounddevice | MIT License | Audio playback |
| matplotlib | PSF License | Graph rendering |
| PyWavelets | MIT License | Wavelet analysis |
| tkinter | PSF License | GUI framework |
| soundfile | BSD License | Audio file I/O |
| RawNet2Spoof | MIT License | Official RawNet2 (NAVER Corp.) |
Commercial Use

All libraries listed above are licensed in a way that permits deployment in a commercial service. The ISC, MIT, BSD, and PSF licenses allow free use, including commercial use, provided that the copyright notices are retained.

Quality Commitment

Commitment to Quality Improvement

ASVspoof 2019 LA Benchmark Validation Results

We continuously validate performance on ASVspoof 2019 LA, a widely used benchmark for fake voice detection. Using the official RawNet2 implementation (Tak et al., ICASSP 2021), our latest validation on the evaluation set (71,237 samples) achieved EER = 4.487% and min t-DCF = 0.12352. This EER is lower than that of the LFCC+GMM baseline (EER≈8%). We are currently preparing a manuscript for submission to an international peer-reviewed journal (IEEE Access).

Ongoing Response to AI Voice Synthesis

AI voice synthesis technology is advancing rapidly, and detection technology must continually keep pace. We constantly monitor developments in new TTS engines and voice conversion technologies, and continue to expand training data and update models to maintain and improve practical detection accuracy. As the next step, we are considering the introduction of AASIST (Graph Attention Network).

Inquiries & Contact

For questions about our services or quote requests, please feel free to contact us. We typically respond within one business day.

info@brsystems.jp BR SYSTEMS Official Website