Technical Specification
BR SYSTEMS FVD Analysis Tool
Technical Specifications & Terms of Use
This page describes the system configuration and input specifications of our proprietary
Fake Voice Detection (FVD) system, along with important terms of service.
Achieved EER = 4.487% (RawNet2 Official) on the ASVspoof 2019 LA benchmark.
System Configuration
The FVD system extracts 197-dimensional acoustic features from audio and uses a machine learning model to classify each file as Real or Synthetic. It consists of two GUI tools plus a deep learning inference engine.
FVD Training Tool
Specify Real and Synthetic audio directories to train a model (pkl). Feature groups and model type (RF/SVM/GB) can be selected via GUI.
- Individual ON/OFF selection of feature groups
- Model types: Random Forest / SVM / Gradient Boosting
- Score methods: predict_proba / Platt Scaling / Ensemble
- Training results: AUC, EER, Feature Importance displayed in real time
- Settings auto-saved as JSON (reproducibility)
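To illustrate how auto-saved JSON settings support reproducibility, the following minimal sketch round-trips a training configuration through a file. The key names (`feature_groups`, `model_type`, `score_method`) and the file name are hypothetical; the tool's actual schema is not documented on this page.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical settings layout; the real tool's JSON schema may differ.
settings = {
    "feature_groups": {"MFCC": True, "LFCC": True, "CQCC": False},
    "model_type": "RF",               # one of RF / SVM / GB
    "score_method": "predict_proba",
}


def save_settings(cfg: dict, path: Path) -> None:
    """Write settings as sorted, indented JSON so diffs between runs stay stable."""
    path.write_text(json.dumps(cfg, indent=2, sort_keys=True), encoding="utf-8")


def load_settings(path: Path) -> dict:
    """Read settings back from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "fvd_settings.json"   # hypothetical file name
    save_settings(settings, cfg_path)
    restored = load_settings(cfg_path)
```

Reloading the same file before a later training run would restore the same feature groups and model type, which is what makes a run repeatable.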
FVD Detection Tool
Load a trained model (pkl or pth) and perform authenticity judgment, detailed analysis, and visualization of audio files in one place.
- Compare Both: judges two files individually and compares the results
- Batch ROC: ROC-AUC, EER, Optimal Threshold calculation
- Feature Importance: ranked visualization of the features behind each judgment
- Statistical Analysis: tests for statistical differences, spectral comparison
- Threshold Analysis: detailed classification results at each threshold
- RawNet2 Official: Deep Learning inference + Occlusion Sensitivity
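The idea behind Occlusion Sensitivity can be sketched as follows: mask one region of the input at a time and record how much the model score changes. This is a simplified illustration that occludes time windows of a raw waveform and uses a toy scoring function (`toy_score`, hypothetical) in place of the RawNet2 model; the Detection Tool's own implementation and its Mel-spectrogram visualization may work differently.

```python
from typing import Callable, List, Sequence


def occlusion_sensitivity(
    signal: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    window: int,
) -> List[float]:
    """Return, per window, how much the score drops when that window is zeroed."""
    base = score_fn(signal)
    drops = []
    for start in range(0, len(signal), window):
        occluded = list(signal)
        end = min(start + window, len(signal))
        occluded[start:end] = [0.0] * (end - start)   # mask this region
        drops.append(base - score_fn(occluded))
    return drops


def toy_score(x: Sequence[float]) -> float:
    """Stand-in for a model score: mean absolute amplitude (NOT a real detector)."""
    return sum(abs(v) for v in x) / len(x)


signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
drops = occlusion_sensitivity(signal, toy_score, window=2)
most_influential = max(range(len(drops)), key=drops.__getitem__)
```

Windows with the largest drop are the regions the score depends on most; in this toy input that is the middle window, the only one containing signal.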
Technical Specifications
Key parameters of the analysis engine.
| Item | Specification / Description |
|---|---|
| Sampling Rate | 44,100 Hz (unified across all modules) |
| Feature Dimensions | 197 dimensions (adjustable via settings) |
| Key Features | MFCC (39-dim) / LFCC (60-dim) / CQCC (60-dim) / Group Delay / Mel Statistics / Jitter / Shimmer / ZCR / Pitch |
| Classification Model | Random Forest / Gradient Boosting / SVM (selectable via GUI) |
| Score Method | predict_proba / Platt Scaling / Cosine+Euclidean / Ensemble |
| Evaluation Metrics | ROC-AUC / EER / Optimal Threshold (Youden's J) / Accuracy / Confusion Matrix / min t-DCF |
| Supported Formats | WAV (recommended, 44,100 Hz / mono / 16-bit) / MP3 / FLAC etc. |
| Validated Performance (Japanese) | 7 speakers / ROC-AUC = 1.000, EER = 0.3% (same-speaker condition) |
| Validated Performance (English) | ASVspoof 2019 LA evaluation set (71,237 samples); RawNet2 Official: EER = 4.487%, min t-DCF = 0.12352; lower than the LFCC+GMM baseline (EER ≈ 8%) |
| Deep Learning Model | RawNet2 Official (Tak et al., ICASSP 2021); SincConv + Channel Attention + ResBlocks×6 + GRU×3 + SWA; Visualization: Mel Spectrogram × Occlusion Sensitivity (4 panels) |
| Runtime Environment | Python 3.10 / Windows 10 or later / Anaconda environment; GPU: NVIDIA RTX 5070 recommended (CUDA 12.8) |
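For readers unfamiliar with the evaluation metrics in the table, the sketch below shows one simple way to compute EER and the Youden's J optimal threshold from a list of scores. It assumes label 1 = Synthetic, label 0 = Real, and that higher scores mean "more likely Synthetic"; the scores are made-up illustration data, and the tool's own calculation (for example, any interpolation between ROC points) may differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (threshold, FPR, TPR)


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """Return (threshold, FPR, TPR) for every distinct score used as a threshold.

    A sample is flagged Synthetic when score >= threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / neg, tp / pos))
    return points


def eer(points: List[Point]) -> float:
    """Approximate EER at the threshold where FPR and FNR (= 1 - TPR) are closest."""
    _, fpr, tpr = min(points, key=lambda p: abs(p[1] - (1.0 - p[2])))
    return (fpr + (1.0 - tpr)) / 2.0


def youden_threshold(points: List[Point]) -> float:
    """Threshold that maximizes Youden's J = TPR - FPR."""
    return max(points, key=lambda p: p[2] - p[1])[0]


# Made-up scores for eight files (illustration only).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

points = roc_points(scores, labels)
eer_value = eer(points)               # 0.25 for this toy data
best_thr = youden_threshold(points)   # 0.7 for this toy data
```

EER is the error rate at the operating point where false acceptances and false rejections balance, while Youden's J picks the threshold that best separates the two classes, so the two need not coincide.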
Training Data Input Specifications
Required data and recommended conditions for the FVD Training Tool.
| Item | Type | Description |
|---|---|---|
| Real Audio | Folder | Genuine voice WAV files |
| Synthetic Audio | Folder | AI-synthesized WAV files |
| Output | Folder | Save destination for pkl, CSV, reports |
| Model Settings | Selection | Specify features and model via GUI |
Recommended Recording & File Conditions
- Format: WAV / 44,100 Hz / mono / 16-bit (auto-conversion available with audio_preprocessor.py)
- Audio Length: At least 1.5 seconds. Recommended: 3–8 seconds (1–2 sentences of natural speech)
- Content: Natural speech such as daily conversation or reading text. Avoid extremely biased phonemic content (e.g., repeated vowels)
- Speakers: Training with multiple speakers (mixed male/female) helps the model generalize to unseen speakers
- File Count: Minimum 50 files per class, recommended 150+ files per class
- Synthetic Audio: Synthesize using the target speaker's Real audio as speaker_wav in BR-TTS NNW (XTTS v2)
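The format and length conditions above can be checked before training. The following sketch uses only Python's standard `wave` module and is an illustration, not the bundled audio_preprocessor.py; the helper names (`check_wav`, `write_silence`) and the demo files are hypothetical.

```python
import tempfile
import wave
from pathlib import Path
from typing import List

TARGET_RATE = 44_100   # Hz, from the recommended conditions
MIN_SECONDS = 1.5      # minimum audio length


def check_wav(path: Path) -> List[str]:
    """Return a list of deviations from the recommended WAV conditions."""
    problems = []
    with wave.open(str(path), "rb") as wf:
        if wf.getframerate() != TARGET_RATE:
            problems.append(f"sample rate {wf.getframerate()} Hz")
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels (mono expected)")
        if wf.getsampwidth() != 2:           # 2 bytes per sample = 16-bit
            problems.append(f"{8 * wf.getsampwidth()}-bit samples")
        duration = wf.getnframes() / wf.getframerate()
        if duration < MIN_SECONDS:
            problems.append(f"duration {duration:.2f} s")
    return problems


def write_silence(path: Path, seconds: float, rate: int = TARGET_RATE) -> None:
    """Demo helper: write mono 16-bit silence of the given length."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.wav"
    short = Path(tmp) / "short.wav"
    write_silence(good, 2.0)                  # meets every condition
    write_silence(short, 0.5, rate=16_000)    # wrong rate and too short
    good_report = check_wav(good)
    short_report = check_wav(short)
```

An empty list means the file meets the format conditions; running the check over both class folders also gives a quick file count against the 50-file minimum.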
Text transcripts of recordings are not required for FVD training.
This system analyzes how the voice sounds (acoustic features such as
Jitter, Shimmer, and Spectral Flatness) rather than what is being said.
However, if you plan to integrate speaker verification or ASR (speech recognition)
in the future, we recommend creating a management table (CSV) that links
audio files to their corresponding transcripts.
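One possible shape for such a management table is sketched below with Python's standard `csv` module. The column names, file paths, and transcript text are hypothetical examples, not a format required by the FVD tools.

```python
import csv
import io

# Hypothetical column names and file names, for illustration only.
FIELDS = ["file", "label", "speaker", "transcript"]
rows = [
    {"file": "real/spk01_001.wav", "label": "real", "speaker": "spk01",
     "transcript": "Good morning, how are you?"},
    {"file": "synthetic/spk01_001.wav", "label": "synthetic", "speaker": "spk01",
     "transcript": "Good morning, how are you?"},
]

# Write to an in-memory buffer; for a real file, use
# open("manifest.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read the table back to confirm that transcripts containing commas
# survive CSV quoting unchanged.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
```

Keeping one row per audio file, keyed by its relative path, lets the same table serve FVD training now and transcript-dependent tasks later.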
Important Terms of Use
Please review the following before using this service.
- Analysis results are reference information based on acoustic statistical models
- Results do not constitute legal evidence
- Accuracy may vary for unknown TTS engines or post-processed audio
- Final judgments are the sole responsibility of the client
- Submitted data will be used solely for analysis purposes
- Audio data will be securely deleted upon completion
- Personal information contained in audio will not be shared with third parties
- NDA signing is available for confidential data
- Use is permitted only for legitimate purposes of verifying audio authenticity
- Use that infringes on the rights of third parties is strictly prohibited
- Using results for unfounded defamation or reputational damage is prohibited
- When publishing results in media or reports, please cite as: "Acoustic analysis results by BR SYSTEMS VoiceGuard Analytics"
- Please refrain from exaggerating accuracy (e.g., "100% detection")
- Unfounded comparative claims against competitors are not permitted
Libraries & Licenses
This system uses the following open-source libraries. All are licensed for commercial use.
The ISC, MIT, BSD, and PSF licenses under which these libraries are distributed permit commercial service deployment and free use, provided that copyright notices are retained.
Commitment to Quality Improvement
We conduct continuous performance validation on ASVspoof 2019 LA, a widely used benchmark for fake voice detection. Using the official RawNet2 implementation (Tak et al., ICASSP 2021), our latest validation on the evaluation set (71,237 samples) achieved EER = 4.487% and min t-DCF = 0.12352. This EER is lower than that of the LFCC+GMM baseline (EER ≈ 8%), and we are currently preparing a manuscript for submission to an international peer-reviewed journal (IEEE Access).
AI voice synthesis technology is advancing rapidly, and detection technology must continually keep pace. We constantly monitor developments in new TTS engines and voice conversion technologies, and continue to expand training data and update models to maintain and improve practical detection accuracy. As the next step, we are considering the introduction of AASIST (Graph Attention Network).
Inquiries & Contact
For questions about our services or quote requests, please feel free to contact us. We typically respond within one business day.
Email: info@brsystems.jp
Website: BR SYSTEMS Official Website