Technical Specification & Legal Notice

BR SYSTEMS FVD Analysis Tool
Technical Specifications & Terms of Use

This page describes the system configuration and input specifications of our proprietary Fake Voice Detection (FVD) system, along with important terms of service.
On the ASVspoof 2019 LA benchmark, the RawNet2 Official model achieved an EER of 4.487%.

System Overview

System Configuration

The FVD system extracts 197-dimensional acoustic features from audio and classifies them with a machine learning model to label each file as Real or Synthetic. It consists of two GUI tools plus a deep learning inference engine.

fvd_train_gui.py

FVD Training Tool

Specify Real and Synthetic audio directories to train a model (pkl). Feature groups and model type (RF/SVM/GB) can be selected via GUI.

Individual ON/OFF selection of feature groups
Model types: Random Forest / SVM / Gradient Boosting
Score methods: predict_proba / Platt Scaling / Ensemble
Training results: AUC, EER, Feature Importance displayed in real time
Settings auto-saved as JSON (reproducibility)
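As a rough illustration of the auto-saved settings, the sketch below round-trips a training configuration through JSON. The key names (`feature_groups`, `model_type`, `score_method`, `sampling_rate`) and both helper functions are hypothetical; the actual file layout written by fvd_train_gui.py is not documented on this page.

```python
import json
from pathlib import Path

# Hypothetical settings layout; the real tool's keys are not documented here.
settings = {
    "feature_groups": {"mfcc": True, "lfcc": True, "cqcc": True, "jitter_shimmer": False},
    "model_type": "RF",               # one of "RF", "SVM", "GB"
    "score_method": "predict_proba",
    "sampling_rate": 44100,
}


def save_settings(settings: dict, path: Path) -> None:
    """Write settings as sorted, indented JSON so that two runs are easy to diff."""
    path.write_text(json.dumps(settings, indent=2, sort_keys=True), encoding="utf-8")


def load_settings(path: Path) -> dict:
    """Read back a settings file written by save_settings."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Storing the file next to the trained pkl would make it possible to tell later which feature groups and model type produced a given model.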

fvd_gui.py

FVD Detection Tool

Load a trained model (pkl or pth) and perform authenticity judgment, detailed analysis, and visualization of audio files in one place.

Compare Both: side-by-side judgment of two files
Batch ROC: ROC-AUC, EER, Optimal Threshold calculation
Feature Importance: ranked visualization of the features behind a judgment
Statistical Analysis: statistical difference test, spectral comparison
Threshold Analysis: detailed classification by threshold
RawNet2 Official: Deep Learning inference + Occlusion Sensitivity
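The Batch ROC metrics can be illustrated with a small, dependency-free sketch. It assumes label 1 means Synthetic, label 0 means Real, and that a higher score means "more likely Synthetic"; these conventions and all function names are assumptions, not the tool's documented behavior. Every distinct score is tried as a threshold, which is quadratic in the number of files but keeps the logic easy to follow.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for every distinct score used as a threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")
    points = []
    for thr in sorted(set(scores)):
        # A file is predicted Synthetic when its score is >= the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points


def equal_error_rate(scores, labels):
    """Return (eer, threshold) at the point where FPR and FNR are closest."""
    thr, fpr, tpr = min(roc_points(scores, labels),
                        key=lambda p: abs(p[1] - (1 - p[2])))
    return (fpr + (1 - tpr)) / 2, thr


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR."""
    thr, fpr, tpr = max(roc_points(scores, labels), key=lambda p: p[2] - p[1])
    return thr, tpr - fpr
```

The EER threshold balances the two error types, while the Youden threshold maximizes TPR minus FPR, so the two need not coincide on the same data.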

Technical Specification

Technical Specifications

Key parameters of the analysis engine.

Item	Specification / Description
Sampling Rate	44,100 Hz (unified across all modules)
Feature Dimensions	197 dimensions (adjustable via settings)
Key Features	MFCC (39-dim) / LFCC (60-dim) / CQCC (60-dim) / Group Delay / Mel Statistics / Jitter / Shimmer / ZCR / Pitch
Classification Model	Random Forest / Gradient Boosting / SVM (selectable via GUI)
Score Method	predict_proba / Platt Scaling / Cosine+Euclidean / Ensemble
Evaluation Metrics	ROC-AUC / EER / Optimal Threshold (Youden's J) / Accuracy / Confusion Matrix / min t-DCF
Supported Formats	WAV (recommended, 44,100 Hz / mono / 16-bit) / MP3 / FLAC etc.
Validated Performance (Japanese)	7 speakers / ROC-AUC = 1.000, EER = 0.3% (same-speaker condition)
Validated Performance (English)	ASVspoof 2019 LA evaluation set (71,237 samples); RawNet2 Official: EER = 4.487%, min t-DCF = 0.12352; well below the EER of the LFCC+GMM baseline (EER≈8%)
Deep Learning Model	RawNet2 Official (Tak et al., ICASSP 2021); SincConv + Channel Attention + ResBlocks×6 + GRU×3 + SWA; Visualization: Mel Spectrogram × Occlusion Sensitivity (4 panels)
Runtime Environment	Python 3.10 / Windows 10 or later / Anaconda environment; GPU: NVIDIA RTX 5070 recommended (CUDA 12.8)
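Three of the Key Features listed above (Jitter, Shimmer, ZCR) have compact textbook definitions, sketched below. This is an illustration under stated assumptions, not the FVD extraction code: it assumes the glottal cycle periods and per-cycle peak amplitudes have already been measured by a pitch tracker, and the function names are hypothetical.

```python
from statistics import fmean


def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value.

    Applied to cycle periods this gives local jitter; applied to per-cycle
    peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return fmean(diffs) / fmean(values)


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as non-negative)."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


# Example inputs (made-up values): periods in seconds, amplitudes in arbitrary units.
periods = [0.010, 0.012, 0.010, 0.012]
amplitudes = [0.80, 0.78, 0.82, 0.79]
jitter = local_perturbation(periods)
shimmer = local_perturbation(amplitudes)
```

Both perturbation measures are ratios, so they are independent of the absolute pitch or recording level.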

Input Data Specification

Training Data Input Specifications

Required data and recommended conditions for the FVD Training Tool.

Real Audio (Folder): Genuine voice WAV files
Synthetic Audio (Folder): AI-synthesized WAV files
Output (Folder): Save destination for pkl, CSV, reports
Model Settings (Selection): Specify features and model via GUI

Recommended Recording & File Conditions

Format: WAV / 44,100 Hz / mono / 16-bit (auto-conversion available with audio_preprocessor.py)
Audio Length: At least 1.5 seconds. Recommended: 3–8 seconds (1–2 sentences of natural speech)
Content: Natural speech such as daily conversation or reading text. Avoid extremely biased phonemic content (e.g., repeated vowels)
Speakers: Training with multiple speakers (mixed male/female) improves how well the model generalizes to new voices
File Count: Minimum 50 files per class, recommended 150+ files per class
Synthetic Audio: Synthesize using the target speaker's Real audio as speaker_wav in BR-TTS NNW (XTTS v2)
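The format, length, and file-count conditions above can be checked before training with Python's standard `wave` module. The sketch below is a hypothetical pre-flight check, not part of the FVD tools; the function names are assumptions, and `wave` reads only uncompressed PCM WAV files, so MP3 or FLAC inputs would need converting first.

```python
import wave
from pathlib import Path

TARGET_RATE = 44_100         # Hz, as recommended above
MIN_SECONDS = 1.5            # minimum audio length
MIN_FILES_PER_CLASS = 50     # minimum file count per class


def check_wav(path):
    """Return a list of problems; an empty list means the file meets the conditions."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        if rate != TARGET_RATE:
            problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE} Hz")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
        duration = wav.getnframes() / rate
        if duration < MIN_SECONDS:
            problems.append(f"duration {duration:.2f} s is below {MIN_SECONDS} s")
    return problems


def check_class_folder(folder):
    """Return (has_enough_files, {file path: problems}) for one class folder."""
    files = sorted(Path(folder).glob("*.wav"))
    issues = {str(f): check_wav(f) for f in files}
    issues = {name: found for name, found in issues.items() if found}
    return len(files) >= MIN_FILES_PER_CLASS, issues
```

Running `check_class_folder` once on the Real folder and once on the Synthetic folder gives a quick list of files to re-record or convert.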

Note on Saving Recording Transcripts

Text transcripts of recordings are not required for FVD training purposes. This system analyzes "how the voice sounds" (acoustic features such as Jitter, Shimmer, and Spectral Flatness) rather than "what is being said."
However, if you plan to integrate speaker verification or ASR (speech recognition) in the future, it is recommended to create a management table (CSV) linking audio files to their corresponding transcripts.
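One possible shape for such a management table is sketched below using the standard `csv` module. The column names (`file`, `label`, `speaker_id`, `transcript`) and the helper functions are hypothetical; any layout that links each audio file to its transcript would serve the same purpose.

```python
import csv

# Hypothetical column names for the management table.
FIELDS = ["file", "label", "speaker_id", "transcript"]


def write_manifest(rows, csv_path):
    """Write one row per audio file; the csv module quotes commas in transcripts."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_manifest(csv_path):
    """Read the table back as a list of dictionaries keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


# Example rows (made-up file names and text).
example_rows = [
    {"file": "real/spk01_001.wav", "label": "real", "speaker_id": "spk01",
     "transcript": "Good morning, how are you?"},
    {"file": "synthetic/spk01_001.wav", "label": "synthetic", "speaker_id": "spk01",
     "transcript": "Good morning, how are you?"},
]
```

Writing the file as UTF-8 keeps non-English transcripts intact.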

Legal & Policy

Important Terms of Use

Please review the following before using this service.

Accuracy & Disclaimer

Analysis results are reference information based on acoustic statistical models
Results do not constitute legal evidence
Accuracy may vary for unknown TTS engines or post-processed audio
Final judgments are the sole responsibility of the client

Handling of Audio Data

Submitted data will be used solely for analysis purposes
Audio data will be securely deleted upon completion
Personal information contained in audio will not be shared with third parties
NDA signing is available for confidential data

Restrictions on Use

Use is permitted only for legitimate purposes of verifying audio authenticity
Use that infringes on the rights of third parties is strictly prohibited
Using results for unfounded defamation or reputational damage is prohibited

Citation & Attribution

When publishing results in media or reports, please cite as: "Acoustic analysis results by BR SYSTEMS VoiceGuard Analytics"
Please refrain from exaggerating accuracy (e.g., "100% detection")
Unfounded comparative claims against competitors are not permitted

Open Source Licenses

Libraries & Licenses

This system uses the following open-source libraries. All are licensed for commercial use.

Library	License	Purpose
librosa	ISC License	Acoustic feature extraction
scikit-learn	BSD License	Machine learning models
numpy	BSD License	Numerical computation
PyTorch	BSD License	Deep learning framework
torchaudio	BSD License	Audio deep learning
torchcontrib	BSD License	SWA optimization
sounddevice	MIT License	Audio playback
matplotlib	PSF License	Graph rendering
PyWavelets	MIT License	Wavelet analysis
tkinter	PSF License	GUI framework
soundfile	BSD License	Audio file I/O
RawNet2Spoof	MIT License	Official RawNet2 (NAVER Corp.)

Commercial Use

All libraries listed above are licensed to permit commercial service deployment. The ISC, MIT, BSD, and PSF licenses allow free use, including commercial use, provided that the copyright and license notices are retained.

Quality Commitment

Commitment to Quality Improvement

ASVspoof 2019 LA Benchmark Validation Results

We conduct continuous performance validation on ASVspoof 2019 LA, a widely used benchmark for fake voice detection. Using the official RawNet2 implementation (Tak et al., ICASSP 2021), our latest validation on the evaluation set (71,237 samples) achieved EER = 4.487% and min t-DCF = 0.12352. This EER is well below that of the LFCC+GMM baseline (EER≈8%), and we are currently preparing a manuscript for submission to an international peer-reviewed journal (IEEE Access).

Ongoing Response to AI Voice Synthesis

AI voice synthesis technology is advancing rapidly, and detection technology must continually keep pace. We constantly monitor developments in new TTS engines and voice conversion technologies, and continue to expand training data and update models to maintain and improve practical detection accuracy. As the next step, we are considering the introduction of AASIST (Graph Attention Network).

Inquiries & Contact

For questions about our services or quote requests, please feel free to contact us. We typically respond within one business day.

info@brsystems.jp BR SYSTEMS Official Website
