Abstract:
Acoustic signals are central to communication and navigation among cetaceans. To improve whale-call recognition, we propose a multimodal approach that fuses Mel-Frequency Cepstral Coefficients (MFCC) with VGGish deep representations. Using audio from four whale species, we extract 13-dimensional MFCC and 128-dimensional VGGish features, combine them via a dynamic weighting scheme, and refine the fused representation with mutual-information-based feature selection and Linear Discriminant Analysis (LDA). With Support Vector Machine (SVM) and Random Forest (RF) classifiers trained using five-fold cross-validation and hyperparameter tuning, the fused representation attains test-set accuracies of 99.28% (SVM) and 99.17% (RF), an average gain of about 3 percentage points over single-feature baselines, with recall exceeding 99%. Under varying signal-to-noise ratios, the fused features are consistently more robust than either MFCC or VGGish features alone. Ablation studies attribute the gains to the combined effect of dynamic weighting, feature selection, and dimensionality reduction. Although VGGish extraction increases computational cost, the gains in accuracy and robustness justify the added cost, indicating strong potential for practical deployment. Overall, the results show that fusing complementary shallow (MFCC) and deep (VGGish) features is effective for whale acoustic recognition and provides a promising foundation for high-sensitivity, non-intrusive marine bio-monitoring.
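
To make the pipeline outlined above concrete, the following is a minimal sketch of one way its stages could be chained with scikit-learn. It is not the authors' implementation. Several parts are assumptions made for illustration only: the features are synthetic stand-ins with the stated dimensions (13 and 128) rather than real MFCC or VGGish outputs, the "dynamic weighting" is approximated by weights proportional to each feature block's cross-validated accuracy on the training split (the actual weighting rule is not given here), and the number of selected features (`k=40`) and the SVM settings are arbitrary. Hyperparameter tuning and the RF classifier are omitted for brevity.

```python
from functools import partial

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic stand-ins for per-clip features of four species.
rng = np.random.default_rng(0)
n_classes, n_per_class = 4, 60
y = np.repeat(np.arange(n_classes), n_per_class)
X_mfcc = rng.normal(0.0, 1.5, (n_classes, 13))[y] + rng.normal(size=(y.size, 13))
X_vgg = rng.normal(0.0, 0.5, (n_classes, 128))[y] + rng.normal(size=(y.size, 128))

# Hold out a stratified test split; everything below is fit on train only.
idx_train, idx_test = train_test_split(
    np.arange(y.size), test_size=0.25, stratify=y, random_state=0
)
y_train, y_test = y[idx_train], y[idx_test]


def cv_accuracy(X, labels):
    """Five-fold CV accuracy of a simple SVM on one feature block."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return cross_val_score(clf, X, labels, cv=5).mean()


# ASSUMPTION: block weights proportional to single-block CV accuracy.
acc_mfcc = cv_accuracy(X_mfcc[idx_train], y_train)
acc_vgg = cv_accuracy(X_vgg[idx_train], y_train)
w_mfcc = acc_mfcc / (acc_mfcc + acc_vgg)
w_vgg = 1.0 - w_mfcc

# Standardize each block (statistics from train only), then weight and concatenate.
scaler_mfcc = StandardScaler().fit(X_mfcc[idx_train])
scaler_vgg = StandardScaler().fit(X_vgg[idx_train])


def fuse(rows):
    """Return the weighted, concatenated 141-dimensional features for the given rows."""
    return np.hstack([
        w_mfcc * scaler_mfcc.transform(X_mfcc[rows]),
        w_vgg * scaler_vgg.transform(X_vgg[rows]),
    ])


X_train, X_test = fuse(idx_train), fuse(idx_test)

# Mutual-information feature selection -> LDA projection -> SVM classifier.
model = make_pipeline(
    SelectKBest(partial(mutual_info_classif, random_state=0), k=40),
    LinearDiscriminantAnalysis(n_components=n_classes - 1),
    SVC(kernel="rbf", C=1.0),
)

# Note: the block scalers and weights were fit on the full training split,
# so this inner CV estimate is slightly optimistic.
cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
test_acc = model.fit(X_train, y_train).score(X_test, y_test)
print(f"weights: MFCC={w_mfcc:.3f}, VGGish={w_vgg:.3f}")
print(f"5-fold CV accuracy={cv_acc:.3f}, test accuracy={test_acc:.3f}")
```

In this sketch the SVM could be replaced by `sklearn.ensemble.RandomForestClassifier` to mirror the RF variant. One design point worth keeping in mind: per-block weights only rescale features, and both the mutual-information scores and LDA in scikit-learn are largely insensitive to per-feature rescaling, so in this particular ordering the weights have little influence on the final classifier. Where the weighting is applied in the actual method, and how the weights are computed, should be taken from the full paper rather than from this illustration.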