MIPPIA Logo

Fusion Segment Transformer:
Bi-directional Attention-Guided Fusion Network for AI-Generated Music Detection

Submitted @ ICASSP 2026

Yumin Kim*, Seonghyeon Go*

MIPPIA Inc.

Abstract

With the rise of generative AI technology, anyone can now easily create and deploy AI-generated music, which has heightened the need for technical solutions to address copyright and ownership issues. While existing works have largely focused on short audio, the challenge of full-audio detection, which requires modeling long-term structure and context, remains insufficiently explored. To address this, we propose an improved version of the Segment Transformer, termed the Fusion Segment Transformer. As in our previous work, we extract content embeddings from short music segments using diverse feature extractors. Furthermore, we enhance the architecture for full-audio AI-generated music detection by introducing a Gated Fusion Layer that effectively integrates content and structural information, enabling the capture of long-term context. Experiments on the SONICS and AIME datasets show that our approach consistently outperforms the previous model and recent baselines, achieving state-of-the-art results in full-audio segment detection.

Preliminaries

Our previous work: Segment Transformer
Accepted @ APSIPA 2025

We first provide a brief overview of the architecture of our previous work, the Segment Transformer, which proposes a two-stage AI-generated music detection (AIGM) framework.

🎼 Stage-1: Short Audio Segment Detection Model

In stage-1, we aim to extract meaningful embeddings from a given short segment. We use pre-trained self-supervised learning (SSL) models, including Wav2vec, Music2Vec, and MERT, along with the pre-trained FXencoder, to obtain embeddings that capture local musical characteristics. To better adapt these encoder outputs to the AIGM detection task, we designed the AudioCAT framework, which flexibly employs the SSL models and the FXencoder as feature extractors and combines them with a fixed cross-attention-based Transformer decoder.
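The sketch below illustrates this idea in PyTorch: frame-level features from a frozen extractor are pooled into a single segment embedding by a small cross-attention Transformer decoder with a learnable query. The dimensions, the learnable-query pooling, and the dummy input are illustrative assumptions, not the released AudioCAT implementation.

```python
# Minimal sketch of an AudioCAT-style segment model: a frozen feature extractor
# produces frame-level features, and a small cross-attention Transformer decoder
# pools them into one segment embedding. All sizes are illustrative assumptions.
import torch
import torch.nn as nn


class AudioCATSegmentModel(nn.Module):
    def __init__(self, feat_dim=768, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)               # adapt extractor features
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # learnable pooling query
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 2)                      # real vs. AI-generated

    def forward(self, frame_feats):
        # frame_feats: (batch, n_frames, feat_dim) from a frozen SSL model / FXencoder
        memory = self.proj(frame_feats)
        query = self.query.expand(frame_feats.size(0), -1, -1)
        seg_emb = self.decoder(query, memory).squeeze(1)       # (batch, d_model)
        return seg_emb, self.head(seg_emb)


if __name__ == "__main__":
    feats = torch.randn(8, 200, 768)  # stand-in for frozen MERT/Music2Vec frame features
    emb, logits = AudioCATSegmentModel()(feats)
    print(emb.shape, logits.shape)    # torch.Size([8, 256]) torch.Size([8, 2])
```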

Segment Transformer Figure 1
🎼 Stage-2: Full Audio Segment Detection Model
Segment Transformer Figure 2

In stage-2, we use a beat-tracking algorithm to split music tracks into four-bar segments and extract embeddings with the stage-1 model, resulting in a sequence of segment embeddings \( E = \{e_1, e_2, \dots, e_N\} \), where each \( e_i \in \mathbb{R}^d \) and \( d \) is the embedding dimension of the extractor.
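As a rough sketch of this preprocessing step (assuming 4/4 meter, so a four-bar segment spans 16 beats, and using librosa's beat tracker as a stand-in for the actual beat-tracking algorithm; `embed_segment` is a placeholder for the stage-1 extractor):

```python
# Illustrative sketch: beat-track a full track, group beats into four-bar
# windows, and embed each window with the stage-1 model.
import librosa
import numpy as np

BEATS_PER_SEGMENT = 16  # 4 bars x 4 beats, assuming 4/4 meter


def split_into_four_bar_segments(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_samples = librosa.frames_to_samples(beat_frames)
    segments = []
    for start in range(0, len(beat_samples) - BEATS_PER_SEGMENT, BEATS_PER_SEGMENT):
        s = beat_samples[start]
        e = beat_samples[start + BEATS_PER_SEGMENT]
        segments.append(y[s:e])
    return segments


def embed_segments(segments, embed_segment):
    # embed_segment: callable mapping a waveform to a d-dimensional embedding
    return np.stack([embed_segment(seg) for seg in segments])  # E in R^{N x d}
```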

We then compute a self-similarity matrix \( M \in \mathbb{R}^{N \times N} \) to capture structural information, and we input both the embeddings and the similarity matrix into two parallel Transformer encoders. The final representation \( h_{\text{final}} = h_{\text{content}} \oplus h_{\text{similarity}} \) integrates local segment features with global structural patterns through simple concatenation.
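A minimal sketch of this previous stage-2 design follows; cosine similarity for the SSM, mean pooling, and the fixed layer sizes are our assumptions for illustration.

```python
# Sketch of the previous stage-2 design: a cosine self-similarity matrix M over
# the segment embeddings, two parallel Transformer encoders (one over the
# embeddings, one over the rows of M), and simple concatenation of the pooled outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F


def self_similarity(E):
    # E: (batch, N, d) -> M: (batch, N, N) of pairwise cosine similarities
    E_norm = F.normalize(E, dim=-1)
    return E_norm @ E_norm.transpose(1, 2)


def make_encoder(in_dim, d_model, n_heads, n_layers):
    return nn.Sequential(
        nn.Linear(in_dim, d_model),
        nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers,
        ),
    )


class SegmentTransformerStage2(nn.Module):
    def __init__(self, d, n_segments, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.content_enc = make_encoder(d, d_model, n_heads, n_layers)           # segment embeddings
        self.similarity_enc = make_encoder(n_segments, d_model, n_heads, n_layers)  # SSM rows
        self.head = nn.Linear(2 * d_model, 2)

    def forward(self, E):
        M = self_similarity(E)
        h_content = self.content_enc(E).mean(dim=1)
        h_similarity = self.similarity_enc(M).mean(dim=1)
        h_final = torch.cat([h_content, h_similarity], dim=-1)  # simple concatenation
        return self.head(h_final)
```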

🎼 Proposed Method:
Fusion Segment Transformer

Methodology

This work follows a two-stage pipeline that extends the architecture of our previous work, the Segment Transformer.

🎼 Stage-1: Feature Embedding Extractor for Short Audio Segment Detection
Proposed Method Figure 1

In our previous work, we employed various feature extractors within the AudioCAT framework to obtain segment-level embeddings, with models pre-trained using cross-entropy loss. However, these models focus primarily on temporal patterns rather than detailed frequency-domain analysis. To explore whether frequency-domain features could complement AIGM detection, we experimented with integrating the Muffin Encoder into the AudioCAT feature extractor.
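As a hedged illustration of such an integration, the sketch below uses a toy mel-spectrogram CNN as a stand-in for the Muffin Encoder; concatenating its frames with the SSL frames along the time axis before the cross-attention decoder is our assumption, not the paper's exact design.

```python
# Illustrative sketch of adding a frequency-domain branch to the feature
# extractor. The CNN is a placeholder, not the actual Muffin Encoder.
import torch
import torch.nn as nn


class ToyFreqEncoder(nn.Module):
    """Placeholder frequency-domain encoder over a mel-spectrogram."""

    def __init__(self, n_mels=128, out_dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, out_dim, kernel_size=3, padding=1),
        )

    def forward(self, mel):                    # mel: (batch, n_mels, n_frames)
        return self.conv(mel).transpose(1, 2)  # (batch, n_frames, out_dim)


def combined_memory(ssl_feats, mel, freq_encoder):
    # ssl_feats: (batch, T1, 768) time-domain frames; mel: (batch, 128, T2)
    freq_feats = freq_encoder(mel)                    # (batch, T2, 768)
    return torch.cat([ssl_feats, freq_feats], dim=1)  # joint memory for the decoder
```

The combined memory would then be fed to the cross-attention decoder in place of the time-domain frames alone, as in the AudioCAT sketch above.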

🎼 Stage-2: Fusion Segment Transformer for Full Audio Segment Detection
Proposed Method Figure 2

We extract segment embeddings from stage-1 and compute a self-similarity matrix (SSM) to capture music structure. Unlike the previous Segment Transformer that simply concatenated these two features, our Fusion Segment Transformer employs dual streams: an embedding stream for content and an SSM stream for structure. A cross-modal fusion layer with bi-directional cross-attention and a gated unit integrates content and structural information. The fused representation is finally fed into a classification head for AI-generated music detection.
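The following sketch shows one plausible form of such a fusion layer, with the exact gating formula and mean pooling as our assumptions rather than the paper's specification.

```python
# Minimal sketch of the gated fusion idea: bi-directional cross-attention
# between the content stream (segment embeddings) and the structure stream
# (SSM rows), followed by a learned gate that mixes the two attended views.
import torch
import torch.nn as nn


class GatedFusionLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn_c2s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_s2c = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_content, h_structure):
        # h_content, h_structure: (batch, N, d_model) from the two encoder streams
        c_att, _ = self.attn_c2s(h_content, h_structure, h_structure)  # content attends to structure
        s_att, _ = self.attn_s2c(h_structure, h_content, h_content)    # structure attends to content
        g = torch.sigmoid(self.gate(torch.cat([c_att, s_att], dim=-1)))
        fused = g * c_att + (1.0 - g) * s_att                          # gated mixture
        return fused.mean(dim=1)                                       # pooled for the classifier
```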

Quantitative Results

SONICS

SONICS results table

Comparison of various methods for full-audio detection (stage-2) on the SONICS dataset. Our proposed method is highlighted in yellow.
Best: bold; second best: underline.

AIME

AIME results table

Comparison with the Segment Transformer for full-audio detection (stage-2) on the AIME dataset.
Best: bold; second best: underline.

Segment Transformer

@article{kim2025segment,
  title={Segment Transformer: AI-Generated Music Detection via Music Structural Analysis},
  author={Kim, Yumin and Go, Seonghyeon},
  journal={arXiv preprint arXiv:2509.08283},
  year={2025},
  note={Accepted for publication in APSIPA ASC 2025}
}