Qualitative Results of FxSearcher

This page demonstrates the results of FxSearcher, a gradient-free, text-driven FX parameter search agent. The system takes a source audio file and a target text prompt, then uses the CLAP model and Bayesian Optimization to find the optimal effects chain and parameters.
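To make the setup concrete, here is a minimal sketch of a single objective evaluation (not the actual FxSearcher code): it assumes Spotify's pedalboard library for FX rendering, and clap_similarity is a hypothetical callable that wraps a CLAP model and returns the audio-text similarity.

```python
import numpy as np
import soundfile as sf
from pedalboard import Pedalboard, Reverb, PitchShift

def make_objective(audio: np.ndarray, sr: int, prompt: str, clap_similarity):
    """Return a params -> score function for the optimizer to maximize."""
    def objective(params: dict) -> float:
        # Build the candidate FX chain from the parameters proposed by the optimizer.
        board = Pedalboard([
            Reverb(room_size=params["room_size"], wet_level=params["wet_level"]),
            PitchShift(semitones=params["semitones"]),
        ])
        processed = board(audio, sr)                   # render the effected audio
        return clap_similarity(processed, sr, prompt)  # CLAP audio-text similarity
    return objective

# Illustrative usage (file name, scorer, and parameter values are placeholders):
# audio, sr = sf.read("source.wav", dtype="float32")
# objective = make_objective(audio, sr, "A ghostly child's voice echoing in a hallway",
#                            clap_similarity=my_clap_scorer)
# score = objective({"room_size": 0.8, "wet_level": 0.4, "semitones": -3.0})
```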

Benchmark Results

Comparative results between our FxSearcher model and the Text2FX baseline. For Text2FX, EQ and Reverb were applied as the FX chain for all examples.

Target Prompt | FX Chain (FxSearcher) | Audio Type
A ghostly child's voice echoing in a hallway | EQ, Reverb, PitchShift | Speech
A metallic robot's cold voice | EQ, Reverb, Delay, PitchShift | Speech
A monster speaking inside a cave | EQ, Reverb, Distortion, PitchShift, BitCrush | Speech
A powerful and destructive drum sound | EQ, Distortion, PitchShift | Instrumental
An electric guitar played underwater | EQ, PitchShift, BitCrush | Instrumental
A saxophone buzzing like it's trapped in an old radio | EQ, Reverb, Delay, PitchShift, BitCrush | Instrumental
A preacher's booming voice in a vast cathedral | EQ, Reverb, Distortion, BitCrush | Speech
A soldier shouting orders through a walkie-talkie | EQ, Reverb, Distortion | Speech

Each row also includes audio players for the Clean Audio, FxSearcher (Ours), and Text2FX examples.

How Bayesian Optimization Works

This project uses Bayesian Optimization to efficiently find the best effect parameters. Unlike grid search or random search, which are "blind," Bayesian Optimization intelligently decides which parameters to try next based on all previous results. It's designed to find the global optimum of expensive, black-box functions in as few steps as possible.

The process consists of two key components: a Surrogate Model and an Acquisition Function.

1. Surrogate Model (Gaussian Process)

We don't know the true, complex relationship \(f(x)\) between parameters \(x\) and the CLAP score. Bayesian Optimization builds a probabilistic model to approximate this relationship. This surrogate model, typically a Gaussian Process (GP), creates a "map of possibilities" based on the points we've already evaluated.

A GP models the function so that any finite collection of function values follows a multivariate normal distribution: $$ f(x) \sim \mathcal{GP}(\mu(x), k(x, x')) $$ Here, \(\mu(x)\) is the mean function (our prior belief about the score), and \(k(x, x')\) is the kernel, or covariance function, which models the similarity between points. After observing data \(D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}\), the GP is updated to a posterior distribution. This posterior gives us a prediction for any new point \(x\), providing both a predicted mean \(\mu_t(x)\) (the expected score) and a variance \(\sigma_t^2(x)\) (our uncertainty about that score).
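As a concrete sketch of this step, the snippet below fits a GP surrogate with scikit-learn's GaussianProcessRegressor and queries its posterior mean \(\mu_t(x)\) and standard deviation \(\sigma_t(x)\); the parameter layout and score values are illustrative, not taken from our experiments.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Observations so far: normalized FX parameters x_i and their CLAP scores y_i
# (values below are illustrative, not measured).
X_obs = np.array([[0.10, 0.30], [0.50, 0.80], [0.90, 0.20]])
y_obs = np.array([0.21, 0.34, 0.18])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)  # condition the GP on D_{1:t}

# Posterior mean mu_t(x) and standard deviation sigma_t(x) at new candidate points.
X_new = np.random.default_rng(0).uniform(size=(5, 2))
mu, sigma = gp.predict(X_new, return_std=True)
```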

2. Acquisition Function

The acquisition function guides the search by deciding which point \(x\) to evaluate next. It balances two competing goals:

  • Exploitation: Choosing points where the surrogate model predicts a high score (high predicted mean).
  • Exploration: Choosing points where the model is most uncertain (high variance), in hopes of discovering a new, unobserved peak.

A common acquisition function is Expected Improvement (EI). It calculates the expected amount of improvement over the best score found so far, \(y^+ = \max(y_1, \dots, y_t)\). The formula is: $$ \text{EI}(x) = (\mu_t(x) - y^+) \Phi(Z) + \sigma_t(x) \phi(Z) \quad \text{where} \quad Z = \frac{\mu_t(x) - y^+}{\sigma_t(x)} $$ Here, \(\Phi\) and \(\phi\) are the CDF and PDF of the standard normal distribution. This formula elegantly balances exploiting high-mean regions (the first term) and exploring high-variance regions (the second term).
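For reference, this closed form translates directly into code using SciPy's standard normal CDF and PDF; mu and sigma are the GP posterior mean and standard deviation at the candidate points, and y_best is \(y^+\).

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu: np.ndarray, sigma: np.ndarray, y_best: float) -> np.ndarray:
    """EI(x) = (mu - y_best) * Phi(Z) + sigma * phi(Z), with Z = (mu - y_best) / sigma."""
    sigma = np.maximum(sigma, 1e-12)  # guard against division by zero where the GP is certain
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```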

This project uses the Lower Confidence Bound (LCB) acquisition function, which provides a more direct way to control this trade-off: $$ \text{LCB}(x) = \mu_t(x) - \kappa \sigma_t(x) $$ By increasing the hyperparameter \(\kappa\), we can encourage the algorithm to be more 'adventurous' and prioritize exploration of uncertain regions. The next point to sample, \(x_{t+1}\), is the one that optimizes the acquisition function; with LCB, where the objective is framed as a minimization (e.g., minimizing the negative CLAP score), \(x_{t+1} = \text{argmin}_{x} \text{LCB}(x)\).
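Below is a minimal sketch of one LCB-guided selection step, assuming the surrogate gp has been fit on the quantity being minimized (for instance, the negative CLAP score) and that candidates are drawn from a simple random pool; the function name and sampling scheme are illustrative rather than the actual FxSearcher implementation.

```python
import numpy as np

def next_candidate(gp, bounds: np.ndarray, kappa: float = 2.0,
                   n_candidates: int = 2048, seed: int = 0) -> np.ndarray:
    """Pick x_{t+1} = argmin_x LCB(x) over a random candidate pool."""
    rng = np.random.default_rng(seed)
    # bounds has shape (d, 2): one [low, high] pair per FX parameter.
    X_cand = rng.uniform(bounds[:, 0], bounds[:, 1],
                         size=(n_candidates, bounds.shape[0]))
    mu, sigma = gp.predict(X_cand, return_std=True)
    lcb = mu - kappa * sigma          # larger kappa -> more weight on uncertain regions
    return X_cand[np.argmin(lcb)]
```

In the full loop, the chosen point is evaluated with the objective, the GP is refit on the enlarged dataset, and the step is repeated until the evaluation budget runs out.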