Benchmark Results
Comparative results between our FxSearcher model and the Text2FX baseline. For Text2FX, EQ and Reverb were applied as the FX chain for all examples.
| Target Prompt | FX Chain (FxSearcher) | Audio Type | Clean Audio | FxSearcher (Ours) | Text2FX |
|---|---|---|---|---|---|
| A ghostly child's voice echoing in a hallway | EQ, Reverb, PitchShift | Speech | |||
| A metallic robot's cold voice | EQ, Reverb, Delay, PitchShift | Speech | |||
| A monster speaking inside a cave | EQ, Reverb, Distortion, PitchShift, BitCrush | Speech | |||
| A powerful and destructive drum sound | EQ, Distortion, PitchShift | Instrumental | |||
| An electric guitar played underwater | EQ, PitchShift, BitCrush | Instrumental | |||
| A saxophone buzzing like it's trapped in an old radio | EQ, Reverb, Delay, PitchShift, BitCrush | Instrumental | |||
| A preacher's booming voice in a vast cathedral | EQ, Reverb, Distortion, BitCrush | Speech | |||
| A soldier shouting orders through a walkie-talkie | EQ, Reverb, Distortion | Speech | | | |
How Bayesian Optimization Works
This project uses Bayesian Optimization to efficiently find the best effect parameters. Unlike grid search or random search, which are "blind," Bayesian Optimization intelligently decides which parameters to try next based on all previous results. It's designed to find the global optimum of expensive, black-box functions in as few steps as possible.
The process consists of two key components: a Surrogate Model and an Acquisition Function.
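Before detailing each component, the sketch below shows the loop they form end to end. It is a toy, self-contained example: a made-up 1-D objective and scikit-learn's GP regressor stand in for the real FX-rendering and CLAP-scoring step, and the candidate search is a simple random sweep rather than the project's actual optimizer.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy 1-D stand-in for the expensive objective ("apply FX with parameter x,
# then score the result with CLAP") -- not the project's actual code.
def toy_clap_score(x):
    return float(np.sin(3.0 * x) - 0.1 * x ** 2)

rng = np.random.default_rng(0)
X = list(rng.uniform(-2.0, 2.0, size=3))   # a few random points to seed the surrogate
y = [toy_clap_score(x) for x in X]

for _ in range(20):
    # 1. Refit the surrogate (a Gaussian Process) on everything seen so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(X).reshape(-1, 1), y)

    # 2. Pick the next point by optimizing an acquisition function
    #    (here a simple upper-confidence-bound score over random candidates).
    candidates = rng.uniform(-2.0, 2.0, size=(256, 1))
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = float(candidates[int(np.argmax(mu + 2.0 * sigma)), 0])

    # 3. Evaluate the expensive objective and add the result to the dataset.
    X.append(x_next)
    y.append(toy_clap_score(x_next))

best = int(np.argmax(y))
print(f"best score {y[best]:.3f} at x = {X[best]:.3f}")
```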
1. Surrogate Model (Gaussian Process)
We don't know the true, complex relationship \(f(x)\) between parameters \(x\) and the CLAP score. Bayesian Optimization builds a probabilistic model to approximate this relationship. This surrogate model, typically a Gaussian Process (GP), creates a "map of possibilities" based on the points we've already evaluated.
A GP treats the function values at any finite set of points as jointly Gaussian, written as: $$ f(x) \sim \mathcal{GP}(\mu(x), k(x, x')) $$ Here, \(\mu(x)\) is the mean function (our prior belief about the score), and \(k(x, x')\) is the kernel, or covariance function, which models the similarity between points. After observing data \(D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}\), the GP is updated to a posterior distribution. This posterior gives us a prediction for any new point \(x\), providing both a predicted mean \(\mu_t(x)\) (the expected score) and a variance \(\sigma_t^2(x)\) (our uncertainty about that score).
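As a concrete illustration, the posterior mean \(\mu_t(x)\) and standard deviation \(\sigma_t(x)\) can be queried from a fitted GP. The snippet below uses scikit-learn's `GaussianProcessRegressor` purely as an example backend, with made-up observations standing in for previously tried parameter settings and their CLAP scores:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical observations D_{1:t}: parameter settings already tried and
# the CLAP scores they produced (made-up numbers for illustration).
X_obs = np.array([[0.1], [0.4], [0.8]])
y_obs = np.array([0.21, 0.35, 0.30])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), normalize_y=True)
gp.fit(X_obs, y_obs)  # condition the GP on the observed data

# Posterior prediction at a new candidate point x: mu_t(x) and sigma_t(x).
x_new = np.array([[0.6]])
mu, sigma = gp.predict(x_new, return_std=True)
print(f"predicted score: {mu[0]:.3f} +/- {sigma[0]:.3f}")
```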
2. Acquisition Function
The acquisition function guides the search by deciding which point \(x\) to evaluate next. It balances two competing goals:
- Exploitation: Choosing points where the surrogate model predicts a high score (a high predicted mean).
- Exploration: Choosing points where the model is most uncertain (high variance), in hopes of discovering a new, unobserved peak.
A common acquisition function is Expected Improvement (EI). It calculates the expected amount of improvement over the best score found so far, \(y^+ = \max(y_1, \dots, y_t)\). The formula is: $$ \text{EI}(x) = (\mu_t(x) - y^+) \Phi(Z) + \sigma_t(x) \phi(Z) \quad \text{where} \quad Z = \frac{\mu_t(x) - y^+}{\sigma_t(x)} $$ Here, \(\Phi\) and \(\phi\) are the CDF and PDF of the standard normal distribution. This formula elegantly balances exploiting high-mean regions (the first term) and exploring high-variance regions (the second term).
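In code, EI is a direct translation of this formula (a sketch, not the project's implementation):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, eps=1e-12):
    """EI(x) = (mu - y_best) * Phi(Z) + sigma * phi(Z), with Z = (mu - y_best) / sigma."""
    sigma = np.maximum(sigma, eps)          # guard against zero uncertainty
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Example: two candidates with the same mean; the more uncertain one gets higher EI.
print(expected_improvement(np.array([0.30, 0.30]), np.array([0.01, 0.10]), y_best=0.32))
```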
This project uses the Lower Confidence Bound (LCB) acquisition function, which provides a more direct way to control this trade-off: $$ \text{LCB}(x) = \mu_t(x) - \kappa \sigma_t(x) $$ By increasing the hyperparameter \(\kappa\), we can encourage the algorithm to be more 'adventurous' and prioritize exploration of uncertain regions. Because LCB is framed for minimization, the optimizer works on the negated CLAP score, and the next point to sample is the one that minimizes the acquisition function: \(x_{t+1} = \operatorname{argmin}_{x} \text{LCB}(x)\), which is equivalent to maximizing the upper confidence bound \(\mu_t(x) + \kappa \sigma_t(x)\) of the score.
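If the search is run with a library such as scikit-optimize (assumed here purely for illustration; the toy objective and bounds below are placeholders, not the project's actual search space), LCB and \(\kappa\) are exposed directly:

```python
from skopt import gp_minimize

# Toy stand-in for "render the FX chain with these parameters and return the
# CLAP score"; the real objective is the expensive black-box described above.
def neg_clap_score(params):
    gain, mix = params
    score = 1.0 - (gain - 0.3) ** 2 - (mix - 0.7) ** 2
    return -score  # gp_minimize minimizes, so return the negated score

result = gp_minimize(
    neg_clap_score,
    dimensions=[(0.0, 1.0), (0.0, 1.0)],  # example parameter bounds
    acq_func="LCB",                       # LCB(x) = mu(x) - kappa * sigma(x)
    kappa=1.96,                           # larger kappa -> more exploration
    n_calls=30,
    random_state=0,
)
print("best CLAP score (toy):", -result.fun)
```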