Benchmark Results
Comparative results between our FxSearcher model and the Text2FX baseline. For Text2FX, EQ and Reverb were applied as the FX chain for all examples.
| Target Prompt | FX Chain (FxSearcher) | Audio Type | Clean Audio | FxSearcher (Ours) | Text2FX |
|---|---|---|---|---|---|
| A ghostly child's voice echoing in a hallway | EQ, Reverb, PitchShift | Speech | |||
| A metallic robot's cold voice | EQ, Reverb, Delay, PitchShift | Speech | |||
| A monster speaking inside a cave | EQ, Reverb, Distortion, PitchShift, BitCrush | Speech | |||
| A powerful and destructive drum sound | EQ, Distortion, PitchShift | Instrumental | |||
| An electric guitar played underwater | EQ, PitchShift, BitCrush | Instrumental | |||
| A saxophone buzzing like it's trapped in an old radio | EQ, Reverb, Delay, PitchShift, BitCrush | Instrumental | |||
| A preacher's booming voice in a vast cathedral | EQ, Reverb, Distortion, BitCrush | Speech | |||
| A soldier shouting orders through a walkie-talkie | EQ, Reverb, Distortion | Speech | | | |
How Bayesian Optimization Works
This project uses Bayesian Optimization to efficiently find the best effect parameters. Unlike grid search or random search, which are "blind," Bayesian Optimization intelligently decides which parameters to try next based on all previous results. It's designed to find the global optimum of expensive, black-box functions in as few steps as possible.
The process consists of two key components: a Surrogate Model and an Acquisition Function.
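Before detailing each component, the sketch below shows the loop they form end to end. It is a toy, self-contained example: a made-up 1-D objective and scikit-learn's GP regressor stand in for the real FX-rendering and CLAP-scoring step, and the candidate search is a simple random sweep rather than the project's actual optimizer.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy 1-D stand-in for the expensive objective ("apply FX with parameter x,
# then score the result with CLAP") -- not the project's actual code.
def toy_clap_score(x):
    return float(np.sin(3.0 * x) - 0.1 * x ** 2)

rng = np.random.default_rng(0)
X = list(rng.uniform(-2.0, 2.0, size=3))   # a few random points to seed the surrogate
y = [toy_clap_score(x) for x in X]

for _ in range(20):
    # 1. Refit the surrogate (a Gaussian Process) on everything seen so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(X).reshape(-1, 1), y)

    # 2. Pick the next point by optimizing an acquisition function
    #    (here a simple upper-confidence-bound score over random candidates).
    candidates = rng.uniform(-2.0, 2.0, size=(256, 1))
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = float(candidates[int(np.argmax(mu + 2.0 * sigma)), 0])

    # 3. Evaluate the expensive objective and add the result to the dataset.
    X.append(x_next)
    y.append(toy_clap_score(x_next))

best = int(np.argmax(y))
print(f"best score {y[best]:.3f} at x = {X[best]:.3f}")
```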
1. Surrogate Model (Gaussian Process)
We don't know the true, complex relationship \(f(x)\) between parameters \(x\) and the CLAP score. Bayesian Optimization builds a probabilistic model to approximate this relationship. This surrogate model, typically a Gaussian Process (GP), creates a "map of possibilities" based on the points we've already evaluated.
A GP treats the function values at any finite set of points as jointly Gaussian, written as: $$ f(x) \sim \mathcal{GP}(\mu(x), k(x, x')) $$ Here, \(\mu(x)\) is the mean function (our prior belief about the score), and \(k(x, x')\) is the kernel, or covariance function, which models the similarity between points. After observing data \(D_{1:t} = \{(x_1, y_1), \dots, (x_t, y_t)\}\), the GP is updated to a posterior distribution. This posterior gives us a prediction for any new point \(x\), providing both a predicted mean \(\mu_t(x)\) (the expected score) and a variance \(\sigma_t^2(x)\) (our uncertainty about that score).
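As a concrete illustration, the posterior mean \(\mu_t(x)\) and standard deviation \(\sigma_t(x)\) can be queried from a fitted GP. The snippet below uses scikit-learn's `GaussianProcessRegressor` purely as an example backend, with made-up observations standing in for previously tried parameter settings and their CLAP scores:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical observations D_{1:t}: parameter settings already tried and
# the CLAP scores they produced (made-up numbers for illustration).
X_obs = np.array([[0.1], [0.4], [0.8]])
y_obs = np.array([0.21, 0.35, 0.30])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), normalize_y=True)
gp.fit(X_obs, y_obs)  # condition the GP on the observed data

# Posterior prediction at a new candidate point x: mu_t(x) and sigma_t(x).
x_new = np.array([[0.6]])
mu, sigma = gp.predict(x_new, return_std=True)
print(f"predicted score: {mu[0]:.3f} +/- {sigma[0]:.3f}")
```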
2. Acquisition Function
The acquisition function guides the search by deciding which point \(x\) to evaluate next. It balances two competing goals:
- Exploitation: Choosing points where the surrogate model predicts a high score (a high predicted mean).
- Exploration: Choosing points where the model is most uncertain (high variance), in hopes of discovering a new, unobserved peak.
A common acquisition function is Expected Improvement (EI). It calculates the expected amount of improvement over the best score found so far, \(y^+ = \max(y_1, \dots, y_t)\). The formula is: $$ \text{EI}(x) = (\mu_t(x) - y^+) \Phi(Z) + \sigma_t(x) \phi(Z) \quad \text{where} \quad Z = \frac{\mu_t(x) - y^+}{\sigma_t(x)} $$ Here, \(\Phi\) and \(\phi\) are the CDF and PDF of the standard normal distribution. This formula elegantly balances exploiting high-mean regions (the first term) and exploring high-variance regions (the second term).
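In code, EI is a direct translation of this formula (a sketch, not the project's implementation):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, eps=1e-12):
    """EI(x) = (mu - y_best) * Phi(Z) + sigma * phi(Z), with Z = (mu - y_best) / sigma."""
    sigma = np.maximum(sigma, eps)          # guard against zero uncertainty
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Example: two candidates with the same mean; the more uncertain one gets higher EI.
print(expected_improvement(np.array([0.30, 0.30]), np.array([0.01, 0.10]), y_best=0.32))
```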
This project uses the Lower Confidence Bound (LCB) acquisition function, which provides a more direct way to control this trade-off: $$ \text{LCB}(x) = \mu_t(x) - \kappa \sigma_t(x) $$ By increasing the hyperparameter \(\kappa\), we can encourage the algorithm to be more 'adventurous' and prioritize exploration of uncertain regions. Because LCB is framed for minimization, the optimizer works on the negated CLAP score, and the next point to sample is the one that minimizes the acquisition function: \(x_{t+1} = \operatorname{argmin}_{x} \text{LCB}(x)\), which is equivalent to maximizing the upper confidence bound \(\mu_t(x) + \kappa \sigma_t(x)\) of the score.
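If the search is run with a library such as scikit-optimize (assumed here purely for illustration; the toy objective and bounds below are placeholders, not the project's actual search space), LCB and \(\kappa\) are exposed directly:

```python
from skopt import gp_minimize

# Toy stand-in for "render the FX chain with these parameters and return the
# CLAP score"; the real objective is the expensive black-box described above.
def neg_clap_score(params):
    gain, mix = params
    score = 1.0 - (gain - 0.3) ** 2 - (mix - 0.7) ** 2
    return -score  # gp_minimize minimizes, so return the negated score

result = gp_minimize(
    neg_clap_score,
    dimensions=[(0.0, 1.0), (0.0, 1.0)],  # example parameter bounds
    acq_func="LCB",                       # LCB(x) = mu(x) - kappa * sigma(x)
    kappa=1.96,                           # larger kappa -> more exploration
    n_calls=30,
    random_state=0,
)
print("best CLAP score (toy):", -result.fun)
```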