Anonymous Submission · Supplementary Material

NoiseFiT

Noise-Augmented Fine-Tuning for hallucination mitigation in LLMs.

v1.0 · 2026

A study of how injecting calibrated noise into hidden states during fine-tuning reshapes attention, entropy, and the truthfulness of large language models.
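The core idea, adding noise to hidden states during fine-tuning, can be sketched in a few lines. This is a minimal illustration only, assuming plain additive zero-mean Gaussian noise with a fixed standard deviation `sigma`. The function name `inject_noise`, the default value, and the use of a flat list of floats as a hidden state are all hypothetical; the calibration scheme used in the paper may differ.

```python
import random
from typing import List, Optional


def inject_noise(hidden: List[float], sigma: float = 0.01,
                 seed: Optional[int] = None) -> List[float]:
    """Return a perturbed copy of `hidden`.

    Assumption: additive, zero-mean Gaussian noise with standard deviation
    `sigma`, drawn independently per coordinate. This is an illustrative
    stand-in, not the paper's exact procedure.
    """
    rng = random.Random(seed)  # local generator, so a seed gives repeatable noise
    return [h + sigma * rng.gauss(0.0, 1.0) for h in hidden]


if __name__ == "__main__":
    clean = [0.5, -1.0, 2.0]
    noisy = inject_noise(clean, sigma=0.01, seed=0)
    print(clean, noisy)
```

With `sigma=0.0` the function returns the input unchanged, which makes it easy to compare a clean and a noisy forward pass through the same code path.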

This page collects the per-model analyses behind our paper “Noise Augmented Fine-Tuning for Mitigating Hallucinations in Large Language Models.”

For each of five model families, we score five independent generations per held-out test prompt using Grok 3.0, then probe the internal state geometry (entropy, sparsity, effective rank, hidden-state norms) across layers under a range of noise configurations.
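The internal metrics named above have common textbook forms, sketched here. The page does not give its exact definitions, so each function is an assumption: Shannon entropy of a probability vector, sparsity as the fraction of entries below a hypothetical threshold `eps`, effective rank as the exponential of the entropy of the normalised singular values, and the Euclidean norm of a hidden-state vector.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a probability vector; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sparsity(probs: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of entries below `eps` (the threshold is an assumed value)."""
    return sum(1 for p in probs if p < eps) / len(probs)


def effective_rank(singular_values: Sequence[float]) -> float:
    """exp(entropy) of the singular values normalised to sum to one.

    Takes precomputed, non-negative singular values so that the sketch
    needs no linear-algebra library.
    """
    total = sum(singular_values)
    return math.exp(shannon_entropy([s / total for s in singular_values]))


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean norm of a hidden-state vector."""
    return math.sqrt(sum(v * v for v in vec))


if __name__ == "__main__":
    attention_row = [0.7, 0.2, 0.0995, 0.0005]
    print(shannon_entropy(attention_row), sparsity(attention_row))
    print(effective_rank([3.0, 1.0, 0.1]), l2_norm([3.0, 4.0]))
```

Under these definitions, a uniform attention row has maximal entropy and zero sparsity, while a matrix with equal singular values has an effective rank equal to its dimension.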

Links are withheld to preserve anonymity under the peer-review guidelines.
Model Families: 5
Analysis Reports: 81
Internal Metrics: 8
Generations / Sample: 5
Headline · Results

Noise-augmented fine-tuning improves factual accuracy across every model family we tested.

We score five independent generations per prompt against ground-truth answers using Grok 3.0, then average across 208 held-out test items. Shown below for each family: the Base model, its plain fine-tune (Base FiT), and its best NoiseFiT configuration.
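The scoring procedure described above amounts to a two-level average, sketched here. This assumes the judge returns a binary correct/incorrect verdict for each generation, which the page does not state. The names `accuracy` and `uplift_pp` are hypothetical, and the example verdicts are made up for illustration.

```python
from typing import Sequence


def accuracy(verdicts: Sequence[Sequence[bool]]) -> float:
    """Mean over test items of the fraction of correct generations per item.

    `verdicts[i][j]` is True when generation j for item i was judged correct
    (binary judging is an assumption of this sketch).
    """
    per_item = [sum(item) / len(item) for item in verdicts]
    return sum(per_item) / len(per_item)


def uplift_pp(candidate: float, baseline: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return 100.0 * (candidate - baseline)


if __name__ == "__main__":
    # Three illustrative items, five generations each.
    base = [[True] * 5, [True, True, False, False, False], [False] * 5]
    tuned = [[True] * 5, [True, True, True, True, False], [True, False, False, False, False]]
    print(accuracy(base), accuracy(tuned), uplift_pp(accuracy(tuned), accuracy(base)))
```

Averaging per item first means every test item carries equal weight, which matters only if some items end up with fewer than five scored generations.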

Average uplift: +7.7pp (NoiseFiT vs Base)
Best uplift: +10.2pp (Llama 3B & Qwen 0.5B)
Families improved by NoiseFiT: 5/5
Configurations evaluated end-to-end: 39

NoiseFiT range vs baselines

For each architecture: all evaluated NoiseFiT configs as a range, compared against Base and Base FiT.

[Chart: one row per architecture on an axis from 0% to 100%. Series: Base, Base FiT, NoiseFiT min–max, NoiseFiT mean.]
Paper · Benchmarks & Ablations

Eight standard benchmarks. Five architectures.
One consistent direction of travel.

Beyond the in-house hallucination probe, we evaluated every NoiseFiT-trained model on a HELM-style public benchmark suite, ran a full loss-component ablation, and compared NoiseFiT head-to-head against existing noise-augmentation methods on Llama-2-7B / Alpaca.

Mean uplift per benchmark

Average relative improvement of the top-5 NoiseFiT configs over BaseFiT, taken across all five architectures.
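One way to compute such a figure for a single architecture and benchmark is sketched below. It assumes the top-5 configurations are ranked by their score on that same benchmark, which the page does not specify. The function name `mean_top_k_relative_uplift` and the example scores are hypothetical.

```python
from typing import Sequence


def mean_top_k_relative_uplift(config_scores: Sequence[float],
                               base_fit_score: float, k: int = 5) -> float:
    """Average of (score - baseline) / baseline over the k best configurations.

    Ranking by the benchmark score itself is an assumption of this sketch.
    """
    top = sorted(config_scores, reverse=True)[:k]
    return sum((s - base_fit_score) / base_fit_score for s in top) / len(top)


if __name__ == "__main__":
    # Illustrative scores for six configurations against a baseline of 0.50.
    scores = [0.50, 0.55, 0.60, 0.45, 0.52, 0.58]
    print(mean_top_k_relative_uplift(scores, base_fit_score=0.50))
```

The per-benchmark number would then be the mean of this quantity across the five architectures.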

Champion configuration per architecture

Best NoiseFiT recipe for each model family, evaluated on 8 public benchmarks. Δ is the absolute change vs BaseFiT.

(n, σ, r) = layers, std, H/L-SNR

Head-to-head vs noise-augmentation baselines

Llama-2-7B fine-tuned on Alpaca · checkpoint 3 · higher is better

All NoiseFiT variants beat NEFTune and R-Drop on average score.

Loss-component ablation

Each component matters: used in isolation, the KL or consistency loss collapses the model.

Diagnostic collapse: Consistency-only and KL-only trade ARC/HellaSwag/MMLU for misleadingly high TruthfulQA. They are listed as cautionary controls, not viable methods.
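A toy calculation illustrates why a KL-only or consistency-only objective can collapse. The sketch assumes a combined loss made of cross-entropy on the target, a KL term between clean and noisy output distributions, and a squared-difference consistency term between two noisy passes. The exact loss forms, the weights `w_ce`, `w_kl`, `w_cons`, and all function names are assumptions, not the paper's definitions.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q is positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def consistency(p: Sequence[float], q: Sequence[float]) -> float:
    """Mean squared difference between two output distributions (assumed form)."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)


def combined_loss(clean: Sequence[float], noisy_a: Sequence[float],
                  noisy_b: Sequence[float], target: int,
                  w_ce: float = 1.0, w_kl: float = 0.1, w_cons: float = 0.1) -> float:
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_ce * cross_entropy(noisy_a, target)
            + w_kl * kl_divergence(clean, noisy_a)
            + w_cons * consistency(noisy_a, noisy_b))


if __name__ == "__main__":
    # A degenerate model that always outputs the uniform distribution.
    uniform = [0.25, 0.25, 0.25, 0.25]
    print(kl_divergence(uniform, uniform), consistency(uniform, uniform))
    print(cross_entropy(uniform, target=2))
```

A model that ignores its input and always emits the same distribution drives both the KL and consistency terms to zero, so neither term alone rewards correct answers. Only the cross-entropy term penalises that degenerate solution.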

Training cost vs truthfulness

Llama-2-7B / Alpaca · runtime, peak GPU memory, and TruthfulQA MC2 by method.