Contrastive Cross-Lingual Calibration for Large Language Models
DOI:
https://doi.org/10.64748/8gkysx36

Keywords:
calibration, multilingual NLP, cross-lingual transfer, uncertainty, hallucination, evaluation

Abstract
Large language models (LLMs) are increasingly deployed in multilingual settings, yet their probability estimates are often miscalibrated, particularly for low-resource languages and code-switched inputs. We present C³, a post-hoc calibration framework that reduces cross-lingual miscalibration by optimizing language-aware temperature and bias parameters using contrastive counterfactuals generated via translation/back-translation and meaning-preserving perturbations. C³ aligns confidence across languages without retraining the base model. On classification (XNLI) and extractive/generative QA (XQuAD, MLQA, TyDi QA GoldP), C³ lowers Expected Calibration Error by 35–57% and Brier score by 9–18%, with modest accuracy gains (0.7–2.1 pp). For generative QA, the hallucination rate decreases by 21% while answer quality is maintained. Benefits are largest for Swahili and Arabic and persist under code-switching and spelling noise. Ablations show that contrastive counterfactuals and language-specific scaling both contribute, and that isotonic fusion improves the tails of the confidence distribution. We release calibration recipes and evaluation scripts to support responsible multilingual deployment.
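To illustrate the post-hoc, language-aware scaling described in the abstract, the following is a minimal sketch in Python. It fits only a per-language temperature and per-class bias by negative log-likelihood on held-out examples; it does not implement the contrastive counterfactual objective or the isotonic fusion step, and the function names (`fit_language_scaling`, `calibrated_log_probs`) and the per-class reading of the bias parameters are illustrative assumptions, not the paper's released recipe.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_softmax

def fit_language_scaling(logits, labels, langs):
    """Fit a per-language temperature and per-class bias by minimizing NLL
    on held-out data. Simplified stand-in for the full C3 objective, which
    additionally uses contrastive counterfactual pairs.

    logits: (n_examples, n_classes) raw model logits
    labels: (n_examples,) gold class indices
    langs:  length-n_examples list of language codes
    """
    num_classes = logits.shape[1]
    params = {}
    for lang in sorted(set(langs)):
        idx = np.array([i for i, l in enumerate(langs) if l == lang])
        z, y = logits[idx], labels[idx]

        def nll(theta):
            temp, bias = theta[0], theta[1:]
            scaled = z / max(temp, 1e-3) + bias   # language-aware scaling
            logp = log_softmax(scaled, axis=-1)
            return -logp[np.arange(len(y)), y].mean()

        init = np.concatenate([[1.0], np.zeros(num_classes)])
        res = minimize(nll, init, method="Nelder-Mead")
        params[lang] = (res.x[0], res.x[1:])
    return params

def calibrated_log_probs(logits, lang, params):
    """Apply the fitted parameters for the example's language at inference;
    unseen languages fall back to the identity transform."""
    temp, bias = params.get(lang, (1.0, np.zeros(logits.shape[-1])))
    return log_softmax(logits / max(temp, 1e-3) + bias, axis=-1)
```

Because the scaling is fit separately for each language on held-out data and applied only at inference, the base model's weights and predictions are left untouched, which is what allows confidence to be aligned across languages without retraining.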
License
Copyright (c) 2025 Marcello Conti (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.