In recent years, foundation Vision-Language Models (VLMs), such as CLIP [1], which enable zero-shot transfer to a wide variety of domains without fine-tuning, have led to a significant shift in ...