Machine Learning
October 7, 2024

VLMs are blind but Groundlight can see

Dr. Paulina Varshavskaya

Head of R&D

Leo Dirac

CTO & Co-founder

On simple visual questions that utterly confuse GPT, Claude, and Gemini, Groundlight scientists show how their system excels


TL;DR: Groundlight scientists show how their system, which combines AI with escalation to live human oversight, can do a fantastic job on simple visual questions that utterly confuse GPT, Claude, and Gemini. While the world's most famous AIs struggle to tell you whether two circles touch or two lines cross, Groundlight answers these questions almost perfectly.

It has now been established that today's vision language models (VLMs) struggle with simple visual tasks that are straightforward for humans [1]: for example, telling how many times two lines cross, or whether two circles are touching. They rely heavily on the text-processing and generative abilities of their LLM components, and only interpret the visual information inside input images to a limited extent. The vision stack of a VLM seems to suffer from an inductive bias that pulls it away from interpreting novel visual content or reasoning spatially about straightforward but previously unseen images.

What about Groundlight detectors? Are they similarly blind? We ran two quick experiments to see how the Groundlight system behaves on binary questions from the BlindTest benchmark: the touching-circles task and a version of the crossing-lines task. These tasks are a simplified proxy for what is needed in real-world vision problems in manufacturing, retail, and robotics. For these kinds of embodied AI problems, where models interact with the physical world, it is not enough to rely on LLM-generated textual reasoning. Our models need to see: to correctly identify and analyze the visual changes that signify important application differences. As we show below, Groundlight models have no trouble with these tasks.

Touching Circles

We directly borrow the “touching” version of task 2 (“Two circles”) from the paper [1] and use the authors’ open-source code to generate the examples. We submit up to 500 labeled images of the task to a Groundlight detector, then test the resulting trained model on 5,000 new instances.

Detector query:  Are the two circles touching each other? Answer with Yes/No.

Our definition of ground truth for this task is taken directly from the paper [1]: “We consider two circles overlapping and touching if [the boundary-to-boundary distance] d < 0.0; non-overlapping but touching if d = 0.0; and non-overlapping & non-touching when d > 0.0. Random-baseline accuracy: 50%.”

Here we ignore the overlapping distinction and focus only on whether the circles touch.
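For concreteness, here is a minimal sketch of how that ground-truth label can be computed from the paper's definition; the helper name and the tuple-based circle representation are our own illustrative choices, not the benchmark's code.

```python
import math

def touching_label(center1, radius1, center2, radius2):
    """Ground truth for "Are the two circles touching each other?".

    Following [1], d is the boundary-to-boundary distance: the distance
    between the centers minus the sum of the radii. Since we ignore the
    overlapping distinction, any d <= 0 counts as touching ("YES").
    """
    center_distance = math.hypot(center1[0] - center2[0], center1[1] - center2[1])
    d = center_distance - (radius1 + radius2)
    return "YES" if d <= 0.0 else "NO"

# Centers 5 units apart with radii 2 and 3: the boundaries just touch.
print(touching_label((0, 0), 2.0, (5, 0), 3.0))  # -> YES
print(touching_label((0, 0), 2.0, (6, 0), 3.0))  # -> NO
```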

Training:

We generated 500 examples each of touching and non-touching circles with random orientation and distance between the circles, explicitly discarding any angles and distances that are part of the test set in [1]. We then created 4 Groundlight detectors with the text query “Are the two circles touching each other? Answer with Yes/No.” and a confidence threshold of 0.9, and sent the first 10, 50, 100, or 500 examples to each of them respectively.
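For reference, creating one of these detectors looks roughly like the sketch below using the Groundlight Python SDK; the detector name is illustrative, and exact call signatures may differ slightly between SDK versions.

```python
from groundlight import Groundlight

gl = Groundlight()  # assumes a GROUNDLIGHT_API_TOKEN is configured in the environment

# One detector per training-set size; the name here is illustrative.
detector = gl.get_or_create_detector(
    name="touching-circles-500",
    query="Are the two circles touching each other? Answer with Yes/No.",
    confidence_threshold=0.9,
)
```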

Unfortunately, Appendix B.6 of [1] where the authors fine-tune Bunny (Bunny-v1.1-Llama-3-8B-V [2]) on the two-circles task doesn’t specify exactly the training recipe. We only know that they trained on “10K, 20K, 50K, and 100K samples, each containing a balanced number of instances” of the Yes vs No class. We’re using significantly less data than that. 

Using Groundlight detectors as they were meant to be used, we only sent training labels for those images where the ML answer was under our specified confidence threshold of 0.9, so the real number of training examples is less than the nominal size of the dataset. 
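In SDK terms, that workflow looks roughly like the following: ask the detector first, and only add a human label when the returned confidence is below the threshold. Here training_examples is a hypothetical list of (image path, ground-truth label) pairs, and the confidence handling may need adjusting for your SDK version.

```python
# training_examples: list of (image_path, "YES"/"NO") pairs for this detector.
for image_path, ground_truth in training_examples:
    # wait=0 returns the current ML answer without waiting for human review.
    iq = gl.submit_image_query(detector=detector, image=image_path, wait=0)
    confidence = iq.result.confidence or 0.0
    if confidence < 0.9:
        # Only under-confident queries get a training label (and human review).
        gl.add_label(iq, ground_truth)
```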

Evaluation:

In [1], fine-tuned Bunny achieved up to 36.8% accuracy on the test set, a large improvement over its zero-shot performance of 11.7%.

How do the Groundlight detectors compare? Here are evaluation set accuracy results for the four detectors: 

With as few as 10 labeled examples, a Groundlight ML detector already achieves a decent 78% balanced accuracy on the test set (identical to the test set in [1]), and it exceeds 90% when fine-tuned with 50 labeled examples. You can also see the ML-reported confidence rising with the number of training examples. With only 10 training examples, every prediction falls under the detector confidence threshold of 0.9 and is escalated. This fraction drops to under 40% with a training set of 50, to 24% with 100, and to only 10% with 500 image queries sent to the detector. Note that this means the Groundlight detector sees significantly fewer training examples than the nominal number, since it only receives human labels for those examples where the ML model is not yet confident:

These are just the ML predictions. The accuracy of the full Groundlight system would be even higher than this, since all underconfident predictions would be escalated to a live person and answered correctly (with a slight delay). 
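A rough sketch of how the ML-only numbers above can be reproduced: run the held-out images through the detector, compute balanced accuracy over the ML answers, and count what fraction fell below the confidence threshold. The test_examples list and the label normalization are our own illustrative assumptions.

```python
from sklearn.metrics import balanced_accuracy_score

y_true, y_pred, confidences = [], [], []
for image_path, ground_truth in test_examples:  # 5,000 held-out (path, "YES"/"NO") pairs
    iq = gl.submit_image_query(detector=detector, image=image_path, wait=0)
    y_true.append(ground_truth)
    y_pred.append("YES" if "YES" in str(iq.result.label).upper() else "NO")
    confidences.append(iq.result.confidence or 0.0)

print("ML balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Fraction escalated:  ", sum(c < 0.9 for c in confidences) / len(confidences))
```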

Crossing Lines, Counting Shapes, Tracing Paths…

We can expect similar performance on binary QA versions of the other “VLMs are blind” tasks, such as counting line crossings, particular shapes, or the rows and columns of a grid. Watch this space for examples of counting detectors.

Here we run a binary version of the crossing-lines task from [1], creating a Groundlight detector with the query: “Are the red and blue lines crossing?”
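The ground truth here is straightforward computational geometry. Assuming each line is drawn as a polyline (a list of vertices), a standard orientation test over every pair of segments answers the question; the function below is our own sketch, not the benchmark's code.

```python
def lines_cross(red_pts, blue_pts):
    """Ground truth for "Are the red and blue lines crossing?", where each
    line is given as a list of (x, y) polyline vertices."""
    def orient(a, b, c):
        # Sign of the cross product (b - a) x (c - a).
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

    def on_seg(a, b, c):
        # For a point c collinear with a-b: is it between a and b?
        return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
                and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))

    def seg_cross(p1, p2, p3, p4):
        d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
        d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
        if d1 * d2 < 0 and d3 * d4 < 0:
            return True  # proper crossing
        # Endpoint-touching and collinear-overlap cases.
        return any(d == 0 and on_seg(a, b, c) for d, a, b, c in
                   [(d1, p3, p4, p1), (d2, p3, p4, p2),
                    (d3, p1, p2, p3), (d4, p1, p2, p4)])

    return any(seg_cross(red_pts[i], red_pts[i + 1], blue_pts[j], blue_pts[j + 1])
               for i in range(len(red_pts) - 1) for j in range(len(blue_pts) - 1))
```

Images where this returns True get the label “YES”, and the rest get “NO”, exactly as in the circles experiment.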

The plot below shows ML model errors over the lifetime of this detector, from the first image sent to the thousandth. At first, ML confidence is low and there are many mistakes, but within a hundred or so examples the model is answering the question confidently and correctly most of the time. Crucially, every point below 0.9 in confidence is escalated for review by a person, so all the under-confident mistakes are immediately corrected.

At the end of this experiment, the projected ML balanced accuracy regardless of confidence is approaching 93% (obtained by 4-fold cross-validation on the ground truth set), and the fraction of queries escalated for review is reduced to under 10%: 

As a not-quite-apples-to-apples comparison, the best zero-shot accuracy on the crossing-lines task in [1] was 75.36% (Sonnet-3.5).

Pick the best tool for the job

So VLMs are blind to simple visual differences in these tasks, and may not even be able to learn them through fine-tuning, but we can easily fine-tune Groundlight detectors to good levels of accuracy using just a few tens of examples.

In a way, these results are completely unsurprising: simple CNN-based image classification models have been around for decades. The important thing we’re demonstrating with these experiments is that, as a computer vision practitioner or user, you have to pick the right tool for the job.

Groundlight detectors make this model selection seamless for you: detectors implement a set of diverse modern ML pipelines which are continuously evaluated. At any one time the best one for your specific task will be actively answering your image queries. 

Take Groundlight for a spin for free and see for yourself.

You can also schedule a free consultation to learn how you can grow your business using computer vision.

References:

[1] Rahmanzadehgervi, P., Bolton, L., Taesiri, M.R., Nguyen, A.T.: Vision Language Models Are Blind. (2024) https://arxiv.org/abs/2407.06581

[2] He, M., Liu, Y., Wu, B., Yuan, J., Wang, Y., Huang, T., Zhao, B.: Efficient multimodal learning from data-centric perspective. (2024) https://arxiv.org/abs/2402.11530 
