Machine Learning
October 13, 2024

Groundlight Pushes Boundaries of Visual LLMs with New Research

Groundlight Staff

Exploring how visual language models (VLMs) can adapt to real-world problems with ambiguous instructions


Teaching AI to answer "Is anything wrong?"

At Groundlight, we care a lot about the boundaries of what AI can do in the realm of visual understanding. Our latest research paper explores a fascinating question: Can AI learn to analyze images and spot problems just by looking at examples, much like a human expert would? To do this, we created a new kind of visual reasoning task for VLMs (LLMs that understand images), one that had never been tried before.

Imagine you're training a new quality control inspector. You might show them a series of images, pointing out which ones have issues and which ones don't. Over time, they learn to spot problems on their own, even in new situations. Our research aims to teach AI in a similar way.

Traditional computer vision often can't handle open-ended questions like "Is anything wrong with this image?" These questions are ambiguous and highly context-dependent. What constitutes "wrong" can vary dramatically based on the specific application, industry standards, or even the particular product being examined.

Our study focused on teaching AI models to handle these ambiguous queries through visual demonstrations. We showed the AI numerous examples of images we created in a controlled way, each paired with a simple yes/no answer to a question that, on its own, doesn't contain enough information to answer, like "is everything okay?" or "is the wrench in the right place?" The only way anyone can answer questions like these is by looking at examples and inferring what "okay" and "right place" mean. That is categorically different from questions whose answers can be guessed from common sense, background knowledge, or context, like "what year was this car released?"

Why do we need to know whether AI can answer questions specifically from visual examples? In complex commercial and industrial settings, only the people who work there understand what's going on and what things should look like. Today's AI models don't stand a chance of understanding a production line just by looking at it. So we created these tasks to simulate a simplified version of that situation, ensuring the AI can't understand the question from context alone, because the information just isn't there.
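To make this concrete, here is a minimal sketch of few-shot visual prompting in Python, using the OpenAI client as a stand-in VLM. The image URLs, labels, and question are hypothetical placeholders, and the paper's actual experimental setup differs; this only illustrates the idea of answering an underspecified question purely from labeled example images.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical labeled demonstrations: (image_url, "yes"/"no") pairs.
    # The question alone is unanswerable; the model must infer what
    # "okay" means from these examples.
    examples = [
        ("https://example.com/station_ok_1.jpg", "yes"),
        ("https://example.com/station_bad_1.jpg", "no"),
        ("https://example.com/station_ok_2.jpg", "yes"),
    ]
    query_image = "https://example.com/station_new.jpg"
    question = "Is everything okay in this image?"

    # Interleave example images with their yes/no answers, then append
    # the query image for the model to judge.
    content = [{"type": "text",
                "text": f"{question} Answer yes or no. Labeled examples follow:"}]
    for url, label in examples:
        content.append({"type": "image_url", "image_url": {"url": url}})
        content.append({"type": "text", "text": f"Answer: {label}"})
    content.append({"type": "text", "text": "Now answer for this image:"})
    content.append({"type": "image_url", "image_url": {"url": query_image}})

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{"role": "user", "content": content}],
    )
    print(response.choices[0].message.content)  # expected: "yes" or "no"

The key point is that nothing in the question itself defines "okay"; the only usable signal the model gets is the labeled examples.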

Practical Applications of Computer Vision

This research has exciting implications for various sectors. Let's explore how making AI fully visually aware could transform operations in different fields:

Manufacturing and Process Control

In a production environment, AI could assist in quality control by answering questions like, "Does this welding look correct?" or "Is this component aligned properly?" The AI doesn't need to understand welding techniques or alignment specifications; it learns from examples of what "correct" and "incorrect" look like.

Food Service and Quick Service Restaurants (QSR)

For food service, AI could help maintain consistency by addressing queries such as, "Is this burger assembled correctly?" or "Does this meal presentation meet our standards?" This could help ensure quality even during busy periods or with new staff.

Retail

In retail settings, AI could aid in visual merchandising and inventory management. It could answer questions like, "Is this end-cap display set up properly?" or "Are these shelves stocked according to our planogram?" This could help maintain brand standards across multiple locations.

Logistics

In warehouses and distribution centers, AI could help with package handling and storage compliance. It could address queries such as, "Is this pallet stacked safely?" or "Is this hazardous material stored according to SOP?" This could enhance safety and efficiency in fast-paced logistics environments.

The Future of Robotic Perception

This research also has exciting implications for robotics. As robots become more prevalent in various industries, their ability to understand and interpret their environment becomes crucial.

Imagine a mobile inspection robot roaming a factory floor. With an AI model that can figure out the boundaries of correctness from visual examples, it could answer complex, context-dependent questions about its surroundings. "Is there anything unusual in this area?" or "Does this equipment need maintenance?" The robot doesn't need pre-programmed rules for every situation; it learns from examples and improves over time.

This capability could dramatically enhance the versatility and usefulness of robots in industrial settings. It bridges the gap between rigid, rule-based systems and the flexibility of human perception.

(For those interested in the technical details, Groundlight is available through our ROS node, enabling easy integration with existing robotic systems.)

Combining AI and Human Insight

At Groundlight, we're working to make this technology accessible and practical for businesses. Our approach combines the speed and scalability of AI with the nuanced expertise of professional humans.

Here's how it works:

  1. You ask a question about an image in plain English.
  2. Our AI analyzes the image and attempts to answer.
  3. If the AI is unsure, the question is routed to a live human monitor.
  4. The system learns from each interaction, continuously improving its performance.

This hybrid approach allows for handling complex, real-world scenarios while continuously enhancing the AI's capabilities.
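For developers, here is a minimal sketch of that flow using Groundlight's Python SDK. The detector name, question, and image path are placeholders, so treat this as illustrative and consult the SDK documentation for exact signatures.

    from groundlight import Groundlight

    gl = Groundlight()  # assumes GROUNDLIGHT_API_TOKEN is set in the environment

    # 1. Pose a plain-English yes/no question as a detector.
    detector = gl.get_or_create_detector(
        name="wrench-placement",  # placeholder name
        query="Is the wrench in the right place?",
    )

    # 2-3. Submit an image; if the model is unsure, the query is
    # escalated to a live human monitor behind the scenes.
    image_query = gl.submit_image_query(detector=detector, image="frame.jpg")

    # 4. Read the answer; labeled answers feed back into training.
    print(f"Answer: {image_query.result.label}")

Steps 3 and 4 happen automatically: low-confidence queries are routed to human monitors, and their answers become training data that continuously improves the detector.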

Want to see how this technology could work in your operations? We'd love to show you.

Reach out to us and we can discuss how to address your visual inspection needs.
