Ask any Zebra how they feel about robots and they will probably geek out a bit. We’re huge fans, mainly because we’ve seen them in living color. They aren’t a sci-fi concept or even “the way of the future.” They are here now, in the present day, helping us solve very real human problems and lifting a huge weight off the shoulders of many essential front-line workers, sometimes quite literally. And we believe the human-robot relationship will only grow stronger as the days go on. That’s why we take every opportunity to expand our research in this area and support the research of others.
For example, when we heard the Centre for Intelligent Sensing at Queen Mary University of London was hosting a CORSMAL Challenge focused on Audio-Visual Object Classification for Human-Robot Collaboration, we immediately asked how we could get involved. We even donated the £1,800 cash prize to the winning team.
Going Where No Human (or Robot) Has Gone Before
CORSMAL is exploring the fusion of multiple sensing modalities (touch, sound, and first- and third-person vision) to accurately and robustly estimate the physical properties of objects in noisy and potentially ambiguous environments.
For this specific Human-Robot Collaboration Challenge, computer vision researchers from around the world were brought together to look at ways to improve how humans and robots work together on daily tasks in home and work settings. The goal was to understand how robots could be of particular benefit for people with physical disabilities or frailty due to old age.
We connected with Andrea Mirabile, Senior Manager for Computer Vision within Zebra Technologies’ Chief Technology Office, to understand why this concerted effort was so important and how it was different from other investigations into human-robot collaboration. Here’s what he had to say:
Your Edge Blog Team: We understand you were able to attend the Challenge workshop and witness some of the great work done by the teams. Can you tell us a little bit about the experience?
Andrea: The Challenge required teams to use computer vision algorithms to sense and analyze audio-visual recordings of different people holding and filling containers (glasses and boxes) with contents such as water, rice and pasta at varying speeds and fill levels.
Your Edge Blog Team: What was the purpose of this particular Challenge?
Andrea: The CORSMAL Challenge is about using audio-visual data to understand the different physical properties of household containers for food and drinks. These containers are manipulated by a person prior to a human-robot collaboration and could be relevant for assistive scenarios at home or in a workplace. Containers vary in their shape, size, weight, or appearance, or they can be occluded by the human hand during manipulation. The Challenge provides a benchmarking framework that supports the design and evaluation of novel audio-visual solutions for the estimation of the physical properties of containers in these challenging scenarios.
Your Edge Blog Team: In other words, the analyzed recordings were fed into a virtual environment shared by a robot arm and a human, to guide and improve robotic actions that help the person hold, move, and fill the containers.
Your Edge Blog Team: Is this something you envision you could see coming to life as an autonomous mobile robot (AMR) solution?
Andrea: In the future, definitely! Recent, fast advances in machine learning and artificial intelligence (AI) have created an expectation for robots to seamlessly operate in the real world by accurately and robustly perceiving and understanding dynamic environments, including the actions and intentions of humans. However, several challenges in audio-visual perception and modelling humans’ hard-to-predict behaviors hamper the deployment of robots in real-world scenarios.
Your Edge Blog Team: Why was Zebra so interested in supporting this Challenge as an observer and prize sponsor if the focus was more on consumer applications, such as helping people with daily living activities?
Andrea: The reason is twofold:
We want to incentivize academia and the AI community to focus on less-explored core challenges to accelerate the deployment of more complex applications of human-robot interaction into the real world.
We also want to create a smoother transition and a stronger link between emerging technologies and trends in academia and industry.
It’s also important that we engage with academia and nurture future talent, which is a core part of Zebra’s R&D culture. The Challenge was actually organized by leading computer vision academics from universities in the UK, France, Switzerland, USA and Australia, and the three teams were made up of master’s and PhD researchers from the UK, Japan, and Italy. The workshop featured live presentations, open discussion with Q&A, and the announcement of the winning team. So, it wasn’t just about observing the competition.
Our participation enabled us to tap into the knowledge base of so many different experts in this field. Most of the teams leveraged only visual information (video data) and ignored the audio. Unlike text-image multi-modal learning, which has been a hot topic in the last couple of years, audio-visual understanding seems to be a less mature area.
Plus, these types of challenges help us gauge the state of current technologies and identify core research areas in which we should invest organically and collaborate with universities.
Your Edge Blog Team: We understand the teams’ computer vision solutions were judged against criteria including accuracy, robustness, safety, and adaptivity to changing conditions. Can you talk a bit about the types of solutions they demonstrated, how they performed, and what key learnings you were able to take away from their success – or perhaps their failure – in some areas?
Andrea: The Challenge focused on estimating the capacity, dimensions, and mass of each container; the type, mass, and level of the filling (the percentage of the container occupied by its content); and the overall mass of the container plus filling. This information could then be passed to a robot so it can grasp the container.
The teams developed very innovative solutions, achieving very promising results. The main task was broken down into the following sub-tasks and multiple specialized models were deployed to solve each subtask:
Task 1: Filling level classification.
The goal is to classify the filling level of each configuration as empty, half full, or full (i.e., 90%).
Task 2: Filling type classification.
The goal is to classify the type of filling, if any, as one of these classes: 0 (no content), 1 (pasta), 2 (rice), 3 (water), for each configuration.
Task 3: Container capacity estimation.
The goal is to estimate the capacity of the container for each configuration.
Task 4: Container mass estimation.
The goal is to estimate the mass of the (empty) container for each configuration.
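To make the structure of these sub-tasks concrete, here is a minimal sketch of what a per-configuration prediction record might look like. The class lists follow the Challenge description above; the function and variable names are illustrative assumptions, not any team's actual code.

```python
# Illustrative sketch only: names and numbers are assumptions, not from
# any competing team's solution.

FILLING_LEVELS = ["empty", "half-full", "full"]      # Task 1 classes (full = 90%)
FILLING_TYPES = ["none", "pasta", "rice", "water"]   # Task 2 classes

def predict_configuration(level_scores, type_scores, capacity_ml, container_mass_g):
    """Bundle the four sub-task outputs into one record a robot could use.

    level_scores / type_scores are per-class confidence lists, assumed to
    come from specialized models (one per sub-task, as in the Challenge);
    capacity and container mass are regression outputs (Tasks 3 and 4).
    """
    level = FILLING_LEVELS[level_scores.index(max(level_scores))]
    filling = FILLING_TYPES[type_scores.index(max(type_scores))]
    return {
        "filling_level": level,                # Task 1
        "filling_type": filling,               # Task 2
        "capacity_ml": capacity_ml,            # Task 3
        "container_mass_g": container_mass_g,  # Task 4
    }
```

Each sub-task's output here is independent; in practice the estimates can also feed into one another (for example, filling level plus capacity gives the filling mass).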
In addition, most of the teams did not leverage the audio data.
Your Edge Blog Team: What made the winning solution stand out?
Andrea: The winning team addressed the filling type classification with audio data and then combined this audio information with the video modality to address the filling level classification. For the container capacity, dimension, and mass estimation, they presented a data augmentation and consistency-measurement approach to alleviate over-fitting in the given dataset, caused by the limited number of containers. The winning solution stood out for the improved ability of its models to generalize to containers that were not seen during training.
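One common way to picture the kind of audio-video combination Andrea describes is late fusion: each modality produces per-class probabilities and the two are averaged before picking a class. This is a generic sketch under that assumption, not the winning team's actual method; the weights and class names are illustrative.

```python
def late_fuse(audio_probs, video_probs, w_audio=0.5):
    """Weighted average of per-class probabilities from the two modalities."""
    return [w_audio * a + (1.0 - w_audio) * v
            for a, v in zip(audio_probs, video_probs)]

def classify(probs, classes):
    """Pick the class with the highest fused probability."""
    return classes[probs.index(max(probs))]
```

For example, if the audio strongly suggests "water" while the video is uncertain, the fused scores can still resolve to "water" — which is why audio helps for filling type, where the sound of pouring is highly informative.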
Your Edge Blog Team: We understand you recently led a team of Zebra computer vision researchers to second place in the prestigious CVPR Computer Vision Challenge. That competition was more focused on retail applications, right?
Andrea: Although that Challenge focused on retail applications, one of the core problems being tackled was the same multi-modal understanding capability mentioned before. Customer demand in retail is becoming more and more diversified, growing the need for methods such as attribute extraction for e-commerce products that require not only a single modality such as images, but also the textual data that describe those images. Bridging the gap between visual representation and high-level semantic concepts remains an open research topic. The aim of that Challenge was to design a system that can retrieve related images using text queries. The difficulty lies in the scale of the data, noisy image-caption pairs, and multi-lingual captions. The large-scale dataset proposed by the organizers is composed of ~4M image-caption pairs spanning ~100K fine-grained classes.
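A common pattern for this kind of text-to-image retrieval is to embed both queries and images into a shared vector space and rank images by cosine similarity to the query. The sketch below assumes the embeddings are already computed (by some model not shown here); it is a generic illustration, not the Challenge's reference system.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, image_vecs, top_k=2):
    """Return indices of the top_k images most similar to the text query."""
    ranked = sorted(range(len(image_vecs)),
                    key=lambda i: cosine_similarity(query_vec, image_vecs[i]),
                    reverse=True)
    return ranked[:top_k]
```

At the scale Andrea mentions (~4M images), a real system would replace the brute-force sort with an approximate nearest-neighbor index, but the ranking principle is the same.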
Think of it this way: if you’re walking into a store, how do you find what you’re looking for? You could query a system through voice command or text search, and it could give you directions to the product you are looking for. To do just that, though, designers may have spent months slotting products into the right categories and appending the most descriptive tags. The decisions are not always straightforward: Do blenders belong with pots and pans in kitchen supplies or in appliances alongside portable dishwashers? Gray areas like this demand nuanced interpretation and a keen understanding of how shoppers think.
Often, a product’s description identifies the features important to a particular shopper but in other cases, those attributes are instead folded into a series of hashtags. Sometimes, the photo, not the text, reveals crucial details, such as the profile of the perfect sofa or cut of a flattering dress. Fully comprehending a product means parsing the subtleties of this sort of multimodal content.
This will help retailers increase product visibility for customers, elevate the shopping experience and reduce missed sales caused by products that are hard to find on the shelves.
Your Edge Blog Team: Is there anything particularly interesting that you see researchers focused on right now in the computer vision space, either in competitions like these or more broadly with academic, government or private sector innovation projects?
Andrea: So far, many of the current breakthroughs in this domain are somehow related to more data, more computational power, and more specialized models. How can we break this “dependency”? For example, how can we generate better visual representations of the world? How can we rely less on data and their associated annotations? How can we leverage synthetic data? How can we create algorithms that train and run on smaller (less powerful) devices? Those are things that the scientific community has not fully addressed yet. And I think we are mature enough to start thinking about them and potentially addressing some of them.
Your Edge Blog Team: Where do you see the most promise for robot-human collaboration in the next five years with regards to front-line work? And how will computer vision innovation factor in?
Andrea: Augmenting front-line workers' capabilities is an area of focus for both industry and academia. Robots assist humans in terms of precision, speed, and force. Humans contribute experience, knowledge of how to execute the task, intuition, easy adaptation and learning, and an understanding of control strategies. This prospect, combined with the current supply chain challenges, will boost Human-Robot Collaboration (HRC) deployments and drive research into new and more complex applications.
Effective teamwork, as in the case of HRC, requires awareness among all members of the system. This is essential both for establishing a safe environment and for task planning and organization. Continuous advancements in Computer Vision, Natural Language Processing, Planning and Control strategies will be necessary to tackle these challenges.
Did You Know?
The Centre for Intelligent Sensing is a focal point for research in Intelligent Sensing at Queen Mary University of London. The Centre focuses on breakthrough innovations in computational intelligence that are expected to have a major impact in transforming human and machine utilization of multiple sensor inputs for interpretation and decision making. The Centre facilitates sharing of resources, exchange of ideas and results among researchers in the areas of theory and application of signal acquisition, processing, communication, abstraction, control and visualization. The expertise in the Centre includes camera and sensor networks, image and signal processing, computer vision, pattern recognition and learning, coding, 3D imaging, reconstruction and rendering, 3D graphics, bio-inspired computing, human-computer interaction, face and gesture recognition, affective computing and social signal processing, and data mining. The Centre provides post-graduate research and teaching in Intelligent Sensing and is responsible for the MSc program in Computer Vision.
For more insights on Zebra’s work in the computer vision and robotics space, check out these posts: