PourIt!: Weakly-supervised Liquid Perception from a Single Image for Visual Closed-Loop Robotic Pouring

International Conference on Computer Vision (ICCV), 2023

Fudan University

Our visual closed-loop robotic pouring. Unlike SAR-Net, our approach recovers a 3D point cloud of the liquid from the source container's pose and 2D liquid perception. This allows the robot to pour accurately based on visual feedback, even without depth measurements of the liquid.
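To make this concrete, below is a minimal Python/NumPy sketch of one way such a back-projection could work: 2D liquid pixels are lifted to 3D by intersecting camera rays with a plane anchored at the source container's spout. The plane assumption, the intrinsics K, and all function names are illustrative choices of ours, not the exact formulation used in the paper.

import numpy as np

def recover_liquid_points(liquid_pixels, K, spout_xyz, plane_normal):
    """Back-project 2D liquid pixels onto a plane through the source
    container's spout (a simplifying assumption, not the paper's exact model).

    liquid_pixels: (N, 2) array of (u, v) pixels on the detected liquid trajectory
    K:             (3, 3) camera intrinsic matrix
    spout_xyz:     (3,) spout position in the camera frame, from the 6-DoF container pose
    plane_normal:  (3,) normal of the assumed plane containing the stream
    """
    K_inv = np.linalg.inv(K)
    uv1 = np.hstack([liquid_pixels, np.ones((len(liquid_pixels), 1))])  # homogeneous pixels
    rays = (K_inv @ uv1.T).T                                            # viewing rays from the camera origin
    # Ray-plane intersection: t = (n . p0) / (n . d) for each ray direction d
    t = (plane_normal @ spout_xyz) / (rays @ plane_normal)
    return rays * t[:, None]                                            # (N, 3) liquid points in the camera frame

def liquid_to_container_distance(liquid_points, target_opening_xyz):
    """Distance from the lowest liquid point to the target container opening,
    usable as a visual feedback signal for closed-loop pouring."""
    lowest = liquid_points[np.argmax(liquid_points[:, 1])]  # +y points down in the camera frame
    return np.linalg.norm(lowest - target_opening_xyz)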

Abstract

Liquid perception is critical for robotic pouring tasks, as it usually requires robust visual detection of flowing liquid. Although recent works have shown promising results in liquid perception, they typically require labeled data for model training, which is both time-consuming and labor-intensive.

To this end, this paper proposes PourIt!, a simple yet effective framework that serves as a tool for robotic pouring tasks. We design a simple data collection pipeline that only needs image-level labels, reducing the reliance on tedious pixel-wise annotations. A binary classification model is then trained to generate a Class Activation Map (CAM) that focuses on the visual difference between the two kinds of collected data, i.e., images with and without liquid drops. We further devise a feature contrast strategy to improve the quality of the CAM so that it entirely and tightly covers the actual liquid regions. The container pose is then utilized to facilitate 3D point cloud recovery of the detected liquid region. Finally, the liquid-to-container distance is calculated for visual closed-loop control of the physical robot.
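For readers less familiar with CAMs, the minimal PyTorch sketch below shows how a class activation map can be obtained from a binary liquid/no-liquid classifier trained only with image-level labels. The small CNN backbone and all names are placeholders of ours (PourIt! uses a Transformer backbone); only the generic CAM mechanism is illustrated.

import torch
import torch.nn as nn

class BinaryCAMNet(nn.Module):
    """Toy binary classifier that exposes a class activation map (CAM).

    The per-class classifier weights re-weight the final feature maps,
    so a coarse liquid localization emerges without pixel-wise labels.
    """
    def __init__(self, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, 1)  # liquid vs. no-liquid

    def forward(self, x):
        feats = self.backbone(x)                          # (B, C, H, W)
        pooled = feats.mean(dim=(2, 3))                   # global average pooling
        logit = self.classifier(pooled)                   # image-level prediction
        # CAM: project feature maps with the classifier weights
        cam = torch.einsum('bchw,c->bhw', feats, self.classifier.weight[0])
        cam = torch.relu(cam)
        cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)  # normalize to [0, 1]
        return logit, cam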

To validate the effectiveness of our proposed method, we also contribute a novel dataset for this task, named the PourIt! dataset. Extensive results on this dataset and on a physical Franka robot demonstrate the utility and effectiveness of our method in robotic pouring tasks.



Method Overview

The workflow of our proposed PourIt! framework. Given positive and negative image samples with simple 0/1 image-level labels, we first use a Transformer backbone to extract features and MLP layers to predict the image class. The derived CAM is then used to index foreground and background pixels on the feature maps via a threshold $\epsilon$. To improve the quality of the CAM, we force the network to pull foreground features close together while pushing them apart from background features. At the inference stage, we use a pose estimation network to recover the 6-DoF poses of the source and target containers, and a perception network to estimate the potential liquid mask from the derived CAM. The liquid trajectory is then extracted according to the morphological shape of the mask. Finally, the point cloud of the liquid is reconstructed from the predicted 6-DoF object poses and the 2D liquid trajectory, providing real-time visual feedback of the liquid-to-container distance as high-level guidance for robot control.
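A rough sketch of the foreground/background feature contrast described above is given below. The prototype-based cosine formulation and the threshold name eps are our own simplifications under the stated CAM-thresholding idea, not necessarily the exact loss used in PourIt!.

import torch
import torch.nn.functional as F

def feature_contrast_loss(feats, cam, eps=0.5):
    """Pull foreground (liquid) features together and push them away from
    background features, using the CAM as a pseudo foreground/background index.

    feats: (B, C, H, W) feature maps from the backbone
    cam:   (B, H, W) class activation map normalized to [0, 1]
    eps:   CAM threshold separating foreground from background pixels
    """
    B, C, H, W = feats.shape
    f = feats.permute(0, 2, 3, 1).reshape(-1, C)    # (B*H*W, C) per-pixel features
    fg = cam.reshape(-1) > eps                      # foreground pixel index
    bg = ~fg
    if fg.sum() == 0 or bg.sum() == 0:
        return feats.new_zeros(())                  # degenerate CAM: skip the loss
    fg_proto = F.normalize(f[fg].mean(0), dim=0)    # foreground prototype
    bg_proto = F.normalize(f[bg].mean(0), dim=0)    # background prototype
    f_fg = F.normalize(f[fg], dim=1)
    # High similarity to the foreground prototype, low similarity to the background one
    pull = 1.0 - (f_fg @ fg_proto).mean()
    push = (f_fg @ bg_proto).clamp(min=0).mean()
    return pull + push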

Examples of our PourIt! Dataset

Experiments on Robotic Pouring

Generalization Ability on Novel Scenes

More Qualitative Results on Datasets

BibTeX

@inproceedings{lin2023pourit,
  title={PourIt!: Weakly-supervised Liquid Perception from a Single Image for Visual Closed-Loop Robotic Pouring},
  author={Lin, Haitao and Fu, Yanwei and Xue, Xiangyang},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2023}
}

Acknowledgements

We thank Dr. Connor Schenck for providing the UW Liquid Pouring Dataset.