The human mind does not see a visual scene as pixels; rather it is able to pick out objects of interest and selectively reason about them. Although object-based reasoning represents a core feature of human reasoning, only recently has it become possible to study it in robotics, thanks to advances in computer vision for segmentation and classification of objects from camera images. Our recent paper published at ICRA 2019, titled Multi-Object Search in Object-Oriented POMDPs, takes a stab at the problem by solving a novel multi-object search (MOS) task: without knowing in advance where the objects are located, a robot must find them in an indoor, roomed environment. We formulate the MOS task within a new framework called an object-oriented partially observable Markov decision process (OO-POMDP). An OO-POMDP represents the state and observation spaces in terms of classes and objects. The structure afforded by OO-POMDPs supports reasoning about each object independently while also providing a means for grounding language commands from a human at the start of a task. A human, for example, may issue an initial command such as “Find the mugs in the kitchen and books in the library,” and the robot can associate each object class with the named location so as to improve its search. We show that OO-POMCP with grounded language commands is sufficient for solving challenging MOS tasks both in simulation and on a physical mobile robot, which has applications for rescue and home robots.
Interestingly, the OO-POMDP allows the robot to recover from lies or mistakes in what a human tells it. If the person tells the robot “Find the mug in the kitchen,” the robot will first look in the kitchen for the mug. But after failing to find it there, it will then systematically search the rest of the environment. Our OO-POMCP inference algorithm allows the robot to quickly and efficiently use all the information it has to find the object.
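To make the recovery behavior above concrete, here is a minimal sketch of a per-object belief over rooms. It is not the OO-POMCP algorithm from the paper (which is a planning algorithm); it only illustrates how a language hint can act as a prior that a failed search then overrides. The room names, the `hint_weight` and `detection_rate` values, and all function names are assumptions made up for this example.

```python
from typing import Dict, List

Belief = Dict[str, float]  # room name -> probability that the object is there


def normalize(weights: Belief) -> Belief:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {room: w / total for room, w in weights.items()}


def language_prior(rooms: List[str], hinted_room: str, hint_weight: float = 0.8) -> Belief:
    """Put hint_weight on the room named in the command and spread the
    remainder evenly over the other rooms (hint_weight is an assumed value)."""
    others = [r for r in rooms if r != hinted_room]
    prior = {r: (1.0 - hint_weight) / len(others) for r in others}
    prior[hinted_room] = hint_weight
    return prior


def update_not_found(belief: Belief, searched_room: str, detection_rate: float = 0.9) -> Belief:
    """Bayes update after searching a room without detecting the object.
    P(no detection | object in searched room) = 1 - detection_rate
    P(no detection | object elsewhere)        = 1
    """
    posterior = {
        room: p * ((1.0 - detection_rate) if room == searched_room else 1.0)
        for room, p in belief.items()
    }
    return normalize(posterior)


rooms = ["kitchen", "library", "hallway"]

# One independent belief per object, mirroring the object-oriented factoring.
beliefs = {"mug": language_prior(rooms, "kitchen")}

# The human was wrong: the robot searches the kitchen and finds no mug.
beliefs["mug"] = update_not_found(beliefs["mug"], "kitchen")

# The most likely room is no longer the one the human named.
next_room = max(beliefs["mug"], key=beliefs["mug"].get)
print(beliefs["mug"], next_room)
```

Because the hint only shapes the prior, a single unsuccessful search is enough to shift most of the probability mass to the other rooms, which is the sense in which the robot can recover from a wrong command.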
Kevin Stacey and his team released a great article about progress in our drone project. I especially like the video! Thank you for a fantastic summer that included improved PID control, a working 3D Unscented Kalman Filter, and on-board, off-line SLAM, all on our little Raspberry Pi drone!
We also received the fantastic news that our IROS paper was accepted! Stay tuned for the presentation at IROS 2018. This paper describes the initial platform that we used for the course last fall. We are now on Version 3 of the hardware and software stack, and we can’t wait to see what else is in store!
Natural language commands issued to robots often specify not only a particular target configuration or goal state but also constraints on how the robot goes about its execution. That is, the path taken to achieve some goal state is given equal importance to the goal state itself. One example of this could be instructing a wheeled robot to “go to the living room but avoid the kitchen,” in order to avoid scuffing the floor. This class of behaviors poses a serious obstacle to existing approaches to language understanding for robotics, which map to either action sequences or goal state representations. Due to the non-Markovian nature of the objective, approaches in the former category must map to potentially unbounded action sequences, whereas approaches in the latter category would require folding the entirety of a robot’s trajectory into a (traditionally Markovian) state representation, resulting in an intractable decision-making problem. To resolve this challenge, we use a recently introduced probabilistic variant of Linear Temporal Logic (LTL) as a goal specification language for a Markov Decision Process (MDP). While demonstrating that standard neural sequence-to-sequence learning models can successfully ground language to this semantic representation, we also provide analysis that highlights generalization to novel, unseen logical forms as an open problem for this class of model. We evaluate our system within two simulated robot domains as well as on a physical robot, demonstrating accurate language grounding alongside a significant expansion in the space of interpretable robot behaviors.
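As a rough illustration of why the objective depends on the whole path, the example command can be read as the plain LTL formula "eventually living_room AND always not kitchen". The sketch below checks that formula over a finite trace of visited rooms. It uses ordinary finite-trace semantics rather than the probabilistic LTL variant used in the paper, and the proposition names and helper functions are hypothetical.

```python
from typing import List, Set

# A trace is the sequence of proposition sets that hold at each time step.
Trace = List[Set[str]]


def eventually(trace: Trace, prop: str) -> bool:
    """F prop: the proposition holds at some step of the trace."""
    return any(prop in step for step in trace)


def always_not(trace: Trace, prop: str) -> bool:
    """G !prop: the proposition holds at no step of the trace."""
    return all(prop not in step for step in trace)


def satisfies_command(trace: Trace) -> bool:
    """'Go to the living room but avoid the kitchen' as F living_room & G !kitchen."""
    return eventually(trace, "living_room") and always_not(trace, "kitchen")


# Both paths end in the same goal state, but only one respects the constraint.
safe_path = [{"hallway"}, {"dining_room"}, {"living_room"}]
shortcut = [{"hallway"}, {"kitchen"}, {"living_room"}]

print(satisfies_command(safe_path))
print(satisfies_command(shortcut))
```

The two traces are indistinguishable if only the final state is inspected, which is the reason a goal-state representation alone cannot capture this kind of command.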
Our paper was recently published at RSS 2018 and you can read it here!
Using light field methods, we can use Baxter’s monocular camera to localize the metal 0.24′ nut and corresponding bolt. The localization is precise enough to allow the robot to use the estimated poses to then perform an open-loop pick, place, and screw to put the nut on the bolt. The precise pose estimation enables complex routines to be quickly encoded, because once the robot knows where the parts are, it can perform accurate grasps and placement actions. You can read more about how it works in our RSS 2017 paper.
Our group was featured in this New Yorker article, showcasing Rebecca Pankow and John Oberlin’s work programming Baxter to pick petals from a daisy, as well as some of my thoughts on inequality and automation. I was thrilled with Sheelah’s work on this very important issue, focusing on the effects of automation and our changing economy.
Many tasks are too dangerous for humans to perform and would be better suited to a robot, such as defusing a bomb or repairing a nuclear reactor. Ideally, these robots would be autonomous, but robots cannot yet perform every task on their own. For robots to help with these problems today, they are directly controlled from afar by a human user, a practice called teleoperation. With this work, we set out to develop a teleoperation interface that is as intuitive and efficient as possible for completing the task.
We developed a virtual reality interface to allow novice users to efficiently teleoperate a robot and view its environment in 3D. We have released an open-source ROS package, ROS Reality, which allows anyone to connect a ROS network to a Unity scene over the internet via websockets. ROS topics can be sent to the Unity scene, and data from the Unity scene can be sent to the ROS network as a topic. This allows a human to perceive a scene and teleoperate the robot in it to perform a complex task, such as picking up a cup, as simply as they would in real life. We conducted a user study to compare the speed of our interface to traditional teleoperation methodologies, such as keyboard and monitor, and found a 66% increase in task completion under our system.
Below is a video of our system being used to teleoperate a Baxter robot at MIT from Brown University (41 miles away). Since our bandwidth requirements are about the same as a Skype call, we are able to establish a relatively low-latency connection that allows 12 cups to be easily stacked in a row. For more information, please check out our paper, which was accepted to ISRR 2017!
When humans collaborate, they constantly communicate their intents through body language and speech in order to arrive at a common understanding. Similarly, we would like robots to communicate their intents so that robots and humans can quickly converge on solving the same problem and avoid miscommunication. A particularly important intent for robots to communicate to humans is motion, because of the safety concerns involved with a robot actuating in close quarters with a human. Human-robot collaboration would be safer and more efficient if robots could communicate their motion intent to human partners in a quick and easy manner.
In our latest paper, we propose a mixed-reality head-mounted display interface that allows visualization of a robot’s future poses over the wearer’s real-world view of the environment. This allows a human to view the entire planned motion of the robot in the real workspace before it even moves, removing any potential issues of testing a real motion plan in the environment. To measure our interface’s ability to improve collaboration speed and accuracy, we conducted an experiment with real-world users to compare our interface to a traditional visualization and to no visualization. We found that our interface increased accuracy by 16% and decreased task completion time by 62% compared to traditional visualizations.
If you’d like to see a demo of the interface yourself, check out the video below! Watch as the robot visualizes different motion plans that attempt to move the arm from one side of the table to the other without hitting any of the blocks. As opposed to traditional monitor and keyboard interfaces, our MR headset allows users to inspect the real scene quickly and efficiently. The code for the system is available on GitHub here. For more information, see our paper, which was accepted into ISRR 2017!
Humans communicating with other humans use a feedback loop that enables errors to be detected and fixed, increasing overall interaction success. We aim to enable robots to participate in this feedback loop so that they elicit additional information from the person when they are confused and use that information to resolve ambiguity and infer the person’s needs. This technology will enable robots to interact fluidly with untrained users who communicate with them using language and gestures. People from all walks of life can benefit from robotic help with physical tasks, ranging from assisting a disabled veteran in his home by fetching objects to a captain coordinating with a robotic assistant on a search-and-rescue mission.
Our latest paper defines a mathematical framework for an item-fetching domain that allows a robot to increase the speed and accuracy of its ability to interpret a person’s requests by reasoning about its own uncertainty as well as processing implicit information (implicatures). We formalize the item delivery domain as a Partially Observable Markov Decision Process (POMDP), and approximately solve this POMDP in real time. Our model improves speed and accuracy of fetching tasks by asking relevant clarifying questions only when necessary. To measure our model’s improvements, we conducted a real-world user study with 16 participants. Our model is 2.17 seconds faster (25% faster) than a state-of-the-art baseline, while being 2.1% more accurate.
You can see the system in action in this video: when the user is close to the robot, it interprets the gesture and immediately selects the correct object without asking a question. However, when the user is farther away, the pointing gesture is more ambiguous, so the robot asks a targeted question. After the user answers the question, the robot selects the correct object. For more information, see our paper, which was accepted into ICRA 2017!
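The near-versus-far behavior in the video can be mimicked with a small belief update. The sketch below is a simplified stand-in, not the paper's model or its approximate POMDP solver: it assumes a Gaussian-shaped pointing likelihood whose angular noise grows with distance, and it replaces planning with a fixed confidence threshold. The item names, angles, noise values, and threshold are all invented for illustration.

```python
import math
from typing import Dict, Tuple

Belief = Dict[str, float]  # item name -> probability that the user wants it


def normalize(weights: Belief) -> Belief:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}


def pointing_likelihood(item_angles: Dict[str, float], pointed_angle: float,
                        noise_deg: float) -> Belief:
    """Unnormalized Gaussian-shaped likelihood of the observed pointing angle
    for each item; a larger noise_deg models a more ambiguous gesture."""
    return {
        item: math.exp(-0.5 * ((pointed_angle - angle) / noise_deg) ** 2)
        for item, angle in item_angles.items()
    }


def update(belief: Belief, likelihood: Belief) -> Belief:
    """Bayes rule: posterior is proportional to prior times likelihood."""
    return normalize({item: belief[item] * likelihood[item] for item in belief})


def choose_action(belief: Belief, confidence_threshold: float = 0.8) -> Tuple[str, str]:
    """Pick the most likely item if confident enough; otherwise ask about it."""
    best = max(belief, key=belief.get)
    if belief[best] >= confidence_threshold:
        return ("pick", best)
    return ("ask", best)


# Angular position of each item on the table, in degrees (assumed layout).
item_angles = {"mug": -10.0, "marker": 0.0, "bowl": 10.0}
uniform = {item: 1.0 / len(item_angles) for item in item_angles}

# Same gesture, seen with low noise (user nearby) and high noise (user far away).
near = update(uniform, pointing_likelihood(item_angles, 1.0, noise_deg=3.0))
far = update(uniform, pointing_likelihood(item_angles, 1.0, noise_deg=15.0))

print(choose_action(near))
print(choose_action(far))
```

With the sharp gesture the belief concentrates on one item and the robot acts immediately; with the noisy gesture no item clears the threshold, so a clarifying question is the better choice.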