29/07/2024

To be successful, domestic and personal care robots will have to stack and unstack, fetch and put back, empty and re-fill, screw and unscrew, switch on and off, open and close, and fold and put away literally dozens of things in your home. Today’s most popular home robot, the robot vacuum cleaner, is designed to do the exact opposite – avoid everything but the dirt.

The rapid advances in desktop AI are made possible by the massive amounts of data on which models can be trained. The challenge with training robots is getting enough video-based data of physical interactions in the unpredictable, complex environment of a home.

Researchers have come up with two very different approaches which may provide breakthroughs in the development of a generalist home robot.

It's more complicated than plugging AI into a robot

Given the huge advances in AI, a logical step would be to free Large Language Models (LLMs) from the desktop by giving them robotic arms and legs, as Google DeepMind researchers have said in a recent paper:

“High-capacity models pretrained on broad web-scale datasets provide an effective and powerful platform for a wide range of downstream tasks: large language models can enable not only fluent text generation… but emergent problem-solving... and creative generation of prose... and code..., while vision-language models enable open-vocabulary visual recognition… and can even make complex inferences about object-agent interactions in images... Such semantic reasoning, problem solving, and visual interpretation capabilities would be tremendously useful for generalist robots that must perform a variety of tasks in real-world environments.”

But that has proven much easier said than done. The problem is that AI and robots think in different languages and work in different realms. A robot’s instructions to move through its surrounding physical environment are expressed in Cartesian space, defined (as in a graph) by two perpendicular horizontal axes, the x-axis and the y-axis, plus a vertical z-axis to make a 3D space: for example, instructing a robot “x=100, y=0” means move straight ahead 100 mm. The robot will have a base Cartesian coordinate system and additional ones, called frames, for each moving part or tool. Within each frame, the robot has three primary motions: linear (A to B in a straight line), joint (A to B by a non-linear, potentially unpredictable path), and arc (moving around a fixed point at a constant radius).
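
To make this concrete, here is a minimal sketch of what robot-side motion commands look like. The `Robot` and `Pose` classes below are hypothetical placeholders standing in for a vendor's controller API, not any particular robot's interface:

```python
# Minimal sketch of Cartesian motion commands (hypothetical API, not a real controller).

from dataclasses import dataclass

@dataclass
class Pose:
    x: float  # mm along the x-axis (forward/back)
    y: float  # mm along the y-axis (left/right)
    z: float  # mm along the z-axis (up/down)

class Robot:
    def __init__(self, frame: str = "base"):
        self.frame = frame                 # each moving part or tool has its own frame
        self.pose = Pose(0.0, 0.0, 0.0)

    def move_linear(self, target: Pose) -> None:
        """Move from A to B in a straight line within the current frame."""
        print(f"[{self.frame}] linear move to ({target.x}, {target.y}, {target.z})")
        self.pose = target

    def move_joint(self, target: Pose) -> None:
        """Move from A to B along whatever path the joints find convenient."""
        print(f"[{self.frame}] joint move to ({target.x}, {target.y}, {target.z})")
        self.pose = target

    def move_arc(self, target: Pose, centre: Pose) -> None:
        """Move to B along an arc of constant radius around `centre`."""
        print(f"[{self.frame}] arc move to ({target.x}, {target.y}, {target.z}) about ({centre.x}, {centre.y}, {centre.z})")
        self.pose = target

# "x=100, y=0": drive straight ahead 100 mm in the base frame.
robot = Robot(frame="base")
robot.move_linear(Pose(x=100.0, y=0.0, z=0.0))
```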

LLMs don’t perceive or interact with the outside physical world but operate in their own heads (i.e. by applying the billions of parameters on which they have been trained), and they learn and provide output in terms of semantics, labels, and textual prompts.

Thus, LLMs produce outputs in words, but robots need inputs expressed as combinations of the x, y, and z axes.

The DeepMind researchers describe their breakthrough solution as “simple and surprisingly effective”. They started with powerful vision-language models (VLMs) – they used PaLI-X and PaLM-E – which take vision and language as input, produce free-form text, and are used in downstream applications such as object classification and image captioning. But instead of producing text outputs, the VLMs were trained on “robotic trajectories” (i.e. how a robotic limb would move) “by tokenizing the actions into text tokens and creating multimodal sentences that respond to robotic instructions paired with camera observations by producing corresponding actions". By getting the AI to think directly in actions, not words, the model can produce instructions that follow robotic policies: i.e. move the robotic arm in a linear, joint or arc motion along the x-axis, y-axis, or z-axis.
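
A rough sketch of the general idea of action tokenization is below. The bin count, normalised action range and token format are illustrative assumptions, not the exact scheme used in the DeepMind paper:

```python
# Rough sketch of "tokenizing actions into text tokens": each continuous action
# dimension (e.g. end-effector displacement along x, y, z) is discretised into a
# fixed number of bins, and each bin index is emitted as a token the language
# model can predict like any other word. Bin count and format are illustrative.

NUM_BINS = 256
ACTION_RANGE = (-1.0, 1.0)   # assumed normalised action range

def action_to_tokens(action: list[float]) -> str:
    lo, hi = ACTION_RANGE
    tokens = []
    for value in action:
        clipped = max(lo, min(hi, value))
        bin_index = int((clipped - lo) / (hi - lo) * (NUM_BINS - 1))
        tokens.append(str(bin_index))
    return " ".join(tokens)

def tokens_to_action(tokens: str) -> list[float]:
    lo, hi = ACTION_RANGE
    return [lo + int(t) / (NUM_BINS - 1) * (hi - lo) for t in tokens.split()]

# A (dx, dy, dz) displacement becomes a short "sentence" of integer tokens that
# can sit alongside the image and the instruction in the model's output.
print(action_to_tokens([0.1, -0.4, 0.0]))   # "140 76 127"
print(tokens_to_action("140 76 127"))       # back to (approximately) the original values
```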

The DeepMind researchers found that their AI-powered robot:

“[exhibits]… a range of remarkable capabilities, combining the physical motions learned from the robot data with the ability to interpret images and text learned from web data into a single model… [w]hile the model’s physical skills are still limited to the distribution of skills seen in the robot data, the model acquires the ability to deploy those skills in new ways by interpreting images and language commands using knowledge gleaned from the web.”

The DeepMind researchers also found that, by transferring knowledge from the web, the VLM-powered robot exhibited new capabilities beyond those demonstrated in the robot data, which they termed “emergent, in the sense that they emerge by transferring Internet-scale pretraining". For example, in the figure below, a standard robot would have to be instructed to move the object at Point A (e.g. x=0, y=0) to Point B (e.g. x=400, y=200), but the VLM-powered robot can simply be told to move the green circle to the yellow hexagon and it will identify the objects from its internet-scale pretraining.

[Figure: one-shot example – "Move the remaining blocks to the group"]

The VLM-powered robot also exhibited more surprising, complex emergent capabilities: 

  • When presented with a bowl of apples and a bowl of strawberries and told to put loose strawberries in the correct bowl, the VLM-powered robot successfully did so, which “requires a nuanced understanding of not only what a strawberry and bowl are, but also reasoning in the context of the scene to know the strawberry should go with the like fruits".
  • When told to “pick up the bag about to fall off the table”, the VLM-powered robot demonstrated “a physical understanding to disambiguate between two bags and recognize the precariously placed object".
  • The VLM-powered robot will successfully execute instructions such as “move the coke can to the person with glasses”, which the DeepMind researchers say “demonstrate[s] human-centric understanding and recognition".
  • When instructed using more sophisticated prompting techniques such as chain-of-thought, the VLM-powered robot can “figure out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink)" (a sketch of such a prompt follows this list).
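
As an illustration only, a chain-of-thought style prompt for a vision-language-action model might look something like the sketch below; the wording and format are assumptions, not the prompt used by the DeepMind researchers:

```python
# Illustrative chain-of-thought style prompt for a vision-language-action model.
# The format and wording are hypothetical, not DeepMind's actual prompt.

instruction = "I need to hammer a nail. Which object in the scene could help?"

prompt = (
    f"Instruction: {instruction}\n"
    "Plan: the rock is hard and heavy, so it can serve as an improvised hammer. "
    "Pick up the rock.\n"
    "Action:"  # the model completes this line with tokenised motion commands
)

print(prompt)
```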

There are limitations. The VLM-powered robot cannot learn any new robot motions beyond the linear, arc and joint movements seen in its robot data. It also requires extraordinary computing power: the AI model has 55 billion parameters, which is too large to run directly on the GPUs commonly used for real-time robot control. The DeepMind AI model had to be run on a multi-processor cloud with a network connection back to the robot – not ideal for a home deployment with patchy Internet access.

Hey robot, watch and learn

NYU and Meta researchers have tried to solve the problem of not having enough videos to train domestic robots by enabling consumers to teach the robots in their own homes.

The NYU/Meta researchers say that past efforts to develop “a holistic, automated home assistant” have the wrong end of the stick:

“...current research in robotics is predominantly either conducted in industrial environments or in academic labs, both containing curated objects, scenes, and even lighting conditions. In fact, even for the simple task of object picking or point navigation performance of robotic algorithms in homes is far below the performance of their lab counterparts. If we seek to build robotic systems that can solve harder, general-purpose tasks, we will need to re-evaluate many of the foundational assumptions in lab robotics.”

The NYU/Meta researchers used an AI training method called behaviour cloning, which is a machine learning approach where a model learns to perform a task by observing and imitating the actions and behaviours of humans.
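
A minimal sketch of behaviour cloning is below (in PyTorch); the network, dataset and action dimensions are placeholders for illustration, not the NYU/Meta architecture:

```python
# Minimal behaviour-cloning sketch (PyTorch): a policy network is trained to map
# camera observations to the actions a human demonstrator took.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder demonstration data: encoded camera frames paired with recorded
# actions (e.g. a 6D end-effector displacement plus a gripper open/close value).
observations = torch.randn(1000, 512)   # stand-in for encoded camera frames
actions = torch.randn(1000, 7)          # stand-in for demonstrator actions

policy = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 7),
)

optimiser = torch.optim.Adam(policy.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(observations, actions), batch_size=64, shuffle=True)

for epoch in range(10):
    for obs, act in loader:
        predicted = policy(obs)
        loss = nn.functional.mse_loss(predicted, act)  # imitate the human's action
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```

In practice the observations would be real camera frames encoded by a pretrained vision model and the actions the recorded gripper poses, but the training loop is essentially this simple.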

The NYU/Meta researchers recruited 22 households to collect videos of the performance of tasks in their homes to be used to train robots (the resulting dataset is called Homes of New York, or HoNY). To date, methods of collecting videos for training robots have been clunky and expensive: the robot is fitted with a camera-laden exoskeleton, or one robot watches and records another robot attempting tasks – approaches which are impractical in domestic settings. Also, watching a video of a human hand and arm twisting a top off a bottle is not immediately translatable for a robot with an extendable ‘limb’ fitted with a couple of claws.

So, the NYU/Meta researchers armed their volunteers with a $25 gripper stick (like the ones you pick up rubbish with), fitted with an iPhone, to film themselves performing household tasks as if they were a robot. As the NYU/Meta researchers noted:

“The stick helps the user intuitively adapt to the limitations of the robot, for example by making it difficult to apply large amounts of force. Moreover, the iPhone Pro (version 12 or newer), with its camera setup and internal gyroscope, allows the stick to collect RGB [red, green, blue] image and depth data at 30 frames per second, and its 6D position (translation and rotation).”
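
Put another way, each recorded frame bundles colour, depth and the Stick's 6D pose. The sketch below shows what a single sample might contain; the field names and shapes are assumptions for illustration, not the dataset's actual schema:

```python
# Sketch of what one recorded Stick frame might contain, based on the data
# described above (RGB image, depth, and 6D pose at 30 fps). Field names and
# shapes are assumptions for illustration.

from dataclasses import dataclass
import numpy as np

@dataclass
class StickFrame:
    timestamp: float          # seconds since recording started (30 fps -> ~0.033 s apart)
    rgb: np.ndarray           # H x W x 3 colour image
    depth: np.ndarray         # H x W depth map from the iPhone's depth sensor
    translation: np.ndarray   # (x, y, z) position of the gripper tip
    rotation: np.ndarray      # (roll, pitch, yaw) orientation -> 6D pose in total
    gripper_open: float       # assumed: how far the gripper jaws are open, 0.0-1.0
```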

The volunteers were asked to focus on eight tasks, including button switching, door opening, door closing, drawer opening, drawer closing, pick and place, and handle grasping. They were also asked to record themselves playing with the stick around their homes in more random actions, because “[s]uch playful behaviour has in the past proven promising for representation learning purposes” (which seems another way of saying robots can learn from goofy stuff humans do!).

With this diverse home dataset, the NYU/Meta researchers trained a foundational visual imitation model that can be easily modified (fine-tuned) and deployed in homes. When the robot is deployed in a home, the model is fine-tuned to that novel environment and its novel tasks: the householder uses the stick to collect five-minute videos of themselves performing 24 or so tasks around their house. The model is then fine-tuned for 50 passes over this data (each pass called an epoch), which takes about another 20 minutes.
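
As a rough sketch of that fine-tuning step, the snippet below assumes (for illustration) that the pretrained visual encoder is kept frozen and only a small policy head is updated on the freshly collected Stick data; the module and variable names are placeholders, not the NYU/Meta code:

```python
# Sketch of the in-home fine-tuning step: 50 epochs over the householder's
# five-minute Stick recordings, updating only a small policy head on top of a
# frozen pretrained encoder (an assumption made for this illustration).

import torch
import torch.nn as nn

def fine_tune(encoder: nn.Module, head: nn.Module, home_loader,
              epochs: int = 50, lr: float = 1e-4) -> nn.Module:
    encoder.eval()                                   # keep pretrained features fixed
    for p in encoder.parameters():
        p.requires_grad = False
    optimiser = torch.optim.Adam(head.parameters(), lr=lr)

    for epoch in range(epochs):                      # 50 passes over the new home data
        for frames, actions in home_loader:          # pairs extracted from the Stick videos
            with torch.no_grad():
                features = encoder(frames)
            loss = nn.functional.mse_loss(head(features), actions)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()

    return nn.Sequential(encoder, head)              # the policy deployed on the robot
```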

Once fine-tuned, the robot could perform a broad range of tasks in its new home environment, with an 81% overall success rate on the first attempt.

However, performance was also variable across tasks and in different settings:

  • The robots scored 100% success on flipping cushions and opening and closing an air fryer (what more could you want in life).
  • Tasks which require ‘wrist turning’ (yaw, pitch or roll in robot movements) had a much lower success rate (56% on average), although the NYU/Meta researchers thought this could be solved with more data.
  • Robots were ‘bamboozled’ if the location from which they were moving an object looked much like the location to which they were instructed to move the object (e.g. between different shelves of an IKEA bookcase) because, not having a temporal sense, they quite literally do not know whether they are coming or going when looking at similar looking places.
  • Shadows also could throw the robot: “the primary example of this is from Home 1 Air Fryer Opening, where the strong shadow of the robot arm caused our policy to fail. Once we turned on an overhead light for even lighting, there were no more failures”.
  • Like vampires, robots don’t like mirrors: 

“A… problem with reflective surfaces like mirrors is that we collect demonstrations using the Stick but run the trained policies on the robot. In front of a mirror, the demonstration may actually end up recording the demo collector in the mirror. Then, once the policy is executed on the robot, the reflection on the mirror captures the robot instead of the demonstrator, and so the policy goes out-of-distribution and fails.”

  • Like a goldfish swimming around a bowl, robots also have a memory problem:

“With a single first person point of view on the Stick, the robot needs to either see or remember large parts of the scene to operate on it effectively. However, there is a dearth of algorithms that can act as standalone memory module for robots.”

Will robots be comfortable in their own skin?

Finally, researchers in Japan have developed living skin tissue which can be bonded to the metallic frame of a robot. Not only could this make domestic robots more humanoid in appearance, but, as with us humans, the robot’s skin could also allow it to display reactions (the ultimate smiley face), sense the surrounding environment and improve its dexterity.

So what about my laundry?

Both sets of researchers characterise their developments as early steps to a reliable domestic robot. But sadly, MIT Technology Review says that doing your laundry is probably one of the last tasks domestic robots will conquer: 

“And now for the bad news: even the most cutting-edge robots still cannot do laundry. It’s a chore that is significantly harder for robots than for humans. Crumpled clothes form weird shapes which makes it hard for robots to process and handle.”

More information: On Bringing Robots Home

""