I’ll admit that I’ve tiptoed around the topic a bit in Actuator due to its sudden popularity (SEO gods be damned). I’ve been in this business long enough to instantly be suspicious of hype cycles. That said, I totally get it this time around.
While it’s true that various forms of machine learning and AI touch our lives every day, the emergence of ChatGPT and its ilk present something far more immediately obvious to the average person. Typing a few commands into a dialog box and getting an essay, a painting or a song is a magical experience – particularly for those who haven’t followed the minutiae of this stuff for years or decades.
If you’re reading this, you were probably aware of these notions prior to the last 12 months, but try to put yourself in the shoes of someone who sees a news story, visits a site and then seemingly out of nowhere, a machine is creating art. Your mind would be, in a word, blown. And rightfully so.
Over the past few months, we’ve covered a smattering of generative AI-related stories in Actuator. Take last week’s video of Agility using generative AI to tell Digit what to do with a simple verbal cue. I’ve also begun speaking to founders and researchers about potential applications in the space. The sentiment shifted quite quickly from “this is neat” to “this could be genuinely useful.”
Learning has, of course, been a giant topic in robotics for decades now. It also, fittingly, seems to be where much of the potential generative AI research is heading. What if, say, a robot could predict all potential outcomes based on learning? Or how about eliminating a lot of excess coding by simply telling a robot what you want it to do? Exciting, right?
When I’m fascinated by a topic in the space, I always do the same thing: find much smarter people to berate with questions. It’s gotten me far in life. This time out, our lucky guests are a pair of UC Berkeley professors who have taught me a ton about the space over the years (I recommend getting a few for your rolodex).
Pieter Abbeel is the Director of the Berkeley Robot Learning Lab and co-founder/President/Chief Scientist of Covariant, which uses AI and rich data sets to train picking robots. Along with helping Abbeel run BAIR (Berkeley AI Research Lab), Ken Goldberg is the school’s William S. Floyd Jr. Distinguished Chair in Engineering and the Chief Scientist and co-founder of Ambi Robotics, which uses AI and robotics to tackle package sorting.
Let’s start with the broad question of how you see generative AI fitting into the broader world of robotics.
There are two big trends happening at the same time. There is a trend of foundation models and a trend of generative AI. They are very intertwined, but they’re distinct. A foundation model is a model that is trained on a lot of data, including data that is only tangentially possibly related to what you care about. But it’s still related. But by doing so, it gets better at the things you care about. Essentially, all generative models are foundation models, but there are foundation models that are not generative AI models, because they do something else. The Covariant Brain is a foundation model, the way it’s set up right now. Since day one, back in 2017, we’ve been training all the items that we could possibly run across. But in any given deployment, we only care about, for example, electrical supplies, or we only care about apparel, or we only care about groceries only.
It’s a shift in paradigm. Traditionally, people would have said, ‘Oh, if you’re going to do groceries, groceries, groceries, groceries. That’s all your training, a neural network based on groceries. That’s not what we’ve been doing. It’s all about chasing the long tail of edge cases. The more things you’ve seen, the better you can make sense of an edge case. The reason it works is because the neural networks become so big. If you know networks are small, all this tangentially related stuff is going to perturb your knowledge of the most important stuff. But there’s so much they can keep absorbing. It’s like a massive sponge keeps absorbing things. You’re not hurting anything by putting in this extra stuff. You’re actually helping a little bit more by doing that.
It’s all about learning, right? It’s a big thing everyone is trying to crack right now. The broader foundation model is to train it on as large a dataset as possible.
Yes, but the key is not just large. It’s very diverse. I’m not just doing only groceries. I’m gonna pick groceries, but I’m also training on all the other stuff that I might pick at another warehouse into the same foundation model to have a general understanding of all objects, which is a better way to learn not only about groceries. You never know what will pop up in the mix of those groceries. There’s always going to be a new item. You never have everything covered. So, you need to generalize to new items. Your chance of generalizing to new items well, the probability is higher if you’ve covered a very wide spectrum of other things.
The larger the neural network, the more it understands the world, broadly speaking.
Yeah. That really is the key. That is what’s going to unlock AI-powered robotics applications, whether it’s picking or self-driving and so forth – it’s the ability to absorb so much. But if we switch gears and think about generative AI specifically, there are thing you can imagine it playing a role in. If you think of generative, what does it mean compared to previous generations of AI? At its core, it means it’s generative data. But how is that different from generating labels? If I give it an image and it says “cat,” that’s also generating data. It’s just that it’s able to generate more data. Again, that relates to the neural network. The neural networks are larger, which allows them not only to analyze larger things, but to generate larger things in a consistent way.
In robotics, there are a few angles. One is building a deeper understanding of the world. Instead of me saying, “I’m going to label data to teach the neural network,” I can say, “I’m going to record a video of what happens,” and my generative model needs to predict the next frame, the next frame, the next frame. And by forcing it to understand how to predict the future, I’m forcing it to understand how the world works.
Oftentimes when I talk to people about the different forms of learning, it’s almost discussed as though they’re in conflict with each other, but in this case, it’s two different kinds of learning effectively working in tandem.
Yes. And again, because the networks are so large, we train neural networks to predict future frames. By doing that, in addition to training them to output the optimal actions for a certain task, they actually learn to output the actions much quicker, from far less data. You’re giving it two tasks and it learns how to do the one task, because the two tasks are related. Predicting the next frame is such a difficult thinking exercise, you force it to think through so much more to predict actions that it predicts actions much, much faster.
In terms of a practical real-world application – say, in an industrial setting, learning to screw something in – how does learning to predict the next thing inform its action?
This is a work in progress. But the idea is that there are different ways of teaching a robot. You can program it. You can give it demonstrations. It can learn from reinforcement, where it learns from its own trial and error. Programming has seen its limitations. It’s not really going beyond what we’ve seen for a long time in car factories.
Let’s say I was a self-driving car. If my robot can predict the future at all times, it can do two things. The first is having a deep understanding of the world and with a little extra learning, pick the right action. In addition, it has another option. If it wants to do a lot of work in the moment, it can simulate scenarios. It can also simulate the traffic around it. That’s where this is headed.
These are all of the possible outcomes I can see. This is the best outcome, I’m going to do that.
Correct. There are other things we can do in generative AI with robotics. Google has had some results, where what they said is, what if we bolt some things together. One of the big challenges with robotics has been high-level reasoning. There are two challenges: 1. how do you do the actual motor skill and 2. what should you actually do. If someone asked you, ‘make me scrambled eggs,’ what does that even mean? And that’s where generative AI models come in handy in a different way. They’re pre-trained. The simplest version just uses language. Making scrambled eggs, you can break that down into:
- Go get the eggs from the fridge
- Get the frying pan
- Get the butter.
The robot can go to the fridge. It might ask what to do with the fridge, and then the model says:
- Go to the fridge
- Take the thing out of the fridge
The whole thing in robotics has traditionally been logic or task planning, and people who have to program it in somehow have to describe the world in terms of logical statements that somehow come after each other, and so forth. The language models kind of seem to take care of it in a beautiful way. That’s unexpected to many people.
How do you see generative AI’s potential in robotics?
The core concept here is the transformer. The transformer network is very interesting, because it looks at sequences. It’s able to essentially get very good at predicting the next item. It’s astoundingly good at that. It works for words, because we only have a relatively small number of words in the English language. At best, the Oxford English Dictionary I think has about half a million. But you can get by with far fewer than that. And you have plenty of examples, because every string of text gives you an example of words and how they’re put together. It’s a beautiful sweet spot. You can show it a lot of examples, and you have relatively few choices to make at every step. It turned out it can predict that extremely well.
The same is true to sequences of sounds, so this can also be used for audio processing and prediction. You can train it very similar. Instead of words, you have sequences of phonemes coming in. You give it a lot of strings of music or voice, and then it will be able to predict the next sound signal or phoneme. It can also be used for images. You have a string of images and it can be used to predict the next image.
Pieter was talking about using video to predict what’s going to happen in the next frame. It was effectively like having the robot think in video.
Kind of, yeah. If you can now predict the next video, the next thing you can add in there is the control. If I add my control in there, I can predict what comes out if I do action A or B. I can look at all of my actions and pick the action that gets me closer to what I want to see. Now I want to get it to the next level, which is where I look at the next scene and have voxels. I have these three-dimensional volumes. I want to train it on those and say, “here’s my current volume, and here’s the volume I want to have. What actions do I need to perform to get me there?”
When you’re talking about volumes, you mean where the robot exists in space?
Yeah, or even what’s happening in front of you. If you want to clean up the dishes in front of you, the volume is where all those dishes are. Then you say, “what I want is a clear table with none of those dishes on it.” That’s the volume I want to get to, so now I have to find the sequence of actions that will get from the initial state, which is what I’m looking at now, to the final state, which is where I have no dishes anymore.
Based on videos the robot has been trained on, it can extrapolate what to do.
In principle, but not from videos of people. That’s problematic. Those videos are shot from an odd angle. You don’t know what motions they’re doing. It’s hard for the robot to know that. What you do is essentially have the robot self-learn by having the camera. The robot tries things out and learns over time.
A lot of the applications I’m hearing about are based around language commands. You say something, the robot is able to determine what you mean and execute it in real-time.
That’s a different thing. We now have a tool that can handle language very well. And what’s cool about it is that it gives you access to the semantics of a scene. A very well known paper from Google did the following: You have a robot and you say “I just spilled something, and I need help to clean it up.” Typically the robot wouldn’t know what to do with that, but now you have language. You run that into ChatGPT and it generates: “get a sponge. Get a napkin. Get a cloth. Look for the spilled can, make sure it can pick that up.” All of that stuff can come out. What they do is exactly that: They take all of that output and they say, “is there a sponge around? Let me look for a sponge.”
The connection between semantic of the world – a spill and a sponge – ChatGPT is very good at that. That fills a gap that we’ve always had. It’s called the Open World Problem. Before that, we had to program in every single thing it was going to encounter. Now we have another source that can make these connections that we couldn’t make before. That’s very cool. We have a project on that’s called Language Embedded Radiance Field. It’s brand new. It’s how to use that language to figure out where to pick things up. We say, “here’s a cup. Pick it up by the handle,” and it seems to be able to identify where the handle is. It’s really interesting.
You’re obviously very smart people and you know a lot about generative AI, so I’m curious where the surprise comes in.
We always get surprised when these systems do things that we didn’t anticipate. That’s when robotics is at its best, when you give it a setup, and it suddenly does something.
It does the right thing for once.
Exactly! That’s always a surprise in robotics!
One more bit of generative AI before we move on for the week. Researchers at Switzerland’s EPFL University are highlighting robots making robots. I’m immediately reminded of RepRap, which gave rise to the desktop 3D printing space. Launched in 2005, the project started with the mission of creating “humanity’s first general-purpose, self-replicating manufacturing machine.” Effectively, the goal was to create a 3D printer that could 3D-print itself.
For this project, the researchers used ChatGPT to generate the design for a product-picking robot. The team suggests language models “could change the way we design robots, while enriching and simplifying the process.”
Computational Robot Design & Fabrication Lab head Josie Hughes adds, “Even though Chat-GPT is a language model and its code generation is text-based, it provided significant insights and intuition for physical design, and showed great potential as a sounding board to stimulate human creativity.”
Some light lanternfly extermination
An interesting pair of research with some common DNA also crossed my desk this week. Anyone who’s seen a spotted lanternfly in person knows how beautiful they can be. The China native insect flutters around on wings that flash sharp swaths of red and blue. Anyone who’s seen a spotted lanternfly on the Eastern U.S. seaboard, however, knows that they’re invasive species. Here in New York, there’s a statewide imperative to destroy the buggers on command.
The CMU Robotics Institute designed TartanPest as part of Farm-ng’s Farm Robotics Challenge. The system features a robotic arm mounted atop a Farm-ng tractor, which is designed to spot and spray masses of lanternfly eggs – destroying the bugs before they hatch. The robot, “uses a deep learning model refined on an augmented image data set created from 700 images of spotted lanternfly egg masses from iNaturalist to identify them and scrape them off surfaces.
For the record, nowhere in Asimov’s robotics laws are lanternflies mentioned.
Meanwhile, ABB this week showcased what it calls “the world’s most remote robot.” A product of a collaboration with the nonprofit group JungleKeepers, the system effectively uses an ABB arm to automate seed collection, planting and watering in a bid to promote reforestation.
There’s a big open question around efficacy and scalability, and this is certainly a nice PR play from the automation giant, but if this thing can make even small progress amid rapid deforestation, I’m all for it.
Sweaters for robots
One more CMU project that I missed a couple of weeks back. RobotSweater is not a robotic sweater, but rather a robot in a sweater (SweaterRobot might have been more apt). Regardless, the system uses knitted textiles as touch-sensitive skin. Per the school:
Once knitted, the fabric can be used to help the robot “feel” when a human touches it, particularly in an industrial setting where safety is paramount. Current solutions for detecting human-robot interaction in industry look like shields and use very rigid materials that Liu notes can’t cover the robot’s entire body because some parts need to deform.
Once attached to the robot (in this case, an industrial arm), the e-textile can sense the force, direction and distribution of touch, sensitivities that could help these systems more safely work alongside people.
“In their research, the team demonstrated that pushing on a companion robot outfitted in RobotSweater told it which way to move or what direction to turn its head,” CMU says. “When used on a robot arm, RobotSweater allowed a push from a person’s hand to guide the arm’s movement, while grabbing the arm told it to open or close its gripper.”
Capping off an extremely research heavy edition of Actuator – and returning once again to Switzerland’s EPFL – is Mori3. The little robot is made up of a pair of triangles that can form into different shapes.
“Our aim with Mori3 is to create a modular, origami-like robot that can be assembled and disassembled at will depending on the environment and task at hand,” Reconfigurable Robotics Lab director, Jamie Paik. “Mori3 can change its size, shape and function.”
The system recalls a lot of fascinating work concurrently happening in the oft-intersecting fields of modular and origami robotics. The systems communicate with one another and attach to form complex shapes. The team is targeting space travel as a primary application for this emerging technology. Their small, flat design make them much easier to pack on a shuttle than a preassembled bot. And let’s be honest, no one wants to spend a ton of time piecing together robots like Ikea furniture after blasting off.
“Polygonal and polymorphic robots that connect to one another to create articulated structures can be used effectively for a variety of applications,” says Paik. “Of course, a general-purpose robot like Mori3 will be less effective than specialized robots in certain areas. That said, Mori3’s biggest selling point is its versatility.”
3…2…1…we have Actuator.