From the TED Talk by Joseph Redmon: How computers learn to recognize objects instantly
Unscramble the Blue Letters
If we speed this up by another factor of 10, this is a detector rinnnug at five frames per second. This is a lot better, but for example, if there's any significant mmovenet, I wouldn't want a stysem like this driving my car.
This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide vretiay of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.
(Applause)
So in just a few years, we've gone from 20 seconds per image to 20 meicllosdnis per igame, a thousand times faster. How did we get there? Well, in the past, object detection ssymtes would take an image like this and split it into a bunch of regions and then run a cflsiiesar on each of these rengios, and high scores for that classifier would be considered dencieotts in the image. But this involved running a classifier thousands of tmeis over an image, tdoasuhns of neural network evaluations to produce detection. Instead, we tiarned a slgnie network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
Open Cloze
If we speed this up by another factor of 10, this is a detector _______ at five frames per second. This is a lot better, but for example, if there's any significant ________, I wouldn't want a ______ like this driving my car.
This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide _______ of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.
(Applause)
So in just a few years, we've gone from 20 seconds per image to 20 ____________ per _____, a thousand times faster. How did we get there? Well, in the past, object detection _______ would take an image like this and split it into a bunch of regions and then run a __________ on each of these _______, and high scores for that classifier would be considered __________ in the image. But this involved running a classifier thousands of _____ over an image, _________ of neural network evaluations to produce detection. Instead, we _______ a ______ network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
Solution
- system
- milliseconds
- thousands
- systems
- regions
- trained
- image
- times
- classifier
- movement
- running
- detections
- variety
- single
Original Text
If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.
This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.
(Applause)
So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.
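The contrast described in the talk can be illustrated with a rough count. The sketch below is not the YOLO implementation: it only enumerates the regions a region-based pipeline might classify one by one and compares that with a single network pass, then checks the speed arithmetic quoted in the talk. The image size, window size, and stride are hypothetical values chosen for illustration, and `sliding_regions` and `frames_per_second` are made-up helper names.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_regions(img_w: int, img_h: int, win: int, stride: int) -> List[Box]:
    """List the square regions a region-based detector would classify one at a time."""
    return [
        (x, y, win, win)
        for y in range(0, img_h - win + 1, stride)
        for x in range(0, img_w - win + 1, stride)
    ]


def frames_per_second(seconds_per_image: float) -> float:
    """Convert processing time per image into a frame rate."""
    return 1.0 / seconds_per_image


# Hypothetical settings (assumptions, not from the talk): 448x448 image,
# 64-pixel window, 8-pixel stride.
regions = sliding_regions(img_w=448, img_h=448, win=64, stride=8)
region_based_evaluations = len(regions)  # one classifier run per region: 2401
single_pass_evaluations = 1              # "you only look once": one run per image

# Speed figures quoted in the talk.
speedup = 20.0 / 0.020                   # 20 seconds vs 20 milliseconds per image

print(f"region-based evaluations: {region_based_evaluations}")
print(f"single-pass evaluations:  {single_pass_evaluations}")
print(f"speedup: {speedup:.0f}x")
print(f"0.2 s per image  -> {frames_per_second(0.2):.0f} frames per second")
print(f"20 ms per image  -> {frames_per_second(0.020):.0f} frames per second")
```

With these illustrative settings the region-based approach needs thousands of classifier runs per image, while the single-network approach needs one. Five frames per second corresponds to 0.2 seconds per image, and 20 milliseconds per image corresponds to 50 frames per second.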
Frequently Occurring Word Combinations
ngrams of length 2
collocation | frequency
computer vision | 5
object detection | 4
real time | 3
neural network | 2
bounding boxes | 2
times faster | 2
detection system | 2
stop signs | 2
Important Words
- applause
- bounding
- boxes
- build
- bunch
- call
- car
- cat
- class
- classifier
- computer
- considered
- detection
- detections
- detector
- dog
- driving
- evaluations
- factor
- faster
- frame
- frames
- great
- high
- image
- interact
- involved
- laptop
- limited
- lot
- method
- milliseconds
- move
- movement
- network
- neural
- object
- pose
- probabilities
- process
- produce
- produces
- real
- regions
- robust
- run
- running
- scores
- seconds
- significant
- simultaneously
- single
- size
- smoothly
- speed
- split
- system
- systems
- thousand
- thousands
- time
- times
- top
- tracks
- trained
- variety
- video
- vision
- wide
- years
- yolo