From the TED Talk by Joseph Redmon: How computers learn to recognize objects instantly


Unscramble the Blue Letters


If we speed this up by another factor of 10, this is a detector rinnnug at five frames per second. This is a lot better, but for example, if there's any significant mmovenet, I wouldn't want a stysem like this driving my car.

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide vretiay of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 meicllosdnis per igame, a thousand times faster. How did we get there? Well, in the past, object detection ssymtes would take an image like this and split it into a bunch of regions and then run a cflsiiesar on each of these rengios, and high scores for that classifier would be considered dencieotts in the image. But this involved running a classifier thousands of tmeis over an image, tdoasuhns of neural network evaluations to produce detection. Instead, we tiarned a slgnie network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.

Open Cloze


If we speed this up by another factor of 10, this is a detector _______ at five frames per second. This is a lot better, but for example, if there's any significant ________, I wouldn't want a ______ like this driving my car.

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide _______ of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 ____________ per _____, a thousand times faster. How did we get there? Well, in the past, object detection _______ would take an image like this and split it into a bunch of regions and then run a __________ on each of these _______, and high scores for that classifier would be considered __________ in the image. But this involved running a classifier thousands of _____ over an image, _________ of neural network evaluations to produce detection. Instead, we _______ a ______ network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.

Solution


  1. system
  2. milliseconds
  3. thousands
  4. systems
  5. regions
  6. trained
  7. image
  8. times
  9. classifier
  10. movement
  11. running
  12. detections
  13. variety
  14. single

Original Text


If we speed this up by another factor of 10, this is a detector running at five frames per second. This is a lot better, but for example, if there's any significant movement, I wouldn't want a system like this driving my car.

This is our detection system running in real time on my laptop. So it smoothly tracks me as I move around the frame, and it's robust to a wide variety of changes in size, pose, forward, backward. This is great. This is what we really need if we're going to build systems on top of computer vision.

(Applause)

So in just a few years, we've gone from 20 seconds per image to 20 milliseconds per image, a thousand times faster. How did we get there? Well, in the past, object detection systems would take an image like this and split it into a bunch of regions and then run a classifier on each of these regions, and high scores for that classifier would be considered detections in the image. But this involved running a classifier thousands of times over an image, thousands of neural network evaluations to produce detection. Instead, we trained a single network to do all of detection for us. It produces all of the bounding boxes and class probabilities simultaneously. With our system, instead of looking at an image thousands of times to produce detection, you only look once, and that's why we call it the YOLO method of object detection. So with this speed, we're not just limited to images; we can process video in real time. And now, instead of just seeing that cat and dog, we can see them move around and interact with each other.

Frequently Occurring Word Combinations


ngrams of length 2

collocation frequency
computer vision 5
object detection 4
real time 3
neural network 2
bounding boxes 2
times faster 2
detection system 2
stop signs 2
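
A table like the one above can be approximated by counting adjacent word pairs. The following is a minimal sketch only: the function name `bigram_counts`, the stopword list, and the sentence-splitting rule are assumptions made for illustration, not the method actually used to build this table. The sample text is an abridged excerpt, so its counts will not match the frequencies listed above, which appear to come from the whole talk.

```python
import re
from collections import Counter

# Hypothetical stopword list (an assumption); the real filter is unknown.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and",
             "is", "this", "we", "it", "for", "our", "at"}

def bigram_counts(text, min_count=2):
    """Count adjacent word pairs inside each sentence, ignoring pairs with a stopword."""
    counts = Counter()
    # Split on sentence-ending punctuation so pairs never span two sentences.
    for sentence in re.split(r"[.!?;]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for first, second in zip(words, words[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[(first, second)] += 1
    # Keep only pairs that occur at least min_count times, most frequent first.
    return [(" ".join(pair), n) for pair, n in counts.most_common() if n >= min_count]

# Abridged sentences from the excerpt, used only as sample input.
sample = (
    "This is our detection system running in real time on my laptop. "
    "We can process video in real time. "
    "We call it the YOLO method of object detection. "
    "In the past, object detection systems would split an image into regions."
)

for collocation, frequency in bigram_counts(sample):
    print(collocation, frequency)
```

In this sample only "real time" and "object detection" occur twice, so they are the only pairs printed. Raising or lowering `min_count` changes how many pairs are reported.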



Important Words


  1. applause
  2. bounding
  3. boxes
  4. build
  5. bunch
  6. call
  7. car
  8. cat
  9. class
  10. classifier
  11. computer
  12. considered
  13. detection
  14. detections
  15. detector
  16. dog
  17. driving
  18. evaluations
  19. factor
  20. faster
  21. frame
  22. frames
  23. great
  24. high
  25. image
  26. interact
  27. involved
  28. laptop
  29. limited
  30. lot
  31. method
  32. milliseconds
  33. move
  34. movement
  35. network
  36. neural
  37. object
  38. pose
  39. probabilities
  40. process
  41. produce
  42. produces
  43. real
  44. regions
  45. robust
  46. run
  47. running
  48. scores
  49. seconds
  50. significant
  51. simultaneously
  52. single
  53. size
  54. smoothly
  55. speed
  56. split
  57. system
  58. systems
  59. thousand
  60. thousands
  61. time
  62. times
  63. top
  64. tracks
  65. trained
  66. variety
  67. video
  68. vision
  69. wide
  70. years
  71. yolo