Dr Jamie Shotton had joined the Machine Learning & Perception group at Microsoft Research Cambridge (MSRC) as a post-doc in June 2008, and had been there only a few months when he was roped in by the Xbox product group to help launch the Kinect by Christmas 2010.
He shared the experience with 4th-year undergraduate students at the University of Cambridge Engineering Department earlier this year.
(Image caption: The body was divided into 31 different body parts to be recognised and reconstituted into a human pose.)
I was browsing through the university’s newsletter last week when I came upon this interesting story about some of the developmental challenges of the Microsoft Kinect for Xbox 360 and how they were surmounted. You can read the full original article here. Images used in this posting are from the original article.
The Kinect for Xbox 360 is a motion sensing input device for the Xbox 360 game console. Based around a webcam-style add-on accessory for the console, it allows users to control and interact with the Xbox 360 without the need to touch or hold a game controller such as a joystick, relying instead on bodily gestures and spoken commands.
(Image caption: Dr Jamie Shotton from the Cambridge research laboratory in the UK.)
Shotton now works for Microsoft at their Cambridge research laboratory in the UK, having completed his PhD research in computer vision between 2003 and 2007. His initial research at MSRC was on automatic visual object recognition: teaching computers to recognise different types of objects in photographs, such as cars, sheep and trees.
“Little did I know at that point how quickly I would get pulled into the frenzy of research and development around Kinect, and how this blue-skies research could be applied to such a practical problem,” Shotton recalled.
Enabling tools
At the point that Shotton was invited, Microsoft had already developed a few enabling tools.
Depth-sensing camera. The new Kinect camera worked at 320×240 pixels and 30 frames per second, versus other depth cameras of the time at very low resolutions of around 10×10 pixels. “You could even make out the nose and eyes on your face,” Shotton observed. The better depth accuracy helped with human pose estimation: objects in the background could be eliminated simply because they were further away, and the colour and texture of clothing, skin and hair could be normalised away. The depth camera was “active”, illuminating the subject with its own structured dot pattern of infra-red light, so that the camera worked even in the dark.
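To see why depth data makes background elimination so easy, here is a minimal sketch (not Kinect's actual pipeline, and with invented numbers): every pixel carries a distance, so anything beyond a chosen cutoff is simply discarded, and colour and texture never enter the computation.

```python
# Hedged sketch: background removal from a depth image by thresholding.
# Depth values are in millimetres; 0 marks "no reading". The frame and
# cutoff below are illustrative, not actual Kinect data.

def segment_foreground(depth, max_distance_mm):
    """Return a binary mask: True where a pixel is closer than the cutoff."""
    return [
        [0 < d <= max_distance_mm for d in row]
        for row in depth
    ]

# A toy 4x4 depth frame: a "person" at ~1.5 m in front of a wall at ~3 m.
frame = [
    [3000, 3000, 3000, 3000],
    [3000, 1500, 1520, 3000],
    [3000, 1480, 1510, 3000],
    [3000,    0, 3000, 3000],   # 0 = sensor returned no depth reading
]

mask = segment_foreground(frame, max_distance_mm=2000)
print(mask[1])  # → [False, True, True, False]
```

The wall vanishes from the mask with a single comparison per pixel, which is the property Shotton describes: background objects are "further away" and can be dropped without any appearance modelling.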
Prototype human tracking algorithm. The algorithm constantly compares its predictions of the body’s movements with the actual movements and then makes adjustments to improve the accuracy of its predictions.
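The predict/compare/adjust loop described above can be illustrated with a simple alpha-beta filter on a single joint coordinate. This is a generic stand-in chosen for brevity, not the actual Microsoft tracking algorithm, and the gain values are invented:

```python
# Hedged sketch of a predict/compare/adjust tracking loop using an
# alpha-beta filter on one coordinate. Illustrative only; NOT the
# prototype Kinect tracker.

def alpha_beta_track(measurements, dt=1/30, alpha=0.85, beta=0.005):
    """Track one coordinate: predict from current velocity, compare the
    prediction with the measurement, then adjust state by the residual."""
    x, v = measurements[0], 0.0          # state: position and velocity
    estimates = [x]
    for z in measurements[1:]:
        # 1. Predict where the joint should be, given current velocity.
        x_pred = x + v * dt
        # 2. Compare the prediction with the actual observation.
        residual = z - x_pred
        # 3. Adjust position and velocity to shrink future errors.
        x = x_pred + alpha * residual
        v = v + (beta / dt) * residual
        estimates.append(x)
    return estimates

# A hand moving steadily to the right, observed with measurement noise.
noisy = [0.0, 0.11, 0.19, 0.32, 0.41, 0.48, 0.61]
print([round(e, 2) for e in alpha_beta_track(noisy)])
```

The key property mirrors the description: each frame the filter's prediction is checked against reality and the correction feeds back into the model, so steady motion is tracked smoothly; erratic motion produces large residuals, which is exactly where such trackers break down, as the next section explains.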
Showstoppers
The tracking algorithm suffered from three limitations. First, the subject had to stand in a T-pose for the algorithm to lock on initially. Second, if the subject moved too erratically, and therefore unpredictably, the algorithm would lose track and could not recover until the subject returned to the T-pose for recalibration. This could happen as often as every 5-10 seconds. Finally, the algorithm only worked with the limited number of body sizes and shapes that it had been trained with. Shotton’s mission was to overcome these showstoppers.
Overcoming the limitations
To allow the algorithm to recognise a subject and its posture without starting from a T-pose, Shotton leveraged a technique called “chamfer matching” from a fellow researcher, Dr Stenger: the subject’s image was compared against a training database of body images, and once the closest match was found, the 3D data for that match could be used as the subject’s pose.
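The essence of chamfer matching can be sketched on toy binary edge maps: precompute, for the observed image, the distance from every pixel to the nearest edge, then score each database template by the average of those distances at its own edge pixels. This is a minimal illustration of the general technique, not the production system, and the tiny grids are invented:

```python
# Hedged sketch of chamfer matching on toy binary edge maps. The real
# system matched depth silhouettes against a large pose database; here
# we score two tiny hand-made "templates" against one observation.

def distance_transform(edges):
    """Two-pass Manhattan distance transform of a binary edge grid."""
    h, w = len(edges), len(edges[0])
    INF = h + w
    d = [[0 if edges[y][x] else INF for x in range(w)] for y in range(h)]
    for y in range(h):                     # forward pass (top-left)
        for x in range(w):
            if y > 0: d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0: d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    for y in range(h - 1, -1, -1):         # backward pass (bottom-right)
        for x in range(w - 1, -1, -1):
            if y < h - 1: d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < w - 1: d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d

def chamfer_score(template, observed):
    """Mean distance from each template edge pixel to the nearest
    observed edge pixel; lower means a better match."""
    d = distance_transform(observed)
    pts = [(y, x) for y, row in enumerate(template)
           for x, e in enumerate(row) if e]
    return sum(d[y][x] for y, x in pts) / len(pts)

# Observation: a vertical edge in column 1 of a 4x4 grid.
observed = [[0, 1, 0, 0]] * 4
upright  = [[0, 1, 0, 0]] * 4          # template matching the pose
leaning  = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

best = min([upright, leaning], key=lambda t: chamfer_score(t, observed))
print(best is upright)  # → True
```

Because the distance transform is computed once per observed frame, scoring each additional database template is cheap, which is what makes searching a large pose database feasible.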
However, there was an astronomical number of possible human poses, arising from the different combinations of position and orientation of body parts such as the arms, legs, knees and ankles. Shotton therefore divided the body into 31 parts, so that each part could be matched independently before the skeleton and body pose were built up from the positions of those parts. This was where his PhD work on object recognition came in handy.
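The "parts first, skeleton second" idea can be sketched as follows: given a per-pixel body-part labelling (31 parts in the real system; only 3 in this toy), each part is located independently, here via a simple centroid, and those positions then serve as joint proposals for assembling the skeleton. The part names and grid are invented for illustration:

```python
# Hedged sketch: locating each labelled body part independently, then
# treating the part positions as skeleton joint proposals. Toy data;
# the real system used 31 parts on full depth images.

def part_centroids(labels):
    """Mean (row, col) position of every labelled part; 0 = background."""
    sums = {}
    for y, row in enumerate(labels):
        for x, part in enumerate(row):
            if part != 0:
                sy, sx, n = sums.get(part, (0, 0, 0))
                sums[part] = (sy + y, sx + x, n + 1)
    return {part: (sy / n, sx / n) for part, (sy, sx, n) in sums.items()}

# Toy 4x4 label image: 1 = head, 2 = torso, 3 = left hand (invented).
labels = [
    [0, 1, 1, 0],
    [0, 2, 2, 3],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
]
print(part_centroids(labels))
# → {1: (0.0, 1.5), 2: (1.5, 1.5), 3: (1.0, 3.0)}
```

The payoff is combinatorial: instead of needing a database entry for every whole-body pose, each part is found on its own, so the training data only has to cover the appearance of individual parts, which is why the database shrinks so dramatically, as the next paragraph notes.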
Although this substantially reduced the size of the image database needed to train the algorithm, the training database was still huge. The team had recorded hours of footage at a motion capture studio with several actors doing “gaming” moves such as dancing, running, fighting and driving.
Training the algorithm on the millions of images would have taken months. The team got help from colleagues at Microsoft Research in Silicon Valley, who had developed an engine called “Dryad” for efficient and reliable distributed computation. Using a cluster of 100 powerful computers, the training time was reduced to less than a day.
Read the details of Shotton’s experience in the full original article here.