LATEST VERSION 2.7 (16.03.2015)



Human body posture detection is an important problem that has recently been tackled using various approaches. The most common ones are based either on depth map generation or on classification of human body parts in a camera image. The rapid growth of entertainment technologies related to body posture recognition has resulted in the availability of cheap and reliable 3D sensors such as Microsoft Kinect, Asus Xtion Pro Live [ Sensor/Xtion_ PRO_LIVE/], and SoftKinetic DepthSense 311. Most of these solutions are based on the structured light method, while some others use an alternative technique based on time-of-flight. Currently, Kinect is the most frequently used sensor in interactive robotics research projects.

The first generation of Kinect is a low-cost device for 3D measurement that provides both a 2D color image and a depth map. The RGB camera provides VGA (640x480 px) resolution, while the depth sensor's resolution is limited to 300x200 px; the depth image is, however, interpolated inside the device to VGA size. The depth sensor provides accurate data on objects that are 0.4–6.5 m away. Moreover, Kinect is equipped with a 4-microphone array, a 3-axis accelerometer, and a motor to control the tilt angle of the sensor head. Communication with the sensor uses the popular USB interface. The depth mapping technology used in Kinect comes from PrimeSense. The Kinect for Windows software development kit (SDK) enables developers to use C++, C#, or Visual Basic to create applications that support gesture and voice recognition by using the Kinect for Windows sensor and a computer or an embedded device.


The Kinect for Windows SDK and toolkit contain drivers, tools, APIs, device interfaces, and code samples to simplify development of applications for commercial deployment. The Kinect for Windows SDK provides skeletal tracking, facial tracking, and gesture recognition. Voice recognition adds a further input modality, and Kinect Fusion reconstructs sensor data into three-dimensional (3D) models. The latest update to the Kinect for Windows SDK adds Kinect Interactions, which recognize natural gestures such as “grip” and “push”. The Kinect for Windows device enables the sensor’s camera to see objects as close as 40 centimeters in front of the device without losing accuracy or precision, with graceful degradation out to 3 meters.


More information about Kinect SDK can be found here:

1. Kinect SDK overview

2. Kinect SDK documentation

3. Abhijit Jana, Kinect for Windows SDK Programming Guide, Packt Publishing (December 26, 2012), ISBN-10: 1849692386

4. David Catuhe, Programming with the Kinect for Windows Software Development Kit, Microsoft Press (October 3, 2012), ISBN-10: 0735666814

See UKinect in action


Skeletal Tracking allows Kinect to recognize people and follow their actions. Using the infrared (IR) camera, Kinect can recognize up to six users in the field of view of the sensor. Of these, up to two users can be tracked in detail. An application can locate the joints of the tracked users in space and track their movements over time.

Figure 1. Kinect can recognize six people and track two


Skeletal Tracking is optimized to recognize users standing or sitting, and facing the Kinect; sideways poses provide some challenges regarding the part of the user that is not visible to the sensor. To be recognized, users simply need to be in front of the sensor, making sure the sensor can see their head and upper body; no specific pose or calibration action needs to be taken for a user to be tracked.

Figure 2. Skeleton tracking is designed to recognize users facing the sensor



Some versions of the sensor can also be used with seated mode. The seated tracking mode is designed to track people who are seated on a chair or couch, or whose lower body is not entirely visible to the sensor. The default tracking mode, in contrast, is optimized to recognize and track people who are standing and fully visible to the sensor.



Bone Hierarchy

There is a defined hierarchy of bones based on the joints defined by the skeletal tracking system. The hierarchy has the Hip Center joint as the root and extends to the feet, head, and hands:

Figure 1. Joint Hierarchy


Bones are specified by the parent and child joints that enclose the bone. For example, the Hip Left bone is enclosed by the Hip Center joint (parent) and the Hip Left joint (child).


Bone hierarchy refers to the ordering of the bones defined by the surrounding joints; bones are not explicitly defined as structures in the APIs. Bone rotation is stored in a bone’s child joint. For example, the rotation of the left hip bone is stored in the Hip Left joint.

Absolute User Orientation

In the hierarchical definition, the rotation of the Hip Center joint provides the absolute orientation of the user in camera space coordinates. This assumes that the user object space has the origin at the Hip Center joint, the y-axis is upright, the x-axis is to the left, and the z-axis faces the camera.

Figure 3. Absolute user orientation is rooted at the Hip Center joint


To calculate the absolute orientation of each bone, multiply the rotation matrix of the bone by the rotation matrices of the parents (up to the root joint).
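
The composition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the joint names, the three-entry `PARENT` map, and the multiplication order (root first, bone last) are hypothetical choices made for the example, not values read from the Kinect SDK.

```python
# Sketch: turn hierarchical (parent-relative) rotations into an absolute one.
# PARENT is a hypothetical excerpt of the joint hierarchy (one leg only).
PARENT = {
    "HipCenter": None,        # root of the hierarchy
    "HipLeft": "HipCenter",
    "KneeLeft": "HipLeft",
}

IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ROT_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90 degrees about the z-axis


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def absolute_rotation(joint, local):
    """Return the absolute rotation of the bone stored in `joint`.

    `local` maps a joint name to its 3x3 rotation relative to the parent.
    Assumed convention: absolute = R_root * ... * R_parent * R_joint.
    """
    result = local[joint]
    parent = PARENT[joint]
    while parent is not None:
        # Pre-multiply by each ancestor until the root has been applied.
        result = matmul(local[parent], result)
        parent = PARENT[parent]
    return result


local_rotations = {
    "HipCenter": ROT_Z_90,
    "HipLeft": IDENTITY,
    "KneeLeft": ROT_Z_90,
}
knee_absolute = absolute_rotation("KneeLeft", local_rotations)
```

With these sample inputs the two 90-degree rotations about z combine into a 180-degree rotation for the knee bone.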

Face Tracking 

This section provides details on the output of the Face Tracking engine. This output contains the following information about a tracked user:

  • Tracking status
  • 2D points
  • 3D head pose
  • AUs

2D Mesh and Points

The Face Tracking SDK tracks the 87 2D points indicated in the following image (in addition to 13 points that aren’t shown in Figure 2 - Tracked Points):

Figure 2. Tracked Points


These points are returned in an array, and are defined in the coordinate space of the RGB image (in 640 x 480 resolution) returned from the Kinect sensor. The additional 13 points (which are not shown in the figure) include:

  • The center of the eye, the corners of the mouth, and the center of the nose
  • A bounding box around the head

3D Head Pose

The X, Y, and Z positions of the user’s head are reported in a right-handed coordinate system (with the origin at the sensor, Z pointing towards the user and Y pointing up; this is the same as the Kinect’s skeleton coordinate frame). Translations are in meters. The user’s head pose is captured by three angles: pitch, roll, and yaw.

Figure 3. Head Pose Angles


The angles are expressed in degrees, with values ranging from -180 degrees to +180 degrees.


Pitch angle

0 = neutral

-90 = looking down towards the floor

+90 = looking up towards the ceiling

Face Tracking tracks when the user’s head pitch is less than 20 degrees, but works best when less than 10 degrees.

Roll angle

0 = neutral

-90 = horizontal parallel with right shoulder of subject

+90 = horizontal parallel with left shoulder of the subject

Face Tracking tracks when the user’s head roll is less than 90 degrees, but works best when less than 45 degrees.

Yaw angle

0 = neutral

-90 = turned towards the right shoulder of the subject

+90 = turned towards the left shoulder of the subject

Face Tracking tracks when the user’s head yaw is less than 45 degrees, but works best when less than 30 degrees.
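
The tracking limits listed above can be collected into a small check. The following Python sketch is illustrative only: the function name and the labels "best", "tracked", and "lost" are hypothetical, and it assumes that "less than N degrees" refers to the absolute value of each angle.

```python
# Limits in degrees taken from the text: (tracked below, works best below).
POSE_LIMITS = {
    "pitch": (20.0, 10.0),
    "roll": (90.0, 45.0),
    "yaw": (45.0, 30.0),
}


def pose_quality(pitch, roll, yaw):
    """Classify a head pose (degrees) as 'best', 'tracked', or 'lost'."""
    angles = {"pitch": pitch, "roll": roll, "yaw": yaw}
    quality = "best"
    for name, value in angles.items():
        tracked_below, best_below = POSE_LIMITS[name]
        if abs(value) >= tracked_below:
            return "lost"          # outside the range where tracking works
        if abs(value) >= best_below:
            quality = "tracked"    # tracked, but outside the optimal range
    return quality
```

A caller could use such a check to decide, for example, whether the reported head pose is reliable enough to drive a robot's head.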

Animation Units

The Face Tracking SDK results are also expressed in terms of weights of six AUs and 11 SUs, which are a subset of what is defined in the Candide-3 model. The SUs estimate the particular shape of the user’s head: the neutral position of their mouth, brows, eyes, and so on. The AUs are deltas from the neutral shape that you can use to morph targets on animated avatar models so that the avatar acts as the tracked user does. The Face Tracking SDK tracks the following AUs. Each AU is expressed as a numeric weight varying between -1 and +1.

Neutral Face (all AUs 0)

AU0 – Upper Lip Raiser (in Candide-3 this is AU10)

  • 0 = neutral, covering teeth
  • 1 = showing teeth fully
  • -1 = maximal possible pushed down lip

AU1 – Jaw Lowerer (in Candide-3 this is AU26/27)

  • 0 = closed
  • 1 = fully open
  • -1 = closed, like 0

AU2 – Lip Stretcher (in Candide-3 this is AU20)

  • 0 = neutral
  • 1 = fully stretched (joker’s smile)
  • -0.5 = rounded (pout)
  • -1 = fully rounded (kissing mouth)

AU3 – Brow Lowerer (in Candide-3 this is AU4)

  • 0 = neutral
  • -1 = raised almost all the way
  • 1 = fully lowered (to the limit of the eyes)

AU4 – Lip Corner Depressor (in Candide-3 this is AU13/15)

  • 0 = neutral
  • -1 = very happy smile
  • 1 = very sad frown

AU5 – Outer Brow Raiser (in Candide-3 this is AU2)

  • 0 = neutral
  • -1 = fully lowered as a very sad face
  • 1 = raised as in an expression of deep surprise
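
As a rough illustration of how the AU weights above might be prepared for an avatar, the Python sketch below clamps each weight to the documented range of -1 to +1 and labels it with its AU name. The function names and the dictionary layout are hypothetical; they are not part of the Face Tracking SDK.

```python
# AU names in index order (AU0..AU5), as listed above.
AU_NAMES = [
    "Upper Lip Raiser",       # AU0
    "Jaw Lowerer",            # AU1
    "Lip Stretcher",          # AU2
    "Brow Lowerer",           # AU3
    "Lip Corner Depressor",   # AU4
    "Outer Brow Raiser",      # AU5
]


def clamp(value, low=-1.0, high=1.0):
    """Limit a weight to the documented AU range."""
    return max(low, min(high, value))


def au_to_morph_weights(au):
    """Map six AU weights to a dict of AU name -> clamped weight."""
    if len(au) != len(AU_NAMES):
        raise ValueError("expected six AU weights")
    return {name: clamp(weight) for name, weight in zip(AU_NAMES, au)}


def jaw_opening(au):
    """Jaw opening in [0, 1]; negative AU1 values are treated as closed."""
    return max(0.0, clamp(au[1]))
```

The `jaw_opening` helper reflects the note for AU1 that -1 means closed, the same as 0.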

Shape Units

The Face Tracking SDK tracks the following 11 SUs. They are discussed here because of their logical relation to the Candide-3 model. Each SU specifies the vertices it affects and the displacement (x, y, z) per affected vertex.

SU name                        SU number in Candide-3
Head height                    0
Eyebrows vertical position     1
Eyes vertical position         2
Eyes, width                    3
Eyes, height                   4
Eye separation distance        5
Nose vertical position         8
Mouth vertical position        10
Mouth width                    11
Eyes vertical difference       n/a
Chin width                     n/a

In addition to the SUs defined in the Candide-3 model, face tracking supports the following:

  • Eyes vertical difference
  • Chin width

Face tracking does not support the following:

  • Cheeks z (6)
  • Nose z-extension (7)
  • Nose pointing up (9)


KinectInteraction is a term referring to the set of features that allow Kinect-enabled applications to incorporate gesture-based interactivity. KinectInteraction provides the following high-level features:

  • Identification of up to 2 users, and identification and tracking of their primary interaction hand.
  • Detection services for the user's hand location and state.
  • Grip and grip release detection.
  • Press detection.
  • Information on the control targeted by the user.


The Kinect sensor includes a four-element, linear microphone array, shown here in purple.


The microphone array captures audio data at a 24-bit resolution, which allows accuracy across a wide dynamic range of voice data, from normal speech at three or more meters to a person yelling. The microphone array makes it possible to determine the direction of an audio source. The sound from a particular audio source arrives at each microphone in the array at a slightly different time. By comparing the audio signals of the four microphones, the Kinect SDK can provide your application with information about the sound source angle. Beamforming allows you to use the Kinect microphone array as if it were a directional microphone. The beamforming functionality supports 11 fixed beams, which range from −50 to +50 degrees in 10 degree increments. Applications can use the adaptive beamforming option, which automatically selects the optimal beam, or specify a particular beam.
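
To make the fixed-beam layout concrete, the Python sketch below lists the 11 beam angles and picks the one nearest to a given sound source angle. This is a simple nearest-angle rule chosen for illustration; it is not a description of how the SDK's adaptive beamforming selects a beam, and the function name is hypothetical.

```python
# The 11 fixed beams described above: -50, -40, ..., +50 degrees.
FIXED_BEAMS = list(range(-50, 51, 10))


def nearest_beam(source_angle):
    """Return the fixed beam angle closest to `source_angle` (degrees)."""
    # Angles outside the supported range are clamped to the outermost beams.
    clamped = max(-50.0, min(50.0, source_angle))
    return min(FIXED_BEAMS, key=lambda beam: abs(beam - clamped))
```

An application that specifies a particular beam manually could use a rule like this to follow the reported sound source angle.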


Speech recognition is one of the key functionalities of the Kinect SDK. The Kinect sensor’s microphone array is an excellent input device for speech recognition-based applications. It provides better sound quality than a comparable single microphone and is much more convenient to use than a headset. Applications can use the Kinect microphone with the Microsoft Speech API, which supports the latest acoustical algorithms (AEC, AES, NS, AGC,...). Kinect for Windows SDK includes a custom acoustical model that is optimized for the Kinect's microphone array.

Context-free grammars:

The CFG format in Microsoft.Speech API defines the structure of grammars and grammar rules using Extensible Markup Language (XML). The application can dynamically update an already loaded Microsoft.Speech API XML grammar. Example XML grammar files can be found here:

Hardware requirements

  • Kinect XBOX 360 or Kinect for Windows (recommended),
  • 32 bit (x86) or 64 bit (x64) processor,
  • Dual-core 2.66GHz or faster processor,
  • Dedicated USB 2.0 bus,
  • 2 GB RAM.

Software requirements

If only running the module:
  • Windows 7, Windows 8, Windows Embedded Standard 7, or Windows Embedded POSReady 7
  • Microsoft Kinect Runtime (tested with 1.8) or full Microsoft Kinect SDK if you are using Kinect XBOX360,
  • If your Windows 7 edition is Windows 7 N or Windows 7 KN, you must install the Media Feature Pack, which is required by the Kinect for Windows runtime.
In order to compile the module, additional libraries are required:
  • Microsoft Kinect SDK (tested with 1.8),
  • Microsoft Kinect Developer Toolkit (tested with 1.8),
  • Microsoft Speech Platform SDK (tested with 11.0),
  • OpenCV, best used with Intel® Threading Building Blocks (Intel® TBB).
UKinect was compiled with the following shared libraries (all included in the package). Copy them to the uobjects folder or add their location to the PATH system environment variable.
  • FaceTrackData.dll (x86 ver.)
  • FaceTrackLib.dll (x86 ver.)
  • Kinect10.dll (x86 ver.)
  • KinectInteraction180_32.dll
  • opencv_core231.dll
  • opencv_imgproc231.dll
  • tbb.dll

Module functions

Main functions

Attention! Some options are not supported on Kinect XBOX360 (color camera settings, 1280x960 resolution, near mode, ...).

UKinect.Open(kinect_number, color, depth, skeleton, face, interaction, audio, speech) - open the connection to the sensor with the given device number (0,1,2,...) and initialize the streams selected by the boolean flags,
face = true is the same as face = color = depth = true,
interaction = true is the same as interaction = depth = skeleton = true,
speech = true is the same as speech = audio = true,
Using face tracking together with the interaction stream requires a high-performance PC. Otherwise, pause face tracking with UKinect.faceTrackingPause during interaction.
UKinect.Close() - close connection and release device,
UKinect.PollVideo(wait) - poll all initialized video streams,
true - wait for new data from video streams,
false - check for new data, always returns immediately; it is recommended to call sleep (10-20ms) in the main urbiscript loop,
UKinect.PollAudio(wait) - poll all initialized audio streams,
true - wait for new data from audio streams,
false - check for new data, always returns immediately; it is recommended to call sleep (10-20ms) in the main urbiscript loop.
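
The way the UKinect.Open flags imply one another can be summarized with a short Python sketch. It only models the three rules stated above (face, interaction, speech); the function name and the returned dictionary are hypothetical and do not correspond to anything in the module itself.

```python
def resolve_open_flags(color=False, depth=False, skeleton=False,
                       face=False, interaction=False, audio=False,
                       speech=False):
    """Return the effective stream flags after applying the implications."""
    if face:
        color = depth = True       # face tracking needs color and depth
    if interaction:
        depth = skeleton = True    # interaction needs depth and skeleton
    if speech:
        audio = True               # speech recognition needs audio
    return {
        "color": color,
        "depth": depth,
        "skeleton": skeleton,
        "face": face,
        "interaction": interaction,
        "audio": audio,
        "speech": speech,
    }
```

For instance, requesting only face tracking also enables the color and depth streams, so they do not have to be set explicitly.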

Color camera section

UKinect.colorEnabled - true if color camera has been started,
UKinect.colorImage - get camera image,
UKinect.colorResolution - set color camera resolution, default is 2,
2 - 640x480@30fps,
3 - 1280x960@12fps,
UKinect.colorWidth - image width,
UKinect.colorHeight - image height,
UKinect.colorAutoExposure - set auto exposure, default is 1,
0 - OFF,
1 - ON,
UKinect.colorBrightness - set color camera brightness [0..1],
UKinect.colorExposureTime - set camera exposure time (auto exposure must be OFF) [1..4000],
UKinect.colorGain - set camera gain (auto exposure must be OFF) [1..16],
UKinect.colorPowerLineFrequency - set light filter, default is 0,
0 - none,
1 - 50Hz,
2 - 60Hz,
UKinect.colorBacklightCompensationMode - set backlight compensation mode,
0 - average brightness,
1 - center only,
2 - center priority,
4 - low lights priority,
UKinect.colorAutoWhiteBalance - determines if automatic white balance is enabled, default is 1,
0 - OFF,
1 - ON,
UKinect.colorWhiteBalance - set white balance (auto white balance must be OFF) [2700..6500],
UKinect.colorContrast - set contrast [0.5..2],
UKinect.colorHue - set hue [-22..22],
UKinect.colorSaturation - set saturation [0..2],
UKinect.colorGamma - set gamma [1..2.8],
UKinect.colorSharpness - set sharpness [0..1],
UKinect.colorResetSettings() - resets the color camera settings to default values.

Depth camera section

UKinect.depthEnabled - true if depth camera has been started,
UKinect.depthVisualization - enable/disable depth visualization (depthImage), default value is true,
UKinect.depthImage - get depth image (if depthVisualization is set to true),
UKinect.depthResolution - set depth camera resolution, default is 2,
0 - 80x60@30fps,
1 - 320x240@30fps,
2 - 640x480@30fps,
UKinect.depthWidth - image width,
UKinect.depthHeight - image height,
UKinect.depthNearMode - set near mode, default is 0,
0 - OFF,
1 - ON,
UKinect.depthEmitterOff - turn off IR emitter, default is 0,
0 - ON (emitting IR light),
1 - OFF.

Skeleton stream section

UKinect.skeletonEnabled - true if skeleton stream has been started,
UKinect.skeletonVisualization - enable/disable skeleton visualization (skeletonImage), default value is true,
UKinect.skeletonImage - get image with the user's skeleton drawn in (if skeletonVisualization is set to true),
UKinect.skeletonVisualizationOnColor - draw skeleton on color image or black background, default is 1,
0 - draw skeleton on a black background,
1 - draw skeleton on a copy of color image,
UKinect.skeletonTrackingMode - set tracking mode, default is 0,
0 - full body,
1 - upper body (seated mode),
UKinect.skeletonChooserMode - set skeleton chooser mode, default is 0,
0 - default (new skeleton gives new tracking candidate),
1 - track the closest skeleton,
2 - track two closest skeletons,
3 - track one skeleton and keep it,
4 - track two skeletons and keep them,
5 - track the most active skeleton,
6 - track two most active skeletons,
UKinect.skeletonFilter - changes parameters for smoothing skeleton data using a mathematical transform, default is 1,
0 - smoothing filter is off,
1 - some smoothing with little latency, only filters out small jitters, good for gesture recognition,
2 - smoothed with some latency, filters out medium jitters, good for a menu system that needs to be smooth but doesn't need the reduced latency as much as gesture recognition does,
3 - very smooth, but with a lot of latency, filters out large jitters, good for situations where smooth data is absolutely required and latency is not an issue,
[ID0, ID1,...,ID5] = UKinect.skeletonIDs - get a list of tracked IDs (max 6),
[ID0, ID1] = UKinect.skeletonTrackedIDs - get a list of tracked skeletons IDs (max 2),
[x, y, z] = UKinect.skeletonPosition(ID) - get skeleton absolute position for the given ID,
[x, y, depth] = UKinect.skeletonPositionOnImage(ID) - get skeleton center position for the given ID on color image,
[x, y, z] = UKinect.skeletonJointPosition(ID, joint) - get joint position,
[x, y, depth] = UKinect.skeletonJointPositionOnImage(ID, joint) - get joint position on color image.
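
Joint positions returned as [x, y, z] triples lend themselves to simple geometric checks. The Python sketch below computes the Euclidean distance between two such triples, for example a hand and the head. It is an illustration only: the function name is hypothetical, and the result is in whatever unit the positions are reported in, which this sketch does not assume.

```python
import math


def joint_distance(a, b):
    """Euclidean distance between two [x, y, z] joint positions."""
    return math.sqrt(sum((pa - pb) ** 2 for pa, pb in zip(a, b)))
```

A gesture such as "hand raised to the head" could then be approximated by comparing this distance against a threshold chosen by the application.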

Face tracker section

UKinect.faceEnabled - true if face stream has been started,
UKinect.faceVisualization - enable/disable face visualization (faceImage), default value is true,
UKinect.faceImage - get image with tracked face (if faceVisualization is set to true),
UKinect.faceTrackingPause - pause face tracking,
UKinect.faceVisualizationOnColor - draw face visualization on color image or black background, default is 1,
0 - draw face on a black background,
1 - draw face on a copy of color image,
UKinect.faceVisualizeMode - set visualization mode,
0 - Candide3 model,
1 - characteristic face contour,
UKinect.faceIsTracking - get tracking flag,
0 - not tracked,
1 - tracked,
[x, y, z, pitch, yaw, roll] = UKinect.facePosition() - get face position (translation and rotation),
[x, y, width, height] = UKinect.facePositionOnImage() - get position on color image in pixels,
[[x0, y0],[ x1, y1],...,[x86, y86]] = UKinect.facePointsOnImage() - get the coordinates of the 87 tracked points (see above),
[AU0, AU1,..., AU5] = UKinect.faceAU() - get six animation units (see above),
[SU0, SU1,..., SU10] = UKinect.faceSU() - get 11 shape units (see above),
UKinect.faceEmotion - get the detected expression: neutral or one of the five basic emotions (joy, angry, surprise, sad, fear); available after faceAU() has been called.

Interaction stream section

UKinect.interEnabled - true if interaction stream has been started,
UKinect.interVisualization - enable/disable interaction visualization (interImage), default value is true,
UKinect.interImage - get image with interaction visualization (if interVisualization is set to true),
UKinect.interVisualizationOnColor - draw interaction visualization on color image or black background, default is 1,
0 - draw on black background,
1 - draw on a copy of color image,
UKinect.interID - get tracked interaction skeleton ID,
UKinect.interLeftTracked - hand is tracked,
UKinect.interLeftActive - hand is active,
UKinect.interLeftInteractive - hand is in the interactive zone and is actively being monitored for interaction,
UKinect.interLeftPressed - hand is in a pressed state,
UKinect.interLeftEvent - get event status,
0 - none,
1 - grip event,
2 - grip release,
UKinect.interLeftX - get the X coordinate of the hand pointer relative to the UI,
UKinect.interLeftY - get the Y coordinate of the hand pointer relative to the UI,
UKinect.interLeftRawX - get the unadjusted horizontal position of the hand. There are no units associated with this value,
UKinect.interLeftRawY - get the unadjusted vertical position of the hand. There are no units associated with this value,
UKinect.interLeftRawZ - get the unadjusted extension of the hand. Values range from 0.0 to 1.0, where 0.0 represents the hand being near the shoulder, and 1.0 represents the hand being fully extended. There are no units associated with this value,
UKinect.interLeftPress - get the progress toward a press action relative to the UI,
UKinect.interRightTracked - same as left hand,
UKinect.interRightActive - same as left hand,
UKinect.interRightInteractive - same as left hand,
UKinect.interRightPressed - same as left hand,
UKinect.interRightEvent - same as left hand,
UKinect.interRightX - same as left hand,
UKinect.interRightY - same as left hand,
UKinect.interRightRawX - same as left hand,
UKinect.interRightRawY - same as left hand,
UKinect.interRightRawZ - same as left hand,
UKinect.interRightPress - same as left hand.

Audio stream section

UKinect.audioEnabled - true if audio stream has been started,
UKinect.audioPause(bool) - pause audio stream (also stops speech recognition), default is disabled,
UKinect.audioRecordStart("fileName.wav") - start recording to file,
UKinect.audioRecordStop() - stop recording to file,
UKinect.audioBeamAngle - get beam angle [-50..50],
UKinect.audioSourceAngle - get sound source angle [-50..50],
UKinect.audioSourceConfidence - get sound source confidence [0..1] higher=better,
UKinect.audioEchoCancellation - set echo cancellation effect (AEC), default is 0,
0 - OFF,
1 - ON,
UKinect.audioEchoSuppresion - set echo suppression effect (AES), default is 0,
0 - OFF,
1 - ON,
UKinect.audioNoiseSuppresion - set noise suppression effect (NS), default is 1,
0 - OFF,
1 - ON,
UKinect.audioAutomaticGainControl - set automatic gain control (AGC), default is 0,
0 - OFF,
1 - ON.

Speech stream section

UKinect.speechRecognizer - set recognizer (recognition language); must be set before UKinect.Open(...) is called,
UKinect.speechAvailableRecognizers - get all available system recognizers,
UKinect.speechEnabled - true if speech stream has been started,
UKinect.speechResult - get recognized phrase,
UKinect.speechResultTag - get the tag associated with the recognized phrase,
UKinect.speechConfidence - confidence value for the recognized phrase computed by the SR engine,
UKinect.speechConfidenceThreshold - confidence threshold value,
UKinect.speechPause - pause speech stream, default is disabled,
UKinect.speechIsListening - returns true if SR is listening to an audio source,
UKinect.speechLoadGrammar("fileName.grxml") - load a new grammar xml file,
UKinect.speechResetGrammar() - reset all loaded grammar rules,
UKinect.speechAddPhrase("rule","phrase") - add a phrase to the SR engine.

Other functionality

UKinect.tilt - set device absolute tilt angle [-27..27],
[x, y, z] = UKinect.accelerometer() - get accelerometer values,
UKinect.fps - get computed fps performance,
UKinect.time - get time performance.


        joint   name

Urbiscript examples

Example 1

var Global.Kinect=;
t: loop {
t: loop {

Example 2

var Global.Kinect=;
Kinect.speechAddPhrase("","my name is John");
Kinect.speechAddPhrase("","please start game");
Kinect.speechAddPhrase("","stop game");
t: loop {

Example 3

g:  robot.body.neck.head.ActAlive(6,3,10,3,10,1),
  var pos;
  loop {if (Kinect.faceIsTracking) pos = Kinect.facePosition;},
  at (Kinect.speechResultTag=="LOOK"){

Example 4

g:  robot.body.neck.head.ActAlive(6,3,10,3,10,1),
  at (Kinect.speechResultTag=="LOOK"){
    if (Kinect.faceIsTracking) {
      var pos = Kinect.facePosition;
  at (Kinect.speechResultTag=="FOLLOW"){
    loop {
      if (Kinect.faceIsTracking) {
  var pos = Kinect.facePosition;
  var au = Kinect.faceAU;
  {a_DiscUp.stop| a_DiscUp:     robot.body.neck.head.disc[up].val = (DiscUp + au[5]*100) smooth:1,}&
  {a_DiscDown.stop| a_DiscDown: robot.body.neck.head.disc[down].val = (DiscDown - au[1]*50) smooth:1,}&
  {a_EyeRightBrow.stop| a_EyeRightBrow: robot.body.neck.head.eye[right].brow = (EyeRightBrow - au[4]*160) smooth:1,}&
  {a_EyeLeftBrow.stop|  a_EyeLeftBrow:  robot.body.neck.head.eye[left].brow = (EyeLeftBrow + au[4]*160) smooth:1,};

Example 5

g:  robot.body.neck.head.ActAlive(6,3,10,3,10,2),
  at (Kinect.audioSourceConfidence>0.2){

Example 6

g:  robot.body.neck.head.ActAlive(6,3,10,3,10,2),
  loop {
      while (Kinect.interRightEvent==1)


UObject module LINK

Microsoft Kinect Runtime 1.8 LINK

Microsoft Kinect SDK 1.8 LINK







EMYS and FLASH are Open Source and distributed according to the GPL v2.0 © Rev. 0.8.0, 27.04.2016

FLASH Documentation