If you have any questions about this blog article, feel free to contact me on Twitter: @meulta

I have an awesome job at Microsoft: I get to work on mixing AI, Cloud and Mixed Reality experiences with partners and other companies. This lets me work on cool new stuff and see what companies really want to do. Of all the Microsoft Cognitive Services, Computer Vision is one that brings a lot to Augmented Reality experiences. We've already helped a lot of companies using the Custom Vision online service, which you can access via a REST API. The service tells you what's in an image by providing a list of tags along with their probability. A feature was added more recently that also shows you where these objects are located in the image, thanks to the new Object Detection model with bounding boxes.

The only piece that was missing is offline support. When you don't have any internet connectivity, these services are not very useful anymore. When you detect that the device is offline, you have two options:

  • Disabling the features that require this REST API.
  • Trying to perform object recognition locally, on the device.

We recently announced Windows Machine Learning. It is a service in Windows accessible through a set of APIs when you are building Universal Windows Platform (UWP) applications. This service was released at the end of April 2018 in the brand-new Windows update. The good news is: this Windows update is also available on HoloLens!

This Windows ML feature requires a trained model in the ONNX format. ONNX is an initiative from companies such as AWS, Facebook and Microsoft to create an open format to represent deep learning models. One way to create an ONNX model today is to convert it from one that already exists. Guess what? Custom Vision gives you the feature of exporting a trained model in ONNX (for some of the model types available).

I recently worked with a company to try and mix all of this and be able to run a Custom Vision model on HoloLens.

It worked! Here is how we did it, using a demo project as an example (you can get the full code here: https://aka.ms/mr-winml-code).

Use Custom Vision to create an ONNX model

Custom Vision (customvision.ai) is one of the Cognitive Services: a set of ready-to-use Machine Learning models. The idea is simple: you get the power of Deep Learning and other algorithms without having to go through the challenge of creating and training the models yourself. If you know how to call a REST API, you know how to use Cognitive Services. Some of these services are customizable. In the Vision API, we have a "Custom" version that you can train by uploading pictures and tagging them.
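To give you an idea, here is a minimal sketch of what calling the Custom Vision prediction endpoint could look like from C#. The URL, project ID and key below are placeholders: copy the real Prediction URL and Prediction-Key from your own project in the portal, as the exact values (and API version) depend on your project.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class CustomVisionClient
{
    // Placeholder values: replace with the Prediction URL and key shown in the
    // Custom Vision portal for your own project and iteration.
    private const string PredictionUrl =
        "https://YOUR_REGION.api.cognitive.microsoft.com/customvision/v2.0/Prediction/YOUR_PROJECT_ID/image";
    private const string PredictionKey = "YOUR_PREDICTION_KEY";

    public static async Task<string> ClassifyAsync(byte[] imageBytes)
    {
        using (var client = new HttpClient())
        using (var content = new ByteArrayContent(imageBytes))
        {
            client.DefaultRequestHeaders.Add("Prediction-Key", PredictionKey);
            content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

            // The response is a JSON document listing the predicted tags and their probabilities.
            HttpResponseMessage response = await client.PostAsync(PredictionUrl, content);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}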

Note: We also recently released a preview of the Object Detection model. It brings you a customizable Vision service WITH bounding boxes! So now you not only know what is in the picture, but also where it is. This is very helpful in Augmented Reality apps as it enables a way to display data on top of objects.

There is a very comprehensive guide available here explaining how to create a classifier, so I am not going to go through it in detail. Here are some interesting things for you to consider:

First, pick a compact model. If you want to export to ONNX, the model has to be marked as (compact); otherwise, you won't be able to export it. This makes sense: these models are designed to run offline on devices that do not have a lot of power, so it is a good way to ensure that the exported models are actually usable.

Once you’ve picked the model you need, you just have to train it. In my examples, I am training it using photos of a Rubik’s cube and a Seahawks football (for no particular reason, these were just the first things that I found in my office when writing this sample 😊 ).

You can start with as few as 10 pictures for each tag and the service will surprise you with its ability to recognize them very accurately. Of course, you can refine the model by adding more pictures with different lighting, colors, etc.

Exporting is very easy: go to the Performance tab and click the Export button. You can choose the format you want; in this case, we need ONNX.

That’s it. We now have a model we can use offline!

Generate the Windows ML wrapper

Now that we have a model, we need to write some code that is going to make use of it. You can find everything you need about Windows Machine Learning in the documentation but let’s get through the main actions you need to perform.

When I say that we need to write some code, it is not completely accurate. We will generate a wrapper to use the ONNX model in UWP. The only code we need to write ourselves is the code that gets the image from the webcam and passes it to the wrapper.

The Windows ML SDK comes with a tool named mlgen.exe. This command-line tool helps you generate a wrapper for an ONNX file. It is located where you installed the Windows SDK. On my computer, which has a standard install, here is where it is:

C:\Program Files (x86)\Windows Kits\10\bin\10.0.17125.0\x86\mlgen.exe

The command line to run is straightforward:

mlgen -i INPUT-FILE -l LANGUAGE -n NAMESPACE [-o OUTPUT-FILE]
  • INPUT-FILE is the ONNX file.
  • LANGUAGE is the programming language (CS in this case).
  • NAMESPACE is the namespace used in the CSharp file. This should be something usable in your project, but you can change it later.
  • OUTPUT-FILE is the file that is going to be created (a .cs one, in this case).
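For example, assuming the model was exported to a file named Image_Reco.onnx (the file and namespace names here are just placeholders for illustration), the call could look like this:

mlgen -i Image_Reco.onnx -l CS -n ImageRecognition -o Image_RecoModel.cs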

A generated wrapper contains 3 classes (assuming "ModelName" is the name of your model):

  • ModelName_Input: a structure to hold the input data. In Vision-related models, it is a VideoFrame, which is easy to get from the camera in UWP.
  • ModelName_Output: a structure used by the wrapper to give you the output.
  • ModelName: a class responsible for running the model evaluation. It contains a static constructor and an asynchronous method to evaluate the model prediction.

Here is an example of generated code:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Windows.Media;
using Windows.Storage;
using Windows.AI.MachineLearning.Preview;

namespace MyNamespace
{
    public sealed class MyModelInput
    {
        public VideoFrame data { get; set; }
    }

    public sealed class MyModelOutput
    {
        public IList<string> classLabel { get; set; }
        public IDictionary<string, float> loss { get; set; }

        public MyModelOutput()
        {
            this.classLabel = new List<string>();
            this.loss = new Dictionary<string, float>();
        }
    }

    public sealed class MyModel
    {
        private LearningModelPreview learningModel;

        public static async Task<MyModel> CreateMyModel(StorageFile file)
        {
            LearningModelPreview learningModel = await LearningModelPreview.LoadModelFromStorageFileAsync(file);
            MyModel model = new MyModel();
            model.learningModel = learningModel;
            return model;
        }

        public async Task<MyModelOutput> EvaluateAsync(MyModelInput input)
        {
            MyModelOutput output = new MyModelOutput();
            LearningModelBindingPreview binding = new LearningModelBindingPreview(learningModel);
            binding.Bind("data", input.data);
            binding.Bind("classLabel", output.classLabel);
            binding.Bind("loss", output.loss);
            LearningModelEvaluationResultPreview evalResult = await learningModel.EvaluateAsync(binding, string.Empty);
            return output;
        }
    }
}

Note: When exporting from the Custom Vision service, the name of the model is a generated one. You might have to change it to something that is more human readable.

As you can see in this example, the generation tool created an output structure with 2 parameters:

  • classLabel: the labels with the highest probability.
  • loss: a dictionary with all labels and their respective probability.

The loss dictionary is created empty: it does not contain your labels. If you leave it this way, you will get an exception when running the code. You have to initialize it with the labels you set in the Custom Vision portal. Here is how it looks for mine:

this.loss = new Dictionary<string, float>()
{
    { "football", 0f },
    { "rubikscube", 0f }
};

Awesome, we now have a usable wrapper. Let's see how we can integrate it into a Mixed Reality app.

Integrate in your Mixed Reality app

The best way to start integrating your wrapper and ONNX model in an application is to look at the samples that the team is providing. They will guide you through everything you need: starting the Camera, collecting VideoFrames, sending these to the wrapper and getting the result.

When you are creating a Windows Mixed Reality application and, more specifically, a HoloLens app, you use Unity to setup your scene with 3D objects as needed along with C# scripts. This Unity project is then built (or “exported”) as a Visual Studio solution containing a Universal Windows Platform (UWP) project.

Note: In theory, a Unity project is meant to be exported to different platforms, so we usually try to keep the code as portable as possible. In this specific case, we are going to use APIs that are specific to Windows to run the evaluation of the ONNX model. To make sure this platform-specific code does not generate errors in Unity, we wrap it inside conditional compilation directives. These tell Unity and Visual Studio: "don't try to parse / compile this unless you are building for UWP". If you want to run this code in another environment, you will have to (at least) add another conditional compilation switch with the code specific to that platform.
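In practice, the pattern looks like this (a minimal sketch; the real script wraps larger sections of code, as you will see later):

#if UNITY_WSA && !UNITY_EDITOR
    // Compiled only for UWP builds: Windows ML, MediaCapture, StorageFile, etc.
#else
    // Compiled in the Unity editor and on other platforms: stubs or fallback code.
#endif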

You can get the full code for this sample here: https://aka.ms/mr-winml-code 

In the sample project, there is a Game Object with no graphical representation. It is called ScriptHolder and its only role is to have some scripts attached to run code at specific moments during execution. This object has a script named Scene Startup attached to it. This script contains all the code needed to create the ONNX wrapper, get the VideoFrames and display the result of prediction.

The Start method in the standard Unity MonoBehaviour is called automatically when the object appears in the scene (i.e. when the application is starting). It will do the following (a simplified sketch follows this list):

  • Get a reference to the label where we are going to display results.
  • Create and initialize the MediaCapture object to start collecting frames from the camera.
  • Initialize the wrapper for the ONNX model.
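Here is a simplified sketch of what that Start method looks like; the label field name is illustrative and error handling is omitted, so check SceneStartup.cs in the sample project for the real code:

// Simplified sketch of Start(); field names are illustrative.
public TextMesh statusLabel; // label used to display results, assigned in the Unity inspector

private void Start()
{
    DisplayText("Looking...");

#if UNITY_WSA && !UNITY_EDITOR
    // UWP build: start the camera capture and load the ONNX wrapper.
    CreateMediaCapture();
    InitializeModel();
#else
    // The Windows ML APIs are not available in the Unity editor or on other platforms.
    DisplayText("Does not work in player.");
#endif
}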

Initializing the model is pretty easy: you just load the file from local storage and call the static constructor of the wrapper:

public async void InitializeModel()
{
    StorageFile imageRecoModelFile = await StorageFile.GetFileFromApplicationUriAsync(new Uri($"ms-appx:///Data/StreamingAssets/model.onnx"));
    imageRecoModel = await Image_RecoModel.CreateImage_RecoModel(imageRecoModelFile);
}

To be able to use the ONNX model file in the app, you will need to create a folder in Unity named exactly StreamingAssets and add the .onnx file there. When you generate the UWP project, the file will be added under /Data/StreamingAssets and its Build Action property will be set to Content. This way you can access it using the Storage API.

Initializing the process of getting video frames is straightforward. You can take a look at the code in CreateFrameReader(). It involves some parameter initialization and a call to MediaCapture.CreateFrameReaderAsync.
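The sample project contains the exact implementation, but a simplified sketch of CreateFrameReader could look like the following. The frame source selection below is an assumption (the sample may pick the source and format differently), and it requires using System.Linq and using Windows.Media.Capture.Frames:

public async void CreateFrameReader()
{
    // Pick a color video source from the MediaCapture object initialized earlier.
    MediaFrameSource frameSource = MediaCapture.FrameSources.Values
        .First(source => source.Info.SourceKind == MediaFrameSourceKind.Color);

    // Create the reader, start it, and hand it over to the frame-pulling loop.
    MediaFrameReader frameReader = await MediaCapture.CreateFrameReaderAsync(frameSource);
    await frameReader.StartAsync();

    StartPullFrames(frameReader);
}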

Once this is done, we call the StartPullFrames method which does all the interesting work.

private void StartPullFrames(MediaFrameReader sender)
{
    Task.Run(async () =>
    {
        for (;;)
        {
            var frameReference = sender.TryAcquireLatestFrame();
            var videoFrame = frameReference?.VideoMediaFrame?.GetVideoFrame();

            if (videoFrame == null)
            {
                continue; //ignoring frame
            }

            var input = new Image_RecoModelInput();
            input.data = videoFrame;

            if (videoFrame.Direct3DSurface == null)
            {
                continue; //ignoring frame
            }

            try
            {
                Image_RecoModelOutput prediction = await imageRecoModel.EvaluateAsync(input).ConfigureAwait(false);
                var classWithHighestProb = prediction.classLabel[0];
                if (prediction.loss[classWithHighestProb] > 0.5)
                {
                    DisplayText("I see a " + classWithHighestProb);
                }
                else
                {
                    DisplayText("I see nothing");
                }
            }
            catch
            {
                //Log errors
            }

            await Task.Delay(predictEvery);
        }
    });
}

Note: There are a lot of different ways to do this and keep in mind that this is only an example. You should find out the best way to integrate this into your product, which might be different than this approach.

This method starts an infinite loop on a separate thread. The loop tries to get the latest frame captured by the camera and, after a few checks (is the frame null? is the Direct3DSurface null?), passes it to the EvaluateAsync method of the wrapper.

Note that in this case, we await each evaluation before pulling the next frame (ConfigureAwait(false) simply avoids capturing the synchronization context), so we don't flood the device with a ton of parallel evaluations.

Once we get the result from the model evaluation, we get the name of the class with the highest probability using classLabel[0] and check whether its probability is over 0.5. This threshold is an arbitrary number I picked so that classes detected with too low a probability are ignored.

If you take a look at the wrapper code in Vision.cs, you will notice that the Input and Output classes are the ones that were generated by the command line tool. The only additions I made were initializing the Dictionary with the 2 types of objects available in my custom vision model.

I tried to optimize the EvaluateAsync method by pre-initializing objects in the static constructor and only binding the output once. It improved things a little, but not significantly enough to say it is worth it.

DisplayText("Does not work in player.");
#endif
}
private void DisplayText(string text)
{
textToDisplay = text;
textToDisplayChanged = true;
}
#if UNITY_WSA && !UNITY_EDITOR
public async void InitializeModel()
{
StorageFile imageRecoModelFile = await StorageFile.GetFileFromApplicationUriAsync(new Uri($"ms-appx:///Data/StreamingAssets/model.onnx"));
imageRecoModel = await Image_RecoModel.CreateImage_RecoModel(imageRecoModelFile);
}
public async void CreateMediaCapture()
{
MediaCapture = new MediaCapture();
MediaCaptureInitializationSettings settings = new MediaCaptureInitializationSettings();
settings.StreamingCaptureMode = StreamingCaptureMode.Video;
await MediaCapture.InitializeAsync(settings);
CreateFrameReader();
}

That's it! Only a few lines of code using an out-of-the-box feature from Windows, and you get an offline Custom Vision model running on HoloLens.

Moving forward

Running this on a HoloLens is pretty fun: you look at a football and it says "I see a football"; then you look at a Rubik's cube and it says it is a Rubik's cube. Ok… maybe it is not the most exciting app, but I can tell you that this will help a lot of developers handle offline scenarios! 🙂

Windows Machine Learning is still new and in preview. We can expect it to improve a lot in the future. When running on a machine that supports it, Windows ML uses the GPU for model evaluation. Keep in mind that on HoloLens, it only uses the CPU. This means that you must be very cautious about which model you use on this kind of device. Whether it's a HoloLens, a phone or a tablet, you'll want to test it and make sure it is fast enough for your scenario. A good idea might be to use the Unity Profiler to understand the CPU and GPU usage in your app. Right now, I have no idea if using the GPU will ever be possible on HoloLens for this kind of processing.

Deep Learning and tools from this big AI family are really the next frontier for AR and VR. In the coming years they are going to be the key component to evolve from good apps to magical experiences.

Credits

Huge thanks to Jason Fox, Jared Bienz, Nick Landry and Simon Ferquel for the help on reviewing this article and the sample code.

If you have any questions about this blog article, feel free to contact me on Twitter: @meulta