Virtual Reality: Creating Immersion (2012): rpdillon.net

Virtual Reality: Creating Immersion (2012)

2012-08-30

The purpose of my last post was to explain one reason why, in 2012, virtual reality is more than a pipe dream. It was meant to preempt a reaction of "Well, they've been talking about 3D movies and photos for 20 years also, and we have only made modest progress on that front." As a coworker said when he heard I was excited about virtual reality: "That sounds like the 90s." If you hype a technology for 20 years and it doesn't really go anywhere, people become jaded and give up. I'm writing this because I believe there is good reason to have hope. Virtual reality is real, and it is cool. Let me explain why.

This post is meant to get into more detail about the current challenges associated with virtual reality, and the state of the art. Almost everything I'm going to write about is sourced from John Carmack's 2012 QuakeCon keynote, and the follow-up panel discussion with Michael Abrash (seminal FPS developer with John Carmack, now researching VR at Valve Software) and Palmer Luckey (VR headset enthusiast and founder of Oculus, the makers of the Rift VR headset due for release in 2013). These discussions are the most comprehensive treatment of the current state of VR I've seen or read anywhere, and they are extraordinarily timely.

What are the Elements of Immersion?

The goal of immersive VR is to create the illusion that you are looking through a portal into a game world. Once you've created a convincing portal, you need to bring that portal right up to the user's eyes, so they seem to be in the world, much in the same way you might bring a camera lens right up to a gap in a chain-link fence to make the picture appear as though it were taken on the other side. In order for the illusion to be effective, we really only need three things:

we need to know what views to generate, and
we need to respond to user action immediately, and
we need to send enough information to be convincing

In practice, knowing what view to generate hinges on being able to tell where the viewer's head is, and which way they are looking. Responding quickly is a function of sensor performance and code optimization. Sending enough information to be convincing is a function of the display's field of view, update rate, and resolution.

I'll address each of these challenges in turn.

Head Tracking

One key ingredient in creating a compelling virtual reality is head tracking, which means answering the question "Where is my head, and which way am I looking?" This involves accurately modeling how your head is moving while wearing the headset. There are two major aspects to head tracking that are related, but are solved in distinct ways. I will use the terms "pose" and "position" to describe them.

Determining Pose

"Pose" is the angle at which your head is held, which can be thought of in terms of pitch (up/down), roll (side-to-side) and yaw (rotation around the vertical). Because gravity provides us with a constant acceleration towards the center of the earth, a 3-axis accelerometer can measure where that acceleration is pointing and derive at least two of the values.

Carmack's discussion, both during the keynote at QuakeCon and during the Virtual Insanity panel with Abrash and Luckey, seemed to focus on the gyroscopic aspect of such an approach; he and Luckey repeatedly mentioned drift being an issue.

Determining the third value is problematic when it represents a rotation around the vertical, because turning about the vertical does not change the direction in which gravity pulls on the sensor. That can be solved by adding an additional accelerometer that is oriented differently within the headset. Cost is not an obstacle, as 3-axis accelerometers are available on a chip about the size of a thumbnail, and can be purchased at retail prices of about $1. They aren't quite a commodity, however, since different chips will supply data of different quality.
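To make the gravity trick concrete, here is a minimal sketch of recovering pitch and roll from a single accelerometer sample taken while the head is still. The axis convention, the sign choices and the function name are my own assumptions for illustration; a real sensor's datasheet defines its own axes.

```python
import math

def pitch_roll_from_gravity(ax, ay, az):
    """Estimate pitch and roll (degrees) from one accelerometer sample at rest.

    Assumed convention (hypothetical): x forward, y left, z up, and the sensor
    reports +1 g on z when the head is level.
    """
    pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll

# Head tilted 30 degrees under this convention: gravity splits between x and z.
tilted = pitch_roll_from_gravity(-math.sin(math.radians(30)), 0.0,
                                 math.cos(math.radians(30)))
print(tilted)
# Turning about the vertical leaves (ax, ay, az) unchanged, so yaw is not recoverable here.
```

Note that the function has no way to return yaw: every heading produces the same gravity reading, which is why a second source of information is needed for the third angle.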

One attribute that is important when applying them for use in a VR setting is how noisy the data is. An easy way to test this is to plug the chip into a laptop, open up a connection to it and stream the data out from it. The simplest version of this will simply print out three floating-point values representing the acceleration measured on each of the axes. At that point, it is easy to keep the sensor at rest (by placing it on a table) and watch how the values change. If the sensor's values are fluctuating when there is no movement, that indicates that the sensor is noisy, which, if uncorrected, leads to a shaky view when used for head tracking.
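Rather than eyeballing the stream, one way to put a number on the noise is to take the standard deviation of each axis while the sensor sits still. This is only a sketch: the readings below are made up, and how you actually read samples from the chip depends on the hardware.

```python
import statistics

def axis_noise(samples):
    """Return the per-axis standard deviation of (x, y, z) samples taken at rest."""
    xs, ys, zs = zip(*samples)
    return tuple(statistics.pstdev(axis) for axis in (xs, ys, zs))

# Made-up readings, in g, from a sensor lying flat on a table.
resting = [
    (0.01, -0.02, 1.00),
    (0.00, -0.01, 0.98),
    (0.02, -0.02, 1.01),
    (0.01, -0.03, 0.99),
]
noise_x, noise_y, noise_z = axis_noise(resting)
print(f"noise (g): x={noise_x:.4f} y={noise_y:.4f} z={noise_z:.4f}")
```

A perfectly quiet sensor would report zero on all three axes; the larger the numbers, the more the view will shake without filtering.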

Noisy data can be mitigated in software. In most applications, you run the output of the sensor through a filter that smooths the data, so the view inside the headset doesn't jitter. One way to filter would be to maintain a rolling average of several samples from the sensor, but this approach comes at the expense of responsiveness (an important topic discussed below). There are much more sophisticated filters that can be used that provide better results, naturally.
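The rolling average mentioned above is simple enough to sketch in a few lines, and doing so makes the responsiveness cost visible. The class name and window size are arbitrary choices of mine.

```python
from collections import deque

class RollingAverage:
    """Smooth a stream of readings by averaging the last `window` samples."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)

    def update(self, value):
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)

# A sudden head turn: the raw reading jumps from 0 to 10 degrees.
smoother = RollingAverage(window=4)
raw = [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]
smoothed = [smoother.update(r) for r in raw]
print(smoothed)
```

With a window of four, the filtered value needs four samples after the jump before it reaches the true reading. That lag is exactly the responsiveness penalty, and it grows with the window size.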

Determining Position

Whereas pose gives you the direction a user is looking, head position tells you where their head is in the 3D space. Is the user leaning to one side? Are they crouching? Lying down? While pose can generate a compelling experience, it is limited because it has nothing more to go on than where your face is directed, so it has to make an assumption about whether you are standing or sitting, for example. Tracking head position fixes that. For a long time, the idea that a perception of depth could be generated by tracking the position of the viewer's head wasn't taken very seriously, at least commercially. Stereoscopy was considered to be the driving factor and was pursued much more vigorously. There are two main reasons why tracking head position was ignored commercially:

Head tracking is really only limited to a single viewer: if you have two or more viewers sharing a single screen, head tracking actually impairs the experience for all but a single viewer (the one whose head is being tracked!)
Head tracking requires interactivity. For decades, the dominant 3D mediums were photos and movies. Neither medium is interactive: once a photo is taken or a movie is recorded, the viewer has no input that can change the data or allow it to be viewed from a different perspective. This made head tracking all but useless for the main commercial forms of entertainment.

Johnny Lee demonstrating head tracking using IR with Wii hardware

Because the technology was ignored by commercial interests, many folks didn't recognize just how effective the technique could be. But then Johnny Lee came along and, with only commodity hardware provided with the Nintendo Wii, he demonstrated exactly how impressive the effect really is. If you haven't seen his original video and you've made it this far into this article, you owe it to yourself to give it a watch.

In a virtual reality environment, both of the major limitations that inhibited commercial adoption evaporate: VR is inherently personal, so there are no issues with sharing a screen, and it is inherently based on video gaming, so it is highly interactive. That makes head position tracking a very worthwhile endeavor for any virtual reality system.

But how do you track the position of a person's head in a 3D space? We're looking for robust, cheap, responsive approaches. It's worth noting that almost every system that can provide positional data can also provide some kind of pose data, which can be used in conjunction with either accelerometer or inertial data to provide a very accurate tracking system.

Pseudo-Tracking

One of the easiest approaches is to fake it, and this is what was being demoed at QuakeCon 2012. Essentially, the insight is that having no head positional model (only angular information to formulate a pose) is awkward for the viewer. Your eyes expect to be displaced forward when you look down, for example, because your head and neck both have length that pushes your eyes forward as your head tilts forward. In the absence of that displacement, things seem a bit off. Carmack's approach for purposes of the demo was to create a rudimentary head and neck model based on his own build to create a basic effect, which he reported as working well for relatively small motions of the neck and head, but falling painfully short in cases where there was any leaning of the viewer's body or crouching.

This is a nice approach for a tech demo, but it just won't work for a commercial product.
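To illustrate the kind of head and neck model described above, here is a rough sketch that pivots the eyes around a point at the base of the neck. The two lengths, the axis layout and the function name are placeholders I made up, not values from Carmack's demo.

```python
import math

def eye_offset(pitch_deg, yaw_deg, neck_to_eye_up=0.12, neck_to_eye_forward=0.08):
    """Return the (x, y, z) eye displacement in metres relative to the neck pivot.

    Assumed axes: x right, y up, z forward. Positive pitch looks down and
    positive yaw turns right. Both lengths are hypothetical placeholders.
    """
    pitch = math.radians(pitch_deg)
    yaw = math.radians(yaw_deg)
    # Rotate the eye point about the x axis (pitch): it swings forward and down.
    y = neck_to_eye_up * math.cos(pitch) - neck_to_eye_forward * math.sin(pitch)
    z = neck_to_eye_up * math.sin(pitch) + neck_to_eye_forward * math.cos(pitch)
    # Then rotate about the vertical axis (yaw).
    x = z * math.sin(yaw)
    z = z * math.cos(yaw)
    return x, y, z

level = eye_offset(0, 0)
looking_down = eye_offset(30, 0)
print(level, looking_down)
```

Looking down moves the eye point forward and lower, which is the displacement your eyes expect. Nothing in the model responds to leaning or crouching, since it only ever sees angles.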

Microsoft Kinect

Microsoft's Kinect actually maps the room the user is in and then tracks the motion of the user's body using infrared. The system is extremely promising because it allows full body tracking, so jumping, crouching, turning and leaning would all be tracked very well, and it has the advantage of being relatively affordable. But Kinect falls short on responsiveness. Latency inside the Kinect itself (from time of motion to getting data from the Kinect) is 70ms, which means an update rate of only 14Hz. For a 60fps simulation, that means an update only once every four frames, which is unacceptably slow.

TrackIR

What about Johnny Lee's approach? His approach places an infrared camera in the environment outside the viewer that tracks the position of two infrared lights on glasses the viewer is wearing. The computer connected to the camera knows the distance between the lights, so determining distance is simple. Likewise, tracking where the lights are appearing in the frame gives both an azimuth and a polar angle, assuming the camera is calibrated properly. Those three values (radial distance, polar angle and azimuthal angle) are the necessary components to plot the position of the user's head in a spherical coordinate system, and the system is simple enough that it is cheap (as Johnny demonstrated) and can be done quickly (the sensor data provided by the TrackIR system is clean and updates at 120Hz, which more than satisfies our responsiveness requirement). So what's the problem?

Robustness.

There is a major issue with this approach. Since the lights are located on the sides of the user's face, as soon as the user turns his or her head at an angle that causes one or both of the lights to be out of the camera's sight, the system has no way of determining the position. This leads to a catastrophic failure: at one moment, the virtual reality is extremely immersive, and the next everything goes haywire. The system we're looking for needs to degrade gracefully, rather than catastrophically, so as a user drifts too far away or looks in a direction that isn't optimal, they don't lose immersion.
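For the curious, the geometry behind the two-light approach can be sketched as follows. This assumes an idealized pinhole camera and lights that face the camera squarely; the focal length, pixel coordinates and function name are all hypothetical, and it is not how TrackIR or Johnny Lee's code is actually written.

```python
import math

def locate_head(left_px, right_px, led_spacing_m, focal_px, image_center):
    """Return (radial distance in metres, polar angle, azimuth) of the head.

    Angles are in radians. The polar angle is measured from straight up and
    the azimuth from the camera's optical axis. Assumes a pinhole camera
    whose focal length `focal_px` is expressed in pixels.
    """
    (lx, ly), (rx, ry) = left_px, right_px
    separation_px = math.hypot(rx - lx, ry - ly)
    # Similar triangles: lights that appear closer together are farther away.
    depth = focal_px * led_spacing_m / separation_px
    # Offset of the midpoint between the lights from the image centre.
    u = (lx + rx) / 2 - image_center[0]
    v = image_center[1] - (ly + ry) / 2  # image y grows downward
    ray_len = math.sqrt(u * u + v * v + focal_px * focal_px)
    radial = depth * ray_len / focal_px
    azimuth = math.atan2(u, focal_px)
    polar = math.acos(v / ray_len)
    return radial, polar, azimuth

# Two lights 15 cm apart, seen 40 pixels apart in the middle of a 640x480 frame.
print(locate_head((300, 240), (340, 240), 0.15, 600.0, (320, 240)))
```

The failure mode is visible in the first two lines of the function: if either light leaves the frame there are no coordinates to unpack, and the position is simply unknown.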

Magnetic Tracking

Razer makes a game controller called the Hydra that makes use of a magnetic tracking system to provide angular and positional information. The idea is that by detecting magnetic fields emanating from the controller, a sensor can determine where they are in the space surrounding the sensor. John Carmack seems to think this approach is the most robust of those discussed so far. In particular, it addresses the issue seen in TrackIR in which the system fails to report data when the user's head is positioned in certain ways. As with most any system, it does suffer from degraded accuracy as the distance from the sensor increases, though John points out that the degradation is graceful, which is vital for the VR application. One major advantage over the Kinect is that the Hydra system (which Carmack took out of the controller and embedded into a prototype VR headset) has an update rate of 240Hz, though four updates are needed to get a full positional update, resulting in an effective update rate of 60Hz. This equates to an update every 17ms or so, which is roughly four times the speed of the Kinect.
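The arithmetic behind the Kinect and Hydra comparison is easy to check. The sketch below uses only the figures quoted above; the variable names are mine.

```python
def update_interval_ms(rate_hz):
    """Time between updates, in milliseconds, for a given update rate."""
    return 1000.0 / rate_hz

kinect_interval = 70.0                  # ms, the Kinect latency quoted above
kinect_rate = 1000.0 / kinect_interval  # roughly 14 Hz
hydra_rate = 240 / 4                    # four raw updates per full position
hydra_interval = update_interval_ms(hydra_rate)
frame_ms = update_interval_ms(60)       # one frame of a 60fps simulation

print(f"Kinect: {kinect_rate:.1f} Hz, one update per {kinect_interval / frame_ms:.1f} frames")
print(f"Hydra:  {hydra_rate:.0f} Hz, one update per {hydra_interval:.1f} ms")
print(f"Hydra updates {kinect_interval / hydra_interval:.1f} times as often")
```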

Response Time

One of the ongoing challenges in developing a virtual reality headset has been getting the entire system to update the display in a timely manner. There is a lot of data out there that manufacturers publish about the latency within just the GPU, or just the display, but it is rare to see an engineer publish end-to-end data about how much delay there is between the moment the system receives input and the moment that input is reflected on the display. John Carmack takes that metric fairly seriously.

To measure display latency, I have a small program that sits in a spin loop polling a game controller, doing a clear to a different color and swapping buffers whenever a button is pressed. I video record showing both the game controller and the screen with a 240 fps camera, then count the number of frames between the button being pressed and the screen starting to show a change.
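Turning a frame count from that kind of recording into a latency figure is a one-line calculation. In this sketch the count of 12 frames is a made-up example, not a measurement from Carmack.

```python
def latency_from_frames(frame_count, camera_fps=240):
    """Convert a count of high-speed camera frames into milliseconds.

    Returns (latency_ms, resolution_ms); the measurement cannot be more
    precise than the duration of one camera frame.
    """
    frame_ms = 1000.0 / camera_fps
    return frame_count * frame_ms, frame_ms

latency, resolution = latency_from_frames(12)
print(f"{latency:.1f} ms, give or take {resolution:.1f} ms")
```

At 240 fps each camera frame covers a little over 4ms, which is fine enough to tell a 20ms system from a 50ms one.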

The problem is that all the delays add up - an LCD display can take up to 20ms to change a pixel, and it is very common for GPUs to buffer data. In many applications, these delays don't affect the user experience very much. One notable example is movies; if a user has to wait an extra 1/10 of a second for the movie to play, it doesn't affect the user experience at all. But if the system introduces 100ms delays between when a user turns their head in VR and when the view begins to change, that can seriously affect the credibility of the simulation: it can break immersion and make you feel like you're "dragging" the view in the direction you look. It turns out that if you don't pay close attention, those delays can start to add up.

So response time matters. But how fast is fast enough? There are a whole lot of "how much data is enough?" problems in computing when it comes to perception. You see it in bitrates for music and movies, frame rates for games, and resolution on displays and mobile devices. "Magic numbers" are highly subjective, though it usually isn't too hard to get a basic number that works for most people, and VR is no different. In Carmack's tests, he reports that the "magic number" is 20ms, though acceptable results can be obtained in the 40-50ms latency range. So, as long as the delay between user input and the pixels lighting up reflecting that input is no more than 20ms, immersion is maintained. It turns out that such a low latency is difficult to obtain, but not outside the realm of possibility. The important point here is that you need to have enough CPU and GPU power to drive your display (at whatever resolution it is running), fast enough accelerometers and/or gyros and low enough display latency (the time it takes for the display to change the color of a pixel) that the whole process doesn't take more than 20ms. We don't have that yet, but we're getting there.

Another point that bears mentioning is the buffering issue, and Carmack spent a good amount of time discussing this. In both GPUs and displays, advanced features like overlays and picture filters are often implemented, for the sake of simplicity, using buffering techniques. This is another case where it doesn't matter much for movies, but it matters a lot for gaming in VR. Because almost no commercial headset was designed with VR in mind, this tends to be an issue. Hence Carmack's well-known post to Twitter:

I can send an IP packet to Europe faster than I can send a pixel to the screen. How f’d up is that?

If you read his explanation of what was happening to cause that much delay, it was essentially a case where Sony's headset, explicitly designed for movies, was implementing overlays using buffering. Why does buffering cause a delay, though?

Basically, a buffer is a queue inside the headset itself. The game engine accepts input, calculates the view, sends it to the GPU for rendering, and it gets pushed out to the display. Let's say that takes 16.7ms (for the sake of simplicity: that gives us the standard 60 frames per second). That frame arrives at the headset and is put at the end of the queue. A typical queue may be two or three frames long, and the headset processes the frames at 60fps. For a non-interactive experience, this is completely fine; no one is going to miss the 50ms or so of delay that it takes for the display to work through the three frames in the queue. But a viewer using VR who expects the display to update just as his or her vision would in real life will really notice the total lag of nearly 70ms (roughly 17ms for the render, plus 50ms for the display). Add in sensor lag, and you can easily get up to 100ms or more. So buffering, whether it occurs in the GPU or display, is the enemy of a responsive, immersive experience.
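Here is a small sketch of how a frame queue inflates end-to-end latency, using the 16.7ms render time and the 20ms target discussed above. The function and its parameters are my own simplification; a real pipeline has more stages than this.

```python
def pipeline_latency_ms(render_ms, queued_frames, display_fps=60, sensor_ms=0.0):
    """Total delay when each queued frame waits one display refresh."""
    frame_ms = 1000.0 / display_fps
    return sensor_ms + render_ms + queued_frames * frame_ms

TARGET_MS = 20.0
for queued in (0, 1, 3):
    total = pipeline_latency_ms(render_ms=16.7, queued_frames=queued)
    verdict = "within" if total <= TARGET_MS else "over"
    print(f"{queued} queued frames: {total:.1f} ms ({verdict} the 20 ms target)")
```

Even a single buffered frame pushes the total past the target, before any sensor lag is counted.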

Field of View and Display Resolution

There has been a long-standing problem with virtual reality headsets: the field of view is far too narrow to be immersive. Even high-end headsets have fields of view that are 20 or 30 degrees. That compares to a respectable television setup, but compared to your everyday field-of-view, it's tiny. It simply doesn't provide the immersion needed to generate a convincing environment. For reference, if you view a 54" TV at 10 feet, the field of view is about 24 degrees. So, one of the areas that had to be improved in order to produce convincing VR was an increase in field of view. The current demos accomplish credible VR through the use of some recent improvements in display technology, as well as a willingness to relax what was previously believed to be a requirement.
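The television figure can be checked with a little trigonometry. This sketch assumes a flat 16:9 screen viewed head-on from its centre, which is my assumption rather than anything stated in the talks.

```python
import math

def field_of_view_deg(extent, distance):
    """Angle subtended by a flat object of size `extent` viewed from `distance`."""
    return math.degrees(2 * math.atan((extent / 2) / distance))

diagonal_in = 54.0
distance_in = 10 * 12                            # 10 feet, in inches
width_in = diagonal_in * 16 / math.hypot(16, 9)  # assumes a 16:9 panel

horizontal = field_of_view_deg(width_in, distance_in)
diagonal = field_of_view_deg(diagonal_in, distance_in)
print(f"horizontal: {horizontal:.1f} degrees, diagonal: {diagonal:.1f} degrees")
```

Under these assumptions the horizontal angle comes out a little over 22 degrees and the diagonal a little over 25, which brackets the figure of about 24 degrees.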

First, they use display technology whose improvement has been driven by the mobile device market. Previous generations of VR headsets have historically used displays found in portable television devices, since there was no market for displays under 10 inches prior to the rise of smartphones around 2006 or so. Portable television displays were power hungry and had low resolution. But why did that restrict field of view?

The problem is that to provide an adequate field of view, the display has to sit very close to the user's eye, and may be viewed through a lens of some type. The result is that unless the display is extremely high resolution, the pixels become too apparent to the viewer and ruin the immersion. So, to effectively provide VR, we essentially have been waiting for portable displays that have the right pixel density. Recent commodity displays have hit the sweet spot in terms of pixel density, and as the public has become more aware of the benefits of high pixel density in a desktop and mobile computing setting, the market has responded accordingly. The current target resolution for the first-generation Oculus Rift will be 1280x800 (640x800 for each eye), but that is expected to increase by four times in the next year or two.
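A crude way to see why pixel density matters so much is to count pixels per degree of view. In the sketch below, the 1080p television spanning about 22 degrees and the 90 degree field of view for the headset are both assumptions of mine for the sake of comparison, and the calculation ignores the fact that lenses spread pixels unevenly.

```python
def pixels_per_degree(horizontal_pixels, fov_deg):
    """Average number of pixels covering each degree of the field of view."""
    return horizontal_pixels / fov_deg

tv = pixels_per_degree(1920, 22.0)      # assumed 1080p TV spanning about 22 degrees
headset = pixels_per_degree(640, 90.0)  # 640 pixels per eye over a hypothetical 90 degrees
print(f"TV: {tv:.1f} px/degree, headset: {headset:.1f} px/degree")
```

Stretching the same pixels over a much wider angle makes each one far larger in your vision, which is why a wide field of view had to wait for dense displays.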

So, displays just magically got that much better? Not quite.

The driving force behind virtual reality in the last 20 years hasn't been gaming, or even the commercial sector. It's true that folks have had big ideas and been enthusiastic in the gaming community, but the funding, and therefore the design and the advances that made the early prototypes possible, came from the government. It's not so hard to understand: especially in the military, one of the biggest ongoing costs is personnel and training. Virtual reality was a potential mechanism to train personnel in a way that was safe, while being more realistic and immersive than many of the older training materials.

But integral to that use case, and therefore the designs that emerged from it, was a requirement that the displays be relatively distortion free, that is, that a square appear to be a square. If you have any experience with optics in photography, you'll know that distortion-free optics, especially in optics with a short focal length, is an extraordinarily expensive proposition. The concept is similar in big immersive displays for virtual reality: trying to make them distortion free while maintaining high resolution, wide field of view and low cost is simply not possible. So what gives?

It turns out that users simply don't notice the distortion all that much. The biggest drivers behind the immersion are wide field of view and responsiveness, with other more obvious features, like stereoscopy, low distortion and high resolution, playing important, but secondary, roles. This isn't actually all that surprising: IMAX, which lacks responsiveness and low distortion, still provides a very compelling experience, for example.

Even so, the Rift is designed so that the two eyes share a single screen that is split down the middle, and lenses are placed in front of the screen, which distort the image coming through them, gathering pixels toward the center of the image and making them more sparse around the edges. This benefits the user in some ways, since it provides more information in the foveal region of the eye, which is responsible for the sharp image you see at the center of your vision that is used for gaming, reading, driving and other activities that require sharp visual acuity.

But correcting that distortion has historically been a non-trivial operation that has to be implemented as the last pass over the rendered image, increasing computational load significantly and reducing framerate (which reduces responsiveness). What changed? It turns out that newer graphics cards provide really good performance for OpenGL fragment shaders, which can take care of correcting that distortion with a minimum of overhead. Once the work is done to derive the right algorithm for the shader, the computational time is minimal, since it's accelerated in hardware.
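To give a feel for the per-pixel work such a shader does, here is a Python stand-in for a simple radial correction. This is a generic polynomial model that I am using for illustration, with made-up coefficients; it is not the Rift's actual shader, which would run on the GPU rather than in Python.

```python
def predistort(u, v, k1=0.22, k2=0.24):
    """Map an output pixel to the point in the rendered image to sample.

    u and v are normalized so the lens centre is (0, 0) and the edge is near 1.
    k1 and k2 are hypothetical coefficients; real values depend on the lens.
    """
    r2 = u * u + v * v
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return u * scale, v * scale

for point in [(0.0, 0.0), (0.25, 0.0), (0.5, 0.0), (1.0, 0.0)]:
    print(point, "->", predistort(*point))
```

Pixels near the centre barely move, while pixels near the edge are pulled from much farther out, squeezing the image in a way that counteracts the stretching done by the lens. The work per pixel is a handful of multiplications, which is why it is cheap once it runs in a fragment shader.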

Immersion: Compelling Enough?

I've reached just about the limit of my knowledge (and time!) to cover the material Carmack discussed during his keynote, but I hope I've laid a solid foundation to justify why getting excited about virtual reality makes sense now more than it ever has before.

But even if you buy the argument that virtual reality already exists and that you can have it for a few hundred dollars in a few months, complete with a few really compelling titles, there are a lot of folks that feel that the experience simply isn't compelling enough to warrant the investment, either for consumers or for developers. I couldn't disagree more. I think there is little doubt that virtual reality will be to the next generation of consoles what motion control mechanisms have been to this generation: the implementation may be refined significantly, there will be winners and losers, but the idea will endure. If anything, virtual reality will revitalize interest in motion-based controls, since virtual reality games require some level of decoupling between head motion and hand motion, and motion controllers provide that.

But that's a discussion for another post.