Thursday 25 October 2018

Validating UTF8 strings with Lookup Tables

A recent post on Daniel Lemire's blog heralds a new performance milestone in SIMD utf8 string validation using AVX (256 bit) instructions.  While the AVX version may actually be a pessimization on some existing hardware*, the older 128 bit SIMD version should still be blazingly fast on most systems.

Checking utf8 string validity using SIMD intrinsics is the logical performance optimization to make (I have significant experience with this myself). However, in reaching first for our largest optimization hammer, have we prematurely neglected potentially performant scalar solutions? In particular, the state-machine-like nature of utf8 validity checking strongly suggests a lookup table style of solution.

So how might a non-SIMD, lookup table (effectively 16 bit) version operate, and how might it perform? The short answer: lookup table based utf8 validation can outperform SIMD-128bit validation, but does not do so on average due to caching effects.


Lookup Table Naive Implementation

First, like other utf8 validating schemes, we start with a decoder state. However, since these states will be used in lookup tables, we reduce the total number of states to as few as possible (the utf8 decode rules appear in Lemire's post). Below is the state graph showing, for each consumed byte (hex), the transition to the next state enum.
Complete non-degenerate decoder state graph
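
For concreteness, such a state set could be expressed as an enum along the following lines. This is a sketch only: READY_NEW is the one name that appears in the code below, while the remaining names and the exact state count are illustrative of the usual utf8 rules rather than the post's actual table.

enum DecodeState : uint8_t
{
    READY_NEW = 0,   // at a code point boundary (also the accepting state)
    ERROR_STATE,     // invalid sequence seen; sticky
    ONE_MORE,        // expecting 1 more continuation byte (0x80..0xBF)
    TWO_MORE,        // expecting 2 more continuation bytes
    THREE_MORE,      // expecting 3 more continuation bytes
    TWO_MORE_E0,     // after 0xE0: next byte restricted to 0xA0..0xBF (no overlongs)
    TWO_MORE_ED,     // after 0xED: next byte restricted to 0x80..0x9F (no surrogates)
    THREE_MORE_F0,   // after 0xF0: next byte restricted to 0x90..0xBF (no overlongs)
    THREE_MORE_F4    // after 0xF4: next byte restricted to 0x80..0x8F (max U+10FFFF)
};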



Once we have our states we can begin with our initial naive lookup table implementation: a single table that takes the current state and the current byte of the array and maps them to the next decoder state. This table is simple to construct, taking a brute force 2^(4+8) operations. Below is a graphical illustration of execution, showing the propagation of the decoding state.

Flow of decoder state passing through lookup tables



In code it is simply:

uint8_t runState = READY_NEW;
while (len > 0)
{
    // Index: current byte in the low 8 bits, current 4 bit state above it.
    runState = tableDecode[*src++ | (runState << 8)];
    len--;
}
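
For reference, constructing the table itself is a straightforward brute force sweep over every (state, byte) pair. A minimal sketch, assuming a scalar nextState() helper that encodes the utf8 rules (the helper name and the table declaration are illustrative):

uint8_t tableDecode[16 * 256];   // 4 state bits + 8 byte bits = 2^(4+8) entries

void buildTableDecode()
{
    for (uint32_t state = 0; state < 16; state++)
        for (uint32_t byte = 0; byte < 256; byte++)
            tableDecode[byte | (state << 8)] = nextState(state, byte);
}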


So how does this naive implementation perform? The answer is not that well at all! It beats a branching and testing implementation but isn't anywhere close to the original SIMD-128 version published by Lemire.

The problem with this naive lookup table approach is the latency of a memory load.  Even for data in L1 cache, the latency of an indexed lookup is at least 5 cycles, so the fastest this naive approach will ever be is 5 cycles per byte. While modern hardware can issue multiple instructions in a single cycle (otherwise known as superscalar execution), the data dependency here prevents this technique from providing any additional performance.  To go beyond this naive implementation we need to reduce instruction dependencies and decode more than one byte at a time.
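For concreteness, on a hypothetical 3 GHz core that bound works out to 3,000,000,000 cycles/s ÷ 5 cycles/byte ≈ 600 MB/s, no matter how wide the rest of the machine is.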

Naive Lookup Implementation will take at least 5 cycles per byte


Chained Superposition Implementation

In order to reduce instruction dependence we are going to have to produce and chain lookups whose values are in superposition over all possible decode states that could have come before. The initial decode translation will also operate on two bytes at a time via a 16 bit lookup table.
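
One way to picture a superposition is as a complete transition function: for every decoder state the preceding bytes could have left us in, it records the resulting state after consuming this chunk. Chaining two chunks is then just function composition, which is what the join tables precompute. A conceptual sketch (the struct and NUM_STATES are illustrative; the real tables store each distinct function as a small integer ID):

#include <cstdint>

static const int NUM_STATES = 9;   // illustrative state count

struct Superposition
{
    uint8_t next[NUM_STATES];   // resulting state for each possible incoming state
};

// The superposition for two adjacent chunks is the composition of their
// transition functions.
Superposition compose(const Superposition& first, const Superposition& second)
{
    Superposition out;
    for (int s = 0; s < NUM_STATES; s++)
        out.next[s] = second.next[first.next[s]];
    return out;
}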


While this might sound insanely complex, it turns out that there are only a small number of superpositions across all possible chainings of superposition lookups (20 possible superpositions for any 2 byte lookup and a total of 52 superpositions for all possible chains).



Given the small number of superpositions, each can be represented as a simple 6 bit number and chained using a 6 + 6 = 12 bit lookup table. Below is a cycle graph showing how this new version might execute in time.



In full detail, the code of the validation loop looks something like this:

const uint16_t* asShort = reinterpret_cast<const uint16_t*>(src);
const uint16_t* asShortEnd = &asShort[len / sizeof(uint16_t)];
uint8_t runCon = runState;
// Note: as written, this loop assumes the number of 16 bit words is a
// multiple of four; any remainder would need to be handled separately.
while (asShort != asShortEnd)
{
    // Four independent 16 bit lookups, each mapping a 2 byte chunk to a
    // 6 bit superposition over all possible incoming decoder states.
    uint8_t decoded0 = tableDecode[*asShort++];
    uint8_t decoded1 = tableDecode[*asShort++];
    uint8_t decoded2 = tableDecode[*asShort++];
    uint8_t decoded3 = tableDecode[*asShort++];
    // Pairwise joins (6 + 6 = 12 bit lookups) chain the superpositions.
    uint8_t prejoin0 = tableJoin[(decoded0 << 6) | decoded1];
    uint8_t prejoin1 = tableJoin[(decoded2 << 6) | decoded3];
    uint8_t toCon = tableJoin[(prejoin0 << 6) | prejoin1];
    // Finally collapse the 8 byte superposition against the running state.
    runCon = tableConcate[runCon | (toCon << 4)];
}


Results and Comparisons

Since Lemire provided the full source and benchmarks for his original SIMD version, we are able to do a direct performance comparison between the SIMD-128bit version and the chained superposition lookup table algorithm. Below are tables reporting this comparison for both minimum cycles/byte and average cycles/byte. Note that every sample repeat used a unique random utf8 string (this is important, as timing the same string repeatedly would favor a lookup table solution).





The results correspond to general expectations. The lookup table driven algorithm requires a hot cache to perform anywhere close to the SIMD-128 version, so performance is only comparable for large strings or high sample repeat counts.  As it stands the performance can be summarized as follows: in the limit of long strings and repeated usage, the lookup table version is still at minimum 6% slower, and 12% slower on average, than SIMD-128bit.


Discussion and Conclusion

While still slightly slower on average it is quite remarkable that an algorithm that works in effectively 16 bits can be competitive with a 128 bit SIMD algorithm. Even more surprising is that it can, under certain conditions, exceed the speed of code that makes use of these powerful "vector" instructions.

Here are the suggested reasons why the table based approach achieves such performance:

  1. Superscalar machines perform SIMD-like operations automatically by issuing multiple instructions per cycle.
  2. Lookup tables perform computation in a manner that in effect creates new instructions custom fit for the algorithm's specific purpose.
  3. Caching behavior causes lookup table algorithms to increase in performance for specific data values (similar in effect to branch prediction).



So what conclusions can we draw from this exercise? Should we validate utf8 strings with lookup tables? Not likely. The lookup tables require creation and cause cache pollution.  However, one might pause and wonder: if this type of performance can be obtained from a machine that isn't designed for these kinds of operations, what performance might be attained on hardware with a different design?

Ref

*AVX performance issues
Despite my cpu being less than 2 years old (an i7-6700HQ with AVX2 support), the newer AVX version published by Lemire is actually always significantly slower (~30%) than his original SIMD-128bit version on my machine. The nature of this unexpected pessimization is not clear.


My full table lookup utf8 validity checking algorithm can be found here:
https://github.com/darkcephas/utf8Validity_lookupTable


ScreenToGif was used to create animations

Update:
Very similar implementation https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

Wednesday 21 March 2018

Direct X Raytracing: Further Unification

DirectX Raytracing (DXR)

DXR has recently been announced, and many old-time graphics programmers are obviously very interested in how it might change the future of GPU realtime (RT) rendering. This article is written from the perspective of a Gen ~3.5 render engineer, not someone currently writing production code against this new API.

API Summary

DXR presents a new API for specifying a 3D scene and initial rays; these are passed to a hidden GPU (device) layer that performs the trace and reports results back in the form of normally executing shader code [2]. For full details check out the DXR forum post [3]. (A minimal host-side sketch appears after the feature list below.)

Highlighted features:
  • "Independent" (to usercode) Rays defined by origin, direction, and min/max t.
  • Executed shaders for intersection hits (miss, nearest, any)
  • Executed shaders for ray generation and optional custom intersection test shaders
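
To give a concrete feel for the shape of the API, below is a minimal host-side sketch of launching a trace using the standard D3D12 DXR interfaces. This is a sketch only: the raytracing state object, the shader table buffers, and all of the address/size/stride variables are assumed to have been created elsewhere, and the post itself discusses the earlier experimental release.

// Assumes <d3d12.h>, a created ID3D12GraphicsCommandList4* cmdList, and an
// already-built ID3D12StateObject* rtStateObject plus shader table buffers.
D3D12_DISPATCH_RAYS_DESC desc = {};
desc.RayGenerationShaderRecord.StartAddress = rayGenShaderTableVA;
desc.RayGenerationShaderRecord.SizeInBytes  = rayGenRecordSize;
desc.MissShaderTable.StartAddress           = missShaderTableVA;
desc.MissShaderTable.SizeInBytes            = missShaderTableSize;
desc.MissShaderTable.StrideInBytes          = missRecordStride;
desc.HitGroupTable.StartAddress             = hitGroupTableVA;
desc.HitGroupTable.SizeInBytes              = hitGroupTableSize;
desc.HitGroupTable.StrideInBytes            = hitGroupRecordStride;
desc.Width  = renderWidth;   // one ray generation invocation per pixel
desc.Height = renderHeight;
desc.Depth  = 1;

cmdList->SetPipelineState1(rtStateObject);
cmdList->DispatchRays(&desc);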


Harmonizing Techniques

DX Raytracing has the potential to harmonize many of the existing and varied rendering techniques. Many current rendering subsystems try, in their own unique way, to trace and sample light, material, and geometry in the scene. Though perhaps not yet performant enough, the availability of RT raytracing could unify these sampling mechanisms either directly through raw DXR or via an engine layer.  The introduction of RT raytracing also further marries offline rendering with RT (as PBR did before it).  Below is a list of potential shifts in how RT rendering might grow and develop with the introduction of this common API.


Primary Rays

It is very unlikely that, in the near future, primary (from camera) rays will be raytraced, due mostly to cost. Rasterization is greatly superior in performance to raytracing up to some obscene level of scene complexity. Even with high scene complexity, modern render engines use various forms of occlusion testing to reduce overdraw/overtesting.  Still, raytracing primary rays might have some advantages in the areas of transparency, non-triangle rendering, and reduced render engine complexity.
Primary rays from camera in red. Secondary in black. [1]


Ubershader

As documented in the initial DXR API there are potential issues with hit shaders and material variation [2, section 3.1]. A single call to DispatchRays may end up with rays hitting multiple pieces of geometry that, in the actual scene, have different material (and therefore shader) definitions. Since the call to DispatchRays has only the specified hit shaders, there is a problem in trying to represent all materials in a single hit shader (what one might call an ubershader). This problem can be overcome with mixed shader combinations, uv storage and post shading passes, or callable material shaders within the hit shader.

Local Lighting Sampling

Many of the screenspace techniques developed over the last decade (ie SSAO) are about computing the potential occlusion of light by local geometry [5]. A simple locally limited raytrace can compute caustics, emissive lighting, specular occlusion, ambient occlusion, and multi-layer inter-reflection. Blending high quality local light sampling with more coarse grained lighting information is already done with techniques like Screen Space Reflection (SSR) [8].

Localized raytracing computes lighting in a similar but less obtuse manner than screen space tracing or voxel tracing.


Transparency

It was once hoped that raytracing would resolve the difficult technical issue that transparency presents to rasterization [6]. However, transparency does not appear to be solved by DXR, for two reasons. First, it is unlikely that (initially) the primary view will be raytraced; it is possible to, say, raytrace only the non-opaque parts of the primary view, but this would still likely be quite costly.  Second, the execution order of the any-hit shader is undefined [2, section 4.10], meaning it cannot be used out of the box to perform order independent transparency. Reordering of these intersection samples would be required to correctly shade the raytrace [7]. It might be possible instead to use the nearest-hit shader to find the first transparent object and then continue to trace from there iteratively.


Level of Detail (LOD)

Scene complexity reduction (lower poly count, lower material variation) is just as desirable for raytracing as for rasterization [4]. It is not entirely obvious how DXR would be used to support LOD, as it would need to be per ray and fully dynamic along the trace path. It could perhaps be supported through a combination of hit shaders and DXR API scene filters/masks. More complex solutions could involve regions with portals for tracing between LODs. DXR does not appear to have an out of the box LOD solution.

Shadows

Primary shadows (ie sun cascaded shadow maps) are not likely to be imminently replaced with raytracing, as the cost would be prohibitive due to the amount of scene being traced. Soft shadows have a somewhat different character with raytracing than with rasterization, as nearest-occluder information is available. Very likely some shadow mapping will remain for the time being and will instead be supplemented with raytraced local shadowing (especially for contact shadows).


Photon Mapping

Under specific conditions it tends to be advantageous to trace rays from the light source and not from the camera. A shadowmap is, in some ways, a first-hit ray coherent photon map. Raytracing from the light source through bounce lighting could enable some cost effective raytraced caustic rendering. A photon can be reverse frustum traced to determine its potential location in a screen space buffer.

Special Rendering

Due to the optional custom intersection shader, DXR implicitly supports specialized tracing. This type of specialized tracing could aid the use of non-polygonal (ie voxel) rendering and lend itself to high quality effects like fur, volumetric effects, particle rendering, and LOD occlusion [9].

Emissive surface

Emissive materials have been supported in some high quality render engines for some time now. There is a range of techniques to make this type of effect work, from voxel tracing to screenspace sampling to shader raytraced billboards [10]. Now this technique can be unified, when costs allow, and likely integrated into the general local raytracing shader.

Light Probes

If still required for performance, light probes can now be minimally rendered with selective rays and a specified maximum distance. Multiple probes can also be batched in a way that lends itself to cache coherency. Light information coming from raytraced sources can also be voxelized and then temporally and spatially propagated. In this way raytracing still acts as a harmonizing mechanism for all lighting computations, even if those computations are later stored and interpolated.

Potentially Obsolete Techniques

  • Secondary shadow maps (maybe)
  • SSAO (likely)
  • SSR (likely)
  • SSO (likely)
  • Caustic emulations (maybe)

Challenges for the DXR Paradigm

The DXR API specifically states that each ray is considered to be independent (when in flight) and has unique reference-able data. This makes DXR a pure raytracing API in that there is no user level access to ray hierarchy (aka coherent bundles). Given that it remains a practical question whether coherence mapping of rays adds any performance benefit to the raytracing process [11], the decision to hide ray hierarchy from userland code is likely the correct one.

There are however several downsides to this extremely simple and raw API that cannot be covered here but might be in a future article. They are simply listed below.

  • Sampling has to reconstruct (texture) filtering parameters
  • Hidden but perhaps leaky abstraction of performance (driver specific)
  • Cache issues with user code level adaptive sampling
  • Ubershader and limitations on "TraceRay" might make usage complex
  • Analytical (biased) tracing might be superior for low RT performance constraints
  • Unexposed but useful information is hidden


[1] https://blogs.msdn.microsoft.com/directx/2018/03/19/announcing-microsoft-directx-raytracing/
[2] D3D12 Raytracing Functional Spec
[3] DirectX forums: Getting started with raytracing
[4] Perceptual level of detail for efficient raytracing of complex scenes
[5] Bent Normals and Cones in Screen-space
[6] Order Independent Transparency in OpenGL 4.x
[7] Order Independent Transparency with Per-Pixel Linked Lists
[8] Efficient GPU Screen-Space Ray Tracing
[9] Sparse GPU Voxelization of Yarn-Level Cloth
[10] The Technology Behind the DirectX 11 Unreal Engine "Samaritan" Demo
[11] Coherent GPU Ray Tracing: Improving Data Locality

Tuesday 13 March 2018

Twenty years later: Offline vs Realtime Rendering

Just like its predecessor, Riven was a game composed of prerendered CGI images. Twenty years later, Obduction, a successor in the series, was created as a realtime game using the Unreal engine. Given the advancement of hardware and CGI algorithms over 20 years, how do the two compare?

Comparison validity

It does not initially seem fair to compare offline rendering to realtime (desktop) rendering. Beyond simply stating that this analysis will be qualitative, the following arguments can be made:

Performance

Riven was rendered offline using 13 SGI Indigo workstations. The performance estimate for this, treated as a single server, is likely no more than ~2.0 Gflops.  In contrast, today something like a GTX 1070 has on the order of 6 Tflops. Worst case frame time for an offline render in Riven was an hour or less. While Obduction is fully realtime (30 fps), it uses temporal super sampling that can require up to half a second to settle. Resolution in Riven is at least 1/4 the pixel density of the realtime version, though it is extremely difficult to draw a quantitative comparison due to scene complexity. The original offline rendering still likely had 2-6 times the raw compute power available to it per frame over the realtime advancements.

Tools and Budget

There are likely no first order arguments suggesting that CGI workflow tools have degraded in performance and usability.  It should be much easier to create high quality 3d scenes in 2013 than in 1993. Riven had a budget of ~4 million (2013 dollars) with a development time of 4 years. This contrasts with the smaller budget of 2.5 million (2013 dollars) and 3 year dev time for Obduction.

Style


Finally, to do qualitative comparisons of images we should first make it clear that this is an apples to apples comparison. The graphical styles of Riven and Obduction are the same in the following ways:

  • Attempts at photorealism
  • Objects with history (wear)
  • Real world detail (construction)

Critique


Obduction

con - environment lighting, flat lighting, no contact/secondary shadows, specular highlights inconsistent with scene lighting, texture-as-material errors, incorrect weathering, incorrect usage, incorrect or missing construction, visibly low poly

pro - PBR, soft shadows, very high resolution textures,


Riven


con - sharp shadows, low res textures, poor organic representations, inconsistent materials, excessive bump mapping, stretched textures, unrealistic heightfield terrain

pro - raytraced materials and shadows, thousands of lights, apparent occlusion effects, properly weathered objects, properly constructed objects, real world details, no environment lighting

Conclusion

Graphically, Riven is a significantly better work. It stands to reason that it should be possible to achieve the graphical quality of Riven in realtime today; advancements in computer performance, content tools, and graphics algorithms lend credence to this idea.  Obduction, however, is not a data point in support of this theory.

[update]
There is apparently a realtime version of Riven in development called Starry Expanse. We will have to see what the end product looks like, but the results already appear to be an improvement over Obduction. (See images at http://www.starryexpanse.com/2017/06/10/stacking-shelves-gehns-lab/)


Ref

https://en.wikipedia.org/wiki/SGI_Indigo

https://en.wikipedia.org/wiki/MIPS_architecture#MIPS_III

http://mrillustrated.com/faq.php

Sunday 14 January 2018

On Group Selection

This post is an attempted response to Steven Pinker's "THE FALSE ALLURE OF GROUP SELECTION".




Introduction


Can natural selection allow for reproductive individuals in a group to exhibit altruistic behaviors?
In his lengthy article, Steven Pinker suggests that it cannot:

"A new mutation with this effect would not come to predominate in the population, and even if it did, it would be driven out by any immigrant or mutant that favored itself at the expense of the group." 

To flip this around a bit: Pinker is suggesting that natural selection has no way of producing altruism as a solution to environmental pressures.

In contrast, our simulations indicate that, if a species is organized into groups, a group-level selection mechanism provides a solution to the problem of altruism. I would call this type of selection "Anthropic Selection", as selection is based on the ongoing non-existence of the low-altruism groups.

Let us first clearly define our terms:
group - a sub-species that is only sexually reproductive with its own members.
altruism - a choice by an individual that is harmful to that individual but beneficial to the group.
reproductive individual - an individual that produces offspring with a mate, where the offspring are formed through the (random) combination of parental traits and (possibly) mutation.
altruism trait - a genetically inherited tendency to participate in altruistic behaviors.
dominating trait - significant representation of a trait (above the mutation rate).

Simulation

"What's satisfying about the theory [of natural selection] is that it is so mechanistic. " - Pinker

The simulation, built to find a contradictory case to the arguments made by Pinker, was designed to be as simple as possible while still legitimately expressing the intended mechanism.

The simulation is composed of a temporal collection of groups, which are composed of individuals. Each individual belongs to only one group and contains a single boolean representing the presence of the altruistic trait. Every update tick each group breeds fully: all individuals in the group are randomly paired with another intra-group member and each pair produces three offspring. To simulate individual longevity only the offspring remain in the group; the parents can be viewed as completely expiring (having successfully bred).

The act of heritable breeding is simulated as follows; note that in the actual simulation we can introduce a small mutation probability for each child that will randomly flip the inherited trait.
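
A minimal sketch of what this step might look like, given the definitions above (the data structures and the 50/50 trait inheritance are assumptions for illustration; the original simulation's exact scheme lives in its source):

#include <algorithm>
#include <random>
#include <vector>

struct Individual { bool altruistic; };

static std::mt19937 rng{ std::random_device{}() };

// Breed one generation of a group: shuffle, pair adjacent members, and give
// each pair three offspring (the post's 50% expected growth per update).
// Only the offspring are returned; the parents expire.
std::vector<Individual> breedGroup(std::vector<Individual> group, double mutationRate)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::shuffle(group.begin(), group.end(), rng);
    std::vector<Individual> offspring;
    for (size_t i = 0; i + 1 < group.size(); i += 2)
    {
        const Individual& parentA = group[i];
        const Individual& parentB = group[i + 1];
        for (int child = 0; child < 3; child++)
        {
            // Randomly combine parental traits...
            bool trait = (uniform(rng) < 0.5) ? parentA.altruistic : parentB.altruistic;
            // ...with a small chance that the inherited trait flips.
            if (uniform(rng) < mutationRate)
                trait = !trait;
            offspring.push_back({ trait });
        }
    }
    return offspring;
}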



The expected growth rate for each group is 50% per update. Instead of capping this growth we simply split a group once it exceeds a certain group size. The fission of a group always results in two new equal sized groups, with members randomly distributed between them.
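
A sketch of the fission step, reusing the Individual struct and rng from the sketch above (again illustrative only):

#include <utility>

// Split an oversized group into two equal halves with randomly distributed members.
std::pair<std::vector<Individual>, std::vector<Individual>> splitGroup(std::vector<Individual> group)
{
    std::shuffle(group.begin(), group.end(), rng);
    const size_t half = group.size() / 2;
    return { std::vector<Individual>(group.begin(), group.begin() + half),
             std::vector<Individual>(group.begin() + half, group.end()) };
}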




As described so far, the number of groups would simply grow without end and the altruistic trait, experiencing no selection pressure, would normalize in the population.  Our final ingredient is the altruistic selection phase.


Once the number of groups reaches a specified maximum, a culling altruism test is performed on each group independently. A single random individual is pulled from the group. If this individual is altruistic then that individual is destroyed but the rest of the group passes the test. If this individual is not altruistic then the individual is kept, but the entire group survives only with probability Pcritical (and is otherwise destroyed).
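
A sketch of this culling test on the assumed data structures (note that Pcritical here is the probability that a group failing the test nonetheless survives, which is the convention used by the algebraic example later on):

// Apply the culling test to one group. Returns true if the group survives.
bool cullTest(std::vector<Individual>& group, double pCritical)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::uniform_int_distribution<size_t> pick(0, group.size() - 1);
    const size_t idx = pick(rng);
    if (group[idx].altruistic)
    {
        // The altruist is sacrificed but the rest of the group passes.
        group.erase(group.begin() + idx);
        return true;
    }
    // A non-altruist keeps its place, but the whole group survives
    // only with probability Pcritical.
    return uniform(rng) < pCritical;
}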




It should be noted that if the chosen individual is altruistic they will not produce any offspring. This is in contrast to a non-altruistic individual, which still has probability Pcritical of surviving to breed.

Hypothesis:

  • The altruistic trait will eventually dominate all groups for sub-threshold values of Pcritical.
  • The introduction of low mutation rates will not significantly affect the domination outcome.


Setup:

The basic simulation parameters were configurable but were set, for all runs, to values that reduce mechanism noise.

Default conditions:
Mutation probability = 0.1%
Max number of groups before cull phase = 10000
Init Altruism trait = 50.0%    (effectively achieving a centered normal distribution)

Results:

Many simulation variations were run, and all indicate domination of the altruism trait over the course of time (see below).
Max pop 100, default conditions.

Max pop 500, default conditions
Max pop 1000, default conditions


As expected, it is also possible to change Pcritical to a value where altruism is no longer favored (see below).

Max pop 200, Pcrit = 1.0, default conditions



Domination table:

Domination of the altruism trait was dependent not only on Pcritical but also on group size. These threshold values were experimentally determined via simulation runs.

Group size    Altruism dominates for Pcrit below:
20            0.92
100           0.94
500           0.95
1000          0.96


Mutation effect:

Mutation rates increase the normalization of the simulation (as expected) but do not directly affect the domination outcome. This drift toward the normalized (50%) state is likely due purely to mutation acting on individuals. The graphs below contrast a very low mutation rate with a very high mutation rate.

Max pop 1000, mutation rate 0.01%, Pcrit 50%

Max population 1000, mutation rate 1%, Pcrit 50%


Algebraic Example:

Concerns about the correctness of the simulation motivate us to seek a simpler way to mathematically grasp its conclusions. A tractable example is enough to give an intuitive feel for the mechanism at work.

Suppose we have just 2 groups:
group A: 75 altruistic individuals out of 100 total
group B: 25 altruistic individuals out of 100 total

The total average altruism rate is 50% if we simply combine the two groups (A + B).
However, consider what happens if we apply a per-group culling phase.

Case probabilities:
Case 0: 0.75 * 0.25 = 0.1875  - A pass, B pass
Case 1: 0.75 * 0.75 = 0.5625  - A pass, B fail
Case 2: 0.25 * 0.25 = 0.0625  - A fail, B pass
Case 3: 0.25 * 0.75 = 0.1875  - A fail, B fail

Case    #Altruists, #Total                              => weighted by case probability
Case 0: 74 + 24, 99 + 99                                => 18.375, 37.125
Case 1: 74 + 25*Pcrit, 99 + 100*Pcrit                   => 41.625 + 14.0625*Pcrit, 55.6875 + 56.25*Pcrit
Case 2: 75*Pcrit + 24, 100*Pcrit + 99                   => 4.6875*Pcrit + 1.5, 6.25*Pcrit + 6.1875
Case 3: 75*Pcrit + 25*Pcrit, 100*Pcrit + 100*Pcrit      => 18.75*Pcrit, 37.5*Pcrit

Sum of all weighted cases:
61.5 + 37.5 * Pcrit altruists out of 99 + 100 * Pcrit individuals.
The altruistic fraction exceeds the original 50% whenever 61.5 + 37.5 * Pcrit > 0.5 * (99 + 100 * Pcrit), i.e. whenever Pcrit < 0.96. So in this example any Pcrit value less than 0.96 increases the altruistic fraction.
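
A short numeric check of the algebra above (a standalone snippet written for this post's example; it simply evaluates the four weighted cases for a given Pcrit):

#include <cstdio>

// Expected altruist fraction after one culling phase of the two-group example.
double altruistFractionAfterCull(double pCrit)
{
    double altruists = 0.1875 * (74 + 24)                      // A pass, B pass
                     + 0.5625 * (74 + 25 * pCrit)              // A pass, B fail
                     + 0.0625 * (75 * pCrit + 24)              // A fail, B pass
                     + 0.1875 * (75 * pCrit + 25 * pCrit);     // A fail, B fail
    double total     = 0.1875 * (99 + 99)
                     + 0.5625 * (99 + 100 * pCrit)
                     + 0.0625 * (100 * pCrit + 99)
                     + 0.1875 * (100 * pCrit + 100 * pCrit);
    return altruists / total;
}

int main()
{
    const double tests[] = { 0.5, 0.9, 0.96, 1.0 };
    for (double pCrit : tests)
        std::printf("Pcrit=%.2f -> altruist fraction %.4f\n",
                    pCrit, altruistFractionAfterCull(pCrit));
    return 0;
}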


Conclusion

It is possible to construct a simulation of natural selection that favors altruism in sexually reproductive individuals so long as these individuals are confined to groups. It is also a hope of this article to lend credence to the idea that implied mechanistic behaviors should be explored mechanically. 

Notes

Relevant Quotes from Paper

"If a person has innate traits that encourage him to contribute to the group's welfare and as a result contribute to his own welfare, group selection is unnecessary;  It's only when humans display traits that are disadvantageous to themselves while benefiting their group that group selection might have something to add."

" Natural selection could legitimately apply to groups if they met certain conditions: the groups made copies of themselves by budding or fissioning, the descendant groups faithfully reproduced traits of the parent group (which cannot be reduced to the traits of their individual members), except for mutations that were blind to their costs and benefits to the group; and groups competed with one another for representation in a meta-population of groups. But everyone agrees that this is not what happens in so-called "group selection." In every case I've seen, the three components that make natural selection so indispensable are absent."

"Modern advocates of group selection don't deny that selection acts on individual organisms; they only wish to add that it acts on higher-level aggregates, particularly groups of organisms, as well. "

"Human beings live in groups, are affected by the fortunes of their groups, and sometimes make sacrifices that benefit their groups. Does this mean that the human brain has been shaped by natural selection to promote the welfare of the group in competition with other groups, even when it damages the welfare of the person and his or her kin? If so, does the theory of natural selection have to be revamped to designate "groups" as units of selection, analogous to the role played in the theory by genes?"

"Is group selection necessary to explain the evolution of psychological traits adapted to group living such as tribalism, bravery, self-sacrifice, xenophobia, religion, empathy, and moralistic emotions? "
   

Ref

https://www.edge.org/conversation/steven_pinker-the-false-allure-of-group-selection

https://eli.thegreenplace.net/2016/drawing-animated-gifs-with-matplotlib/