How it works ============ This document provides a high level overview of how the DensityServer works. ## Overview - Data is stored in using block layout to reduce the number of disk seeks/reads each query requires. - Data is downsampled by ``1/2``, ``1/4``, ``1/8``, ... depending on the size of the input. - To keep the server response time/size small, each query is satisfied using the appropriate downsampling level. - The server response is encoded using the [BinaryCIF](https://github.com/dsehnal/BinaryCIF) format. - The contour level is preserved using relative instead of absolute values. ## Data Layout To enable efficient access to the 3D data, the density values are stored in a "block level" format. This means that the data is split into ``NxNxN`` blocks (by default ``N=96``, which corresponds to ``96^3 * 4 bytes = 3.375MB`` disk read per block access and provides good size/performance ratio). This data layout enables to access the data from a hard drive using a bounded number of disk seeks/reads which greatly reduces the server latency. ## Downsampling - The input is density data with ``[H,K,L]`` number of samples along each axis (i.e. the ``extent`` field in the CCP4 header). - To downsample, use the kernel ``C = [1,4,6,4,1]`` (customizable on the source code level) along each axis, because it is "separable": ``` downsampled[i] = C[0] * source[2 * i - 2] + ... + C[4] * source[2 * i + 2] ``` The downsampling step is applied in 3 steps: ``` [H,K,L] => [H/2, K, L] => [H/2, K/2, L] => [H/2, K/2, L/2] ``` (if the dimension is odd, the value ``(D+1)/2`` is used instead). - Apply the downsampling step iteratively until the number of samples along the largest dimension is smaller than "block size" (or the smallest dimension has >2 samples). ## Satisfying the query When the server receives a query for a 3D region, it chooses the the appropriate downsampling level based on the required details so that the number of voxels in the response is small enough. This enables sub-second response time even for the largest of entries. ### Encoding the response The [BinaryCIF](https://github.com/dsehnal/BinaryCIF) format is used to encode the response. Floating point data are quantized into 1 byte values (256 levels) before being sent back to the client. This quantization is performed by splitting the data interval into uniform pieces. ## Preserving the contour level Downsampling the data results in changing of absolute contour levels. To mitigate this effect, relative values are always used when displaying the data. - Imagine the input data points are ``A = [-0.3, 2, 0.1, 6, 3, -0.4]``: - Downsampling using every other value results in ``B = [-0.3, 0.1, 3]``. - The "range" of the data went from (-0.4, 6) to (-0.3,3). - Attempting to use the same absolute contour level on both "data sets" will likely yield very different results. - The effect is similar if instead of skipping values they are averaged (or weighted averaged in the case of the ``[1 4 6 4 1]`` kernel) only not as severe. - As a result, the "absolute range" of the data changes, some outlier values are lost, but the mean and relative proportions (i.e. deviation ``X`` from mean in ``Y = mean + sigma * X``) are preserved. ---------------------- ## Compression Analysis - Downsampling: ``i-th`` level (starting from zero) reduces the size by approximate factor ``1/[(2^i)^3]`` (i.e. "cubic" of the frequency). - BinaryCIF: CCP4 mode 2 (32 bit floats) is reduced by factor of 4, CCP4 mode 1 (16bit integers) by factor of 2, CCP4 mode 0 (just bytes) is not reduced. This is done by single byte quantization, but smarter than CCP4 mode 0 - Gzip, from observation: - Gzipping BinaryCIF reduces the size by factor ~2 - ~7 (2 for "dense" data such as x-ray density, 7 for sparse data such such an envelope of a virus) - Gzipping CCP4 reduces the size by 10-25% (be it mode 2 or 0) - Applying the downsampling kernel helps with the compression ratios because it smooths out the values. ### Toy example: ``` Start with 3.5GB compressed density data in the CCP4 mode 2 format (32-bit float for each value) => ~4GB uncompressed CCP4 => Downsample by 1/4 => 4GB * (1/4)^3 = 62MB => Convert to BinaryCIF => 62MB / 4 = ~16MB => Gzip: 2 - 8 MB depending on the "density" of the data (e.g. a viral shell data will be smaller because it is "empty" inside) ```