//=========
// Author:  Alex Reynolds & Shane Neph
// Project: starch
// File:    README (test)
//=========



Running Tests
====================================================================================
To run all tests, type:

  $ make all

Names of individual tests refer to specific test categories and are described below.

NOTE! The /tmp folder is used to store intermediate results. If you are running all 
the tests, please have ~1.7 GB of free space available in your /tmp folder (there 
are several tests performed here, with variations on version and compression formats 
that will result in many copies of the same data -- further, some tests are 
dependent on intermediate results of other tests).



Clean-up
====================================================================================
To clean the results of all tests, type:

  $ make clean

This calls each individual test's clean-up process and removes the parent folder
that stores test results.



Overview
====================================================================================
This directory contains the following subdirectories:

. / binaries
  / compression_and_extraction
  / concatenation
  / metadata_integrity
  / nesteds_and_duplicates
  / original_data
  / timing
  / updating



Binaries
====================================================================================
To explain the version numbering applied to starch/starchcat/unstarch binaries, we make the 
distinction between "archive version" and "binary version". 

The "binary version" value will always be concurrent with the rest of the BEDOPS toolkit.
As of April 30 2014, this is v2.5.0.

The "archive version" value refers to the version of metadata contained within a Starch archive
file. Different archive versions provide access to different metadata. As of April 30 2014, the
current archive version is v2.1.0. 

Starch binaries that can open archives containing v2.1 and older metadata are also called 
"v2.1-metadata" binaries. Similarly, binaries that can open and access v2.0 and older metadata 
properties are called "v2.0-metadata" binaries, and so on.

The "binaries" folder in this test suite contains platform-specific starch binaries. The 
v1.2-metadata binaries are sourced from our old Google Code site. The v1.5-metadata binaries 
were compiled from a private source repository through version edits to starchMetadataHelpers.h. 
The archive-v2.0 binaries were sourced from our Github site (generally the latest version
that creates a v2.0 Starch archive; on April 30 2014 this would be a v2.4.2 binary release). 
As of April 30, 2014 we support v2.1-metadata archives with v2.5.0 binaries.

For architecture, we currently offer tests of 64-bit binaries for Linux and universal ("fat")
binaries on the OS X platforms. We do not offer Cygwin tests at this time.



Original Data
====================================================================================
The "original_data" folder contains the "original" BED data that goes into testing the 
functionality of starch, unstarch and starchcat. This consists of a 1M-element BED file
created from randomly-selected hg19 chromosomes from data obtained via the UCSC Genome 
Browser MySQL service.

This file is roughly 45 Mb in size and should do a decent job in testing how we handle 
streaming input on a typical workstation, because that file size will easily cut across 
the size of the buffers used when handling input, compressing, and extracting data (the
intermediate buffers used in starch, unstarch and starchcat are "delicate" areas where 
compression, extraction and concatenation can easily go wrong -- having input data go
over the size of these buffers is a good stress test).

To create a sorted, random BED input file only, without running tests, run:

  $ make random_bed_file

Results are in "/tmp/${username}/starch_regression_test/data/random.bed". See the 
"./original_data/src/make_unsorted_random_bed_file.sh" script for more details. 

The remaining folders are described below and represent test categories.



Compression and Extraction
====================================================================================
To ensure backwards compatibility, this tests whether data compressed with (at this 
time, April 30 2014, archive version v1.2, v1.5, v2.0 and v2.1) starch binaries can 
each be extracted with the versions of unstarch: the version that created the archive, 
and  the current version of unstarch (binary v2.5). Finally, we use 'diff' to ensure 
that the BED input and extracted output are identical.

To run these tests directly:

  $ make compression_and_extraction_tests



Concatenation
====================================================================================
To ensure that starchcat can concatenate Starch archives, we test the following 
use cases:
 
 * joining two or more Starch files of different versions
 * joining two or more Starch files with disjoint chromosome streams
 * joining two or more Starch files with mixed chromosome streams

For the first test, we start with the different versions of archives created in the
"compression-and-extraction" test category. The original BED file has 1M elements, so
we take, say, 250K elements from the first v1.2 archive and recompress them. We take 
the next 250K elements from the next v1.5 archive and recompress those. We take the 
remaining 500K elements from the third and fourth v2.0 and v2.1 archives, resp. and 
recompress them. 

If we concatenate these smaller archives, we expect that the final archive will 
contain all 1M original elements and in the same sort order. (If there are more, fewer 
or different elements, or if elements are out of order, then this test fails.)

For the second test, we take the random bed file and split it by chromosome into
separate BED files. We then compress these disjoint BED files into Starch archives
and concatenate the disjoint archives into one result. We then diff-test the final 
result with the random bed file.

The third test is similar to the first, except that we work within one version of an
archive. We split the original file into 100K-element subfiles, compress them, and
attempt to concatenate them. We expect that the final archive will have all 1M original
elements in preserved order. This test shows that a set of starch files that contain 
records that span multiple chromosomes can be reconstructed correctly.

To run these tests directly:

  $ make concatenation_tests



Metadata Integrity
====================================================================================
We extract the JSON-formatted metadata (with --list-json-no-trailing-newline) from a 
v2.1 archive and create a Base64-encoded SHA-1 hash of that data, using OpenSSL, xxd,
and base64 command-line utilities. This is compared with the result stored in the 
archive via --sha1-signature. If the hashes are identical, this provides assurance of
the integrity of the metadata.

Secondly, we review the metadata contents, specifically the per-chromosome element 
count and the overall and unique base counts. We grab a chromosome at random and 
use awk statements to generate "expected" results, which are compared to values 
obtained through per-chromosome <chr> --elements, <chr> --bases and <chr> --bases-uniq 
calls to the Starch archive. We also test all-chromosome results.

To run these tests directly:

  $ make metadata_integrity_tests



Nested and Duplicate Elements
====================================================================================
Starch v2.1 metadata (written by starch/starchcat v2.5 or greater, and read by unstarch
v2.5 or greater) include flags that indicate if a nested or duplicate element is located 
in a specified chromosome (or if the chromosome is omitted, if the condition is present 
in any chromosome contained within the archive). Here, we test if a given BED file 
compressed and extracted with v2.5 or greater tools display condition flags with 
expected values.

To run these tests directly:

  $ make nested_and_duplicate_element_tests




Updating
====================================================================================
Starchcat can also update older archives to current archive version, writing fresh
metadata in the process. We test whether v1.2, v1.5 and v2.0 archives can be updated 
to v2.1 Starch archives. Additionally, we compare extracted BED streams.

To run these tests directly:

  $ make updating_tests



Timing (TBD)
====================================================================================
Here we will provide tests for the amount of time it takes to compress and extract 
archives using the three versions of starch and unstarch. This should give us an idea 
of relative performance improvements (or areas where improvements are needed). The details 
of this test will be available when written.
