Accessing and analyzing your data
=================================

Once your data has been organized into either a MongoDB database or a local
file format, you can access streamline and label data to perform all sorts of
analyses.

Searching your local datasource
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the necessary streamline labels and pickle files are in their specified
locations, we can begin to search the database::

    from dsi2.database.track_datasource import TrackDataSource

    # Load the json file to a list of Scan objects
    scans = get_local_data("path/to/file.json")

``scans`` will be a list of objects that can be used to select only the data
you want for this analysis. It does **not** load the pkl into memory until you
call its :py:meth:`~dsi2.database.traited_query.Scan.get_track_dataset`
method. Let's select only scans from the "example" study and load them into
memory::

    example_scans = [scan for scan in scans if scan.study_id == "example_study"]

    # Use them to build a queryable data source
    data_source = TrackDataSource(
        track_datasets=[scan.get_track_dataset() for scan in example_scans])

``data_source`` provides an interface to a searchable MNI space. Raw
streamlines aren't particularly useful on their own, so we will create an
object that uses this interface to search through and aggregate streamlines.

Searching your MongoDB datasource
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you're using the unit testing data, you can access the ``dsi2_test``
database after making sure ``mongod`` is properly configured and running::

    from dsi2.database.mongo_track_datasource import MongoTrackDataSource

    data_source = MongoTrackDataSource(
        scan_ids=["0377A", "2843A"],
        mongo_host="127.0.0.1",
        mongo_port=27017,
        db_name="dsi2")

The MongoDB version has a number of advantages. Primarily, it makes
concurrency easier and scales much better than the pickle approach. The
:py:class:`~dsi2.database.mongo_track_datasource.MongoTrackDataSource` class
has all the same methods as the pickle version.
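Because both classes expose the same query interface, downstream analysis code
does not need to know which backend it was given. As a minimal sketch (the
``build_data_source`` helper below is illustrative and not part of dsi2), you
could pick the backend at run time and hand the result to the rest of your
script unchanged::

    def build_data_source(use_mongo=False):
        """Return a queryable data source from either backend (illustrative)."""
        if use_mongo:
            from dsi2.database.mongo_track_datasource import MongoTrackDataSource
            return MongoTrackDataSource(
                scan_ids=["0377A", "2843A"],
                mongo_host="127.0.0.1",
                mongo_port=27017,
                db_name="dsi2")
        from dsi2.database.track_datasource import TrackDataSource
        # get_local_data is used exactly as in the example above
        scans = get_local_data("path/to/file.json")
        example_scans = [scan for scan in scans if scan.study_id == "example_study"]
        return TrackDataSource(
            track_datasets=[scan.get_track_dataset() for scan in example_scans])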
Aggregating streamlines based on termination regions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To perform the analysis from [1]_ we need to create an *aggregator*. An
aggregator

* subclasses :py:class:`~dsi2.aggregation.cluster_ui.ClusterEditor`
* overrides the ``aggregate`` method
* operates on a :py:class:`~dsi2.database.track_datasource.TrackDataSource`

For a region-based LTPA, we can create a
:py:class:`~dsi2.aggregation.region_labeled_clusters.RegionLabelAggregator`,
which provides some methods for easily analyzing the streamlines returned by a
search. Suppose we'd like to search a set of coordinates around
:math:`(33,54,45)`::

    from dsi2.ltpa import mni_white_matter, run_ltpa
    from dsi2.aggregation import make_aggregator
    from dsi2.streamlines.track_math import sphere_around_ijk

    region_agg = make_aggregator(
        algorithm="region labels",
        atlas_name="Lausanne2008",
        atlas_scale=60,
        data_source=data_source)

    # Create a set of search coordinates
    sphere_radius = 2                 # in voxels
    center_coordinate = (33, 54, 45)  # in MNI152 i,j,k coordinates
    search_coords = sphere_around_ijk(sphere_radius, center_coordinate)

    # Query the data source with the coordinates, get new TrackDatasets
    query_tracks = data_source.query_ijk(search_coords, fetch_streamlines=False)

    # Feed the query results to the aggregator
    region_agg.set_track_sets(query_tracks)
    region_agg.update_clusters()

    # Which region pairs were found, and how many streamlines connect each?
    conn_ids, connection_vectors = region_agg.connection_vector_matrix()

    # If running an interactive ipython session, plot it
    region_agg.plot_connection_vector_lines(connection_vectors, conn_ids)

``conn_ids`` is a list of the connections (regionA, regionB) found to pass
through the search coordinates. If there are :math:`n` individuals in the
:py:class:`~dsi2.database.track_datasource.TrackDataSource` and :math:`m`
elements in ``conn_ids``, then ``connection_vectors`` will be an
:math:`n \times m` matrix in which row :math:`i`, column :math:`j` contains
the streamline count for connection (region pair) :math:`j` in subject
:math:`i`'s data.

Running a whole-brain LTPA
~~~~~~~~~~~~~~~~~~~~~~~~~~

The code above is tedious and would take a long time to loop over the whole
brain. It is much more convenient to use the :py:meth:`~dsi2.ltpa.run_ltpa`
function. Here is an example that performs a simple whole-brain LTPA.

``run_ltpa`` requires a description of an aggregator: a dictionary containing
all the information necessary to construct an aggregator for your analysis
function. It works this way, instead of taking an Aggregator object directly,
because ``run_ltpa`` constructs a new Aggregator inside each independent
process it launches. The dictionary must contain at least the key
``"algorithm"``, which can be one of {"region labels", "k-means",
"quickbundles"}. The rest of the keys are passed as keyword arguments to
:py:meth:`~dsi2.aggregation.make_aggregator`.

When ``run_ltpa`` loops over coordinates, the result of each spatial query is
sent to an instance of the aggregator. The aggregator's ``aggregate()`` method
is called for each TrackDataset returned from the query, and then the
aggregator is passed to whichever function you provided to ``run_ltpa``.

NOTE: If you select the "region labels" aggregator, you won't have access to
the streamline objects. To access streamlines, choose "k-means" or
"quickbundles".

::

    agg_args = {
        "algorithm":   "region labels",
        "atlas_name":  "Lausanne2008",
        "atlas_scale": 60,
        "data_source": data_source
    }

You will also need to define a function that extracts the information you care
about from the aggregator::

    def get_n_streamlines(aggregator):
        """
        This function should be replaced with a function that accepts a single
        argument, does something, then returns the results you care about.

        This particular function calculates the mean number of streamlines
        observed in each subject and returns this value along with its
        standard deviation across all subjects. We used this to calculate how
        many streamlines pass through each voxel, then compared this number to
        how many real axons are known to pass through a voxel (based on
        electron microscopy).

        NOTE: you can access streamlines directly through the aggregator's
        ``track_sets`` attribute, which is a list of TrackDataset objects.
        Each has a ``.tracks`` attribute containing the numpy array of
        streamline coordinates. In this case ``.tracks`` will be empty because
        we are using a region label aggregator.
        """
        # The "region labels" aggregator has a ``connection_vector_matrix()``
        # method, which returns a list of all connections observed going
        # through the query coordinates (``conn_ids``) and a matrix in which
        # each row is a subject and each column is a connection.
        conn_ids, cvec_mat = aggregator.connection_vector_matrix()
        # Sum across all connections for each subject
        sl_counts = cvec_mat.sum(1)
        return sl_counts.mean(), sl_counts.std()
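The analysis function is not limited to streamline counts. As another minimal
sketch (``get_n_connections`` is illustrative and not part of dsi2), a
callback that simply counts how many distinct region pairs pass through the
searched coordinates could look like this::

    def get_n_connections(aggregator):
        """Count the distinct region pairs found at this coordinate (illustrative)."""
        conn_ids, cvec_mat = aggregator.connection_vector_matrix()
        return len(conn_ids)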
The variable ``mni_white_matter`` contains the coordinates inside FSL's MNI
2mm white matter mask. We can split these coordinates across a number of
processes; here we use two::

    results = run_ltpa(get_n_streamlines,
                       data_source=data_source,
                       aggregator_args=agg_args,
                       radius=2,
                       n_procs=2,
                       search_centers=mni_white_matter)

For each coordinate in ``mni_white_matter``, a tuple containing the mean
streamline count and its standard deviation is stored in ``results``. You can
make the analysis function return as much data as you'd like; ``results`` will
always be in the same order as the coordinates specified in
``search_centers``.

References
~~~~~~~~~~

.. [1] Cieslak, M., & Grafton, S. T. Local termination pattern analysis: a
   tool for comparing white matter morphology. Brain Imaging Behav,
   DOI 10.1007/s11682-013-9254-z (2013).