Usage¶
Running sourcepredict on the test dataset¶
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/test/dog_test_sink_sample.csv -O dog_example.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_labels.csv -O sp_labels.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_sources.csv -O sp_sources.csv
$ sourcepredict -s sp_sources.csv -l sp_labels.csv dog_example.csv
Command line interface¶
$ sourcepredict -h
usage: SourcePredict v0.4 [-h] [-a ALPHA] [-s SOURCES] [-l LABELS]
[-n NORMALIZATION] [-dt DISTANCE] [-r TAX_RANK]
[-me METHOD] [-kne NEIGHBORS] [-kw WEIGHTS]
[-e EMBED] [-di DIM] [-o OUTPUT] [-se SEED]
[-k KFOLD] [-t THREADS]
sink_table
==========================================================
SourcePredict v0.4
Coprolite source classification
Author: Maxime Borry
Contact: <borry[at]shh.mpg.de>
Homepage & Documentation: github.com/maxibor/sourcepredict
==========================================================
positional arguments:
sink_table path to sink TAXID count table in csv format
optional arguments:
-h, --help show this help message and exit
-a ALPHA Proportion of sink sample in unknown. Default = 0.1
-s SOURCES Path to source csv file. Default =
data/modern_gut_microbiomes_sources.csv
-l LABELS Path to labels csv file. Default =
data/modern_gut_microbiomes_labels.csv
-n NORMALIZATION Normalization method (RLE | Subsample | GMPR | None).
Default = GMPR
-dt DISTANCE Distance method. (unweighted_unifrac | weighted_unifrac)
Default = weighted_unifrac
-r TAX_RANK Taxonomic rank to use for Unifrac distances. Default =
species
-me METHOD Embedding Method. TSNE, MDS, or UMAP. Default = TSNE
-kne NEIGHBORS Numbers of neigbors if KNN ML classication (integer or
'all'). Default = 0 (chosen by CV)
-kw WEIGHTS Sample weight function for KNN prediction (distance |
uniform). Default = distance.
-e EMBED Output embedding csv file. Default = None
-di DIM Number of dimensions to retain for dimension reduction.
Default = 2
-o OUTPUT Output file basename. Default =
<sample_basename>.sourcepredict.csv
-se SEED Seed for random generator. Default = 42
-k KFOLD Number of fold for K-fold cross validation in parameter
optimization. Default = 5
-t THREADS Number of threads for parallel processing. Default = 2
Command line arguments¶
sink_table¶
Sink TAXID count table in csv
file format
Example sink count table file
+-------+----------+----------+
| TAXID | SINK_1 | SINK_2 |
+-------+----------+----------+
| 283 | 5 | 2 |
+-------+----------+----------+
| 143 | 25 | 48 |
+-------+----------+----------+
-alpha¶
Proportion of alpha of sink sample in unknown. Default = 0.1
$$\alpha \in [0,1]$$
Example:
-alpha 0.1
-s SOURCES¶
Path to source csv
(training) file with samples in columns, and TAXIDs in rows. Default = data/sourcepredict/modern_gut_microbiomes_sources.csv
Example:
-s data/sourcepredict/modern_gut_microbiomes_sources.csv
Example source file :
+-------+----------+----------+
| TAXID | SAMPLE_1 | SAMPLE_2 |
+-------+----------+----------+
| 467 | 18 | 24 |
+-------+----------+----------+
| 786 | 3 | 90 |
+-------+----------+----------+
-l LABELS¶
Path to labels csv
file of sources.
Default = data/modern_gut_microbiomes_labels.csv
Example:
-l data/modern_gut_microbiomes_labels.csv
Example source file :
+----------+--------+
| | labels |
+----------+--------+
| SAMPLE_1 | Dog |
+----------+--------+
| SAMPLE_2 | Human |
+----------+--------+
-n NORMALIZATION¶
Normalization method. One of RLE
, CLR
, Subsample
, or GMPR
. Default = GMPR
-dt DISTANCE¶
Distance method. One of unweighted_unifrac
, weighted_unifrac
. Default = weighted_unifrac
Example:
-dt weighted_unifrac
-kne NEIGHBORS¶
Numbers of neigbors for KNN classication. Default = 0 (chosen by CV). Either an integer, or ‘all’
Example:
-kne 30
-kne all
Setting the number of neighbors to 0 will let Sourcepredict choose the optimal number of neighbors for classification. If set toall
, the KNN algorithm will use all the training samples. For source proportion estimation, setting-kne
to all will give better estimations.See example 2 for illustration.
–kw WEIGHTS¶
Sample weight function for KNN prediction (distance | uniform). Default = distance.
Choose to give a uniform or distance based weights to neighbor samples in KNN algorithm.
Distance base weights will work better for classification while uniform weigths will work better for source proportion estimation.See example 2 for illustration.
-e EMBED¶
File for saving embedding coordinates in csv
format. Default = None
Example:
-e embed_coord.csv
-o OUTPUT¶
Sourcepredict Output file basename. Default = <sample_basename>.sourcepredict.csv
Example:
-o my_output
-k KFOLD¶
Number of fold for K-fold cross validation in parameter optimization. Default = 5
Example:
-k 5
Choice of the taxonomic classifier¶
Different taxonomic classifiers will give different results, because of different algorithms, and different databases.
In order to produce correct results with Sourcepredict, the taxonomic classifier and the database used to produce the source TAXID count table must be the same as the one used to produce the sink TAXID count table.
Because Sourcepredict relies on machine learning, at least 10 samples per sources are required, but more source samples will lead to a better prediction by Sourcepredict.
Therefore, running all these samples through a taxonomic classifier ahead of Sourcepredict requires a non-negligeable computational time.
Hence the choice of the taxonomic classifier is a balance between precision, and computational time.
While this documentation doesn’t intent to be a benchmark of taxonomic classifiers, the author of Sourcepredict has had decent results with Kraken2 and recommends it for its good compromise between precision and runtime.
The example source and sink data provided with Sourcepredict were generated with Kraken2.
If you already have kraken report formatted results (from Kraken, KrakenUniq, Kraken2, Centrifuge, …), you can use the kraken_parse.py script to convert a kraken report to a column of a TAXID count table.