Sourcepredict example 1: Gut host species prediction

[1]:
import pandas as pd
import pandas_ml as pdml

In this example, we will use Sourcepredict and Sourcetracker2 applied to the example dataset provided in the Sourcepredict directory.

The example datasets contains the following samples: - Homo sapiens gut microbiome (1, 2, 3, 4, 5, 6) - Canis familiaris gut microbiome (1) - Soil microbiome (1, 2, 3)

Preparing the data

[2]:
cnt = pd.read_csv("../data/modern_gut_microbiomes_sources.csv", index_col=0)
labels = pd.read_csv("../data/modern_gut_microbiomes_labels.csv", index_col=0)

This is a TAXID count table containing the samples as columns headers, and the TAXID as row indices

[3]:
cnt.head()
[3]:
SRR1175007 SRR042182 SRR061154 SRR061499 SRR063469 SRR062324 SRR1179037 SRR061236 SRR061456 SRR642021 ... mgm4477903_3 mgm4477807_3 mgm4477874_3 mgm4477904_3 mgm4477804_3 mgm4477873_3 ERR1939166 SRR3578625 ERR1939165 SRR3578645
TAXID
0 3528337.0 11563613.0 10084261.0 20054993.0 8747525.0 12116517.0 4191329.0 13992760.0 14825759.0 11083673.0 ... 6169203.0 8820851.0 5713837.0 10238500.0 5055930.0 10380594.0 13391896.0 1553.0 14802198.0 736.0
6 0.0 78.0 0.0 127.0 0.0 79.0 0.0 0.0 0.0 172.0 ... 68.0 247.0 211.0 156.0 147.0 383.0 1353.0 0.0 1522.0 0.0
7 0.0 78.0 0.0 127.0 0.0 79.0 0.0 0.0 0.0 172.0 ... 68.0 247.0 211.0 156.0 147.0 383.0 1353.0 0.0 1522.0 0.0
9 0.0 129.0 0.0 153.0 0.0 151.0 0.0 165.0 96.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 77.0 0.0 65.0 0.0
10 0.0 160.0 0.0 193.0 0.0 99.0 0.0 55.0 249.0 238.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 263.0 0.0 466.0 0.0

5 rows × 432 columns

The labels file contains the mapping of samples names with their actual origin (sources)

[4]:
labels.head()
[4]:
labels
SRR1175007 Homo_sapiens
SRR042182 Homo_sapiens
SRR061154 Homo_sapiens
SRR061499 Homo_sapiens
SRR063469 Homo_sapiens

We will divide the source in training (95%) and testing (5%) dataset

[5]:
cnt_train = cnt.sample(frac=0.95, axis=1)
cnt_test = cnt.drop(cnt_train.columns, axis=1)

We also have to subset the labels file to only the training dataset

[6]:
train_labels = labels.loc[cnt_train.columns,:]
test_labels = labels.loc[cnt_test.columns,:]

Sourcepredict

Last but not least, we must export the files to csv to run sourcepredict

[7]:
cnt_train.to_csv("gut_species_sources.csv")
cnt_test.to_csv("gut_species_sinks.csv")
train_labels.to_csv("gut_species_labels.csv")

We’ll now launch sourcepredict with the GMPR normalization method, and the t-SNE embedding, on 6 cores.

[8]:
%%time
!sourcepredict -s gut_species_sources.csv \
               -l gut_species_labels.csv \
               -n GMPR \
               -m TSNE \
               -e example_embedding.csv \
               -t 6 gut_species_sinks.csv
Step 1: Checking for unknown proportion
  == Sample: SRR1175007 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: SRR1175007
                 known:98.68%
                 unknown:1.32%
  == Sample: SRR061236 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: SRR061236
                 known:85.01%
                 unknown:14.99%
  == Sample: SRR063471 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: SRR063471
                 known:98.4%
                 unknown:1.6%
  == Sample: SRR1930132 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: SRR1930132
                 known:98.48%
                 unknown:1.52%
  == Sample: SRR1930133 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.99
        ----------------------
        - Sample: SRR1930133
                 known:98.49%
                 unknown:1.51%
  == Sample: SRR7658586 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: SRR7658586
                 known:79.65%
                 unknown:20.35%
  == Sample: SRR7658645 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.99
        ----------------------
        - Sample: SRR7658645
                 known:26.88%
                 unknown:73.12%
  == Sample: SRR7658584 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.98
        ----------------------
        - Sample: SRR7658584
                 known:85.78%
                 unknown:14.22%
  == Sample: SRR7658607 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.99
        ----------------------
        - Sample: SRR7658607
                 known:98.74%
                 unknown:1.26%
  == Sample: SRR7658597 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.98
        ----------------------
        - Sample: SRR7658597
                 known:99.1%
                 unknown:0.9%
  == Sample: SRR5898944 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: SRR5898944
                 known:98.48%
                 unknown:1.52%
  == Sample: ERR1914439 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: ERR1914439
                 known:98.48%
                 unknown:1.52%
  == Sample: ERR1915140 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: ERR1915140
                 known:98.48%
                 unknown:1.52%
  == Sample: ERR1914041 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: ERR1914041
                 known:98.48%
                 unknown:1.52%
  == Sample: ERR1915022 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: ERR1915022
                 known:98.48%
                 unknown:1.52%
  == Sample: ERR1915826 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: ERR1915826
                 known:98.48%
                 unknown:1.52%
  == Sample: ERR1913400 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: ERR1913400
                 known:98.48%
                 unknown:1.52%
  == Sample: ERR1915765 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: ERR1915765
                 known:98.48%
                 unknown:1.52%
  == Sample: ERR1915225 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: ERR1915225
                 known:98.48%
                 unknown:1.52%
  == Sample: mgm4477874_3 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mgm4477874_3
                 known:72.37%
                 unknown:27.63%
  == Sample: ERR1939166 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.97
        ----------------------
        - Sample: ERR1939166
                 known:47.44%
                 unknown:52.56%
  == Sample: ERR1939165 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.97
        ----------------------
        - Sample: ERR1939165
                 known:55.27%
                 unknown:44.73%
Step 2: Checking for source proportion
        Computing weighted_unifrac distance on species rank
        TSNE embedding in 2 dimensions
        KNN machine learning
        Performing 5 fold cross validation on 6 cores...
        Trained KNN classifier with 10 neighbors
        -> Testing Accuracy: 0.99
        ----------------------
        - Sample: SRR1175007
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR061236
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR063471
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR1930132
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR1930133
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR7658586
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR7658645
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR7658584
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR7658607
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR7658597
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: SRR5898944
                 Canis_familiaris:1.81%
                 Homo_sapiens:96.72%
                 Soil:1.47%
        - Sample: ERR1914439
                 Canis_familiaris:94.25%
                 Homo_sapiens:4.28%
                 Soil:1.47%
        - Sample: ERR1915140
                 Canis_familiaris:94.25%
                 Homo_sapiens:4.28%
                 Soil:1.47%
        - Sample: ERR1914041
                 Canis_familiaris:94.25%
                 Homo_sapiens:4.28%
                 Soil:1.47%
        - Sample: ERR1915022
                 Canis_familiaris:94.25%
                 Homo_sapiens:4.28%
                 Soil:1.47%
        - Sample: ERR1915826
                 Canis_familiaris:94.25%
                 Homo_sapiens:4.28%
                 Soil:1.47%
        - Sample: ERR1913400
                 Canis_familiaris:94.25%
                 Homo_sapiens:4.28%
                 Soil:1.47%
        - Sample: ERR1915765
                 Canis_familiaris:94.25%
                 Homo_sapiens:4.28%
                 Soil:1.47%
        - Sample: ERR1915225
                 Canis_familiaris:94.25%
                 Homo_sapiens:4.28%
                 Soil:1.47%
        - Sample: mgm4477874_3
                 Canis_familiaris:2.51%
                 Homo_sapiens:5.95%
                 Soil:91.53%
        - Sample: ERR1939166
                 Canis_familiaris:2.51%
                 Homo_sapiens:5.95%
                 Soil:91.53%
        - Sample: ERR1939165
                 Canis_familiaris:2.51%
                 Homo_sapiens:5.95%
                 Soil:91.53%
Sourcepredict result written to gut_species_sinks.sourcepredict.csv
Embedding coordinates written to example_embedding.csv
CPU times: user 3.64 s, sys: 828 ms, total: 4.47 s
Wall time: 4min 2s

Two files were generated by Sourcepredict: - gut_species_sinks.sourcepredict.csv which contains the proportions of each source

[9]:
sourcepred = pd.read_csv("gut_species_sinks.sourcepredict.csv", index_col=0)
[10]:
sourcepred
[10]:
SRR1175007 SRR061236 SRR063471 SRR1930132 SRR1930133 SRR7658586 SRR7658645 SRR7658584 SRR7658607 SRR7658597 ... ERR1915140 ERR1914041 ERR1915022 ERR1915826 ERR1913400 ERR1915765 ERR1915225 mgm4477874_3 ERR1939166 ERR1939165
Canis_familiaris 0.017814 0.015347 0.017765 0.017779 0.017781 0.014379 0.004853 0.015486 0.017825 0.017891 ... 0.928216 0.928216 0.928216 0.928216 0.928216 0.928216 0.928216 0.018200 0.011931 0.013898
Homo_sapiens 0.954443 0.822254 0.951788 0.952581 0.952644 0.770401 0.260037 0.829699 0.955022 0.958548 ... 0.042128 0.042128 0.042128 0.042128 0.042128 0.042128 0.042128 0.043078 0.028241 0.032896
Soil 0.014516 0.012506 0.014476 0.014488 0.014489 0.011717 0.003955 0.012619 0.014525 0.014579 ... 0.014504 0.014504 0.014504 0.014504 0.014504 0.014504 0.014504 0.662431 0.434275 0.505859
unknown 0.013226 0.149893 0.015971 0.015152 0.015086 0.203503 0.731155 0.142196 0.012627 0.008983 ... 0.015152 0.015152 0.015152 0.015152 0.015152 0.015152 0.015152 0.276292 0.525554 0.447347

4 rows × 22 columns

Let’s check which organism was predicted for each samples, and compare it with the true source

[11]:
comparison = sourcepred.idxmax().to_frame(name="prediction").merge(test_labels, left_index=True, right_index=True)
cm = pdml.ConfusionMatrix(y_true=comparison['labels'],y_pred=comparison['prediction'])
/Users/borry/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py:1494: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)

Let’s look at the confusion matrix

[30]:
cm.to_dataframe()
[30]:
Predicted Canis_familiaris Homo_sapiens Soil unknown
Actual
Canis_familiaris 8 0 0 0
Homo_sapiens 0 10 0 1
Soil 0 0 2 1
unknown 0 0 0 0

Finally, let’s compute the accuracy

[31]:
round(cm.stats()['overall']['Accuracy'],2)
[31]:
0.91

91% of the sink samples were correctly predicted !

  • The second file generated by sourcepredict is example_embedding.csv which contains the embedding coordinates of all samples (sources and sinks)
[14]:
embed = pd.read_csv("example_embedding.csv", index_col=0)
embed.head()
[14]:
PC1 PC2 labels name
SRR1761672 -12.706208 -7.860738 Homo_sapiens SRR1761672
SRR061456 -4.520492 24.795073 Homo_sapiens SRR061456
SRR1761718 -20.427488 -6.425568 Homo_sapiens SRR1761718
SRR7658589 -23.176891 -2.985772 Homo_sapiens SRR7658589
ERR1914932 28.669333 12.863045 Canis_familiaris ERR1914932

We can plot this embedding, using for example, plotnine, which implements the grammar of graphics in Python

[15]:
from plotnine import *
import warnings
warnings.filterwarnings('ignore')
[16]:
ggplot(data = embed, mapping = aes(x="PC1",y="PC2", color="labels")) + geom_point() + theme_classic()
_images/run_example_32_0.png
[16]:
<ggplot: (-9223372029842878797)>

We can see on this plot where the sink samples were embedded

Sourcetracker2

SourceTracker is designed to predict the source of microbial communities in a set of input samples” and is generally consired as the gold standard method to do so. The version 2 is a rewrite of the original Sourcetracker in Python.

We’ll reuse the same training and test files, but we need to reformat them a bit for sourcetracker: - In sourcetracker, the source (training) and sink (file) TAXIDs count table is a single file - The metadata file is slightly different

[17]:
cnt.to_csv("gut_species_taxid.csv", sep="\t", index_label="TAXID")
[18]:
test_labels['SourceSink'] = ['sink']*test_labels.shape[0]
[19]:
train_labels['SourceSink'] = ['source']*train_labels.shape[0]
[20]:
metadata = train_labels.append(test_labels).rename(columns = {"labels":"Env"})[['SourceSink','Env']]
metadata.head()
[20]:
SourceSink Env
SRR1761672 source Homo_sapiens
SRR061456 source Homo_sapiens
SRR1761718 source Homo_sapiens
SRR7658589 source Homo_sapiens
ERR1914932 source Canis_familiaris
[21]:
metadata.to_csv("st_gut_species_metadata.csv", sep="\t", index_label='#SampleID')

Finally, we need to convert the TAXIDs count table to biom format

[22]:
!biom convert -i gut_species_taxid.csv -o gut_species_taxid.biom --table-type="Taxon table" --to-json
Soucetracker launch command: sourcetracker2 gibbs -i gut_species_taxid.biom -m st_gut_species_metadata.csv -o gut_species --jobs 6
(Sourcetracker2 was run on a Linux remote server because of issues running it on MacOS)
[32]:
st_pred = pd.read_csv("gut_species/mixing_proportions.txt", sep = "\t", index_col=0)
st_pred.head()
[32]:
Canis_familiaris Homo_sapiens Soil Unknown
#SampleID
SRR1175007 0.0170 0.9609 0.0063 0.0158
SRR061236 0.0358 0.9365 0.0074 0.0203
SRR063471 0.0121 0.9724 0.0032 0.0123
SRR1930132 0.1466 0.3761 0.4477 0.0296
SRR1930133 0.1182 0.5082 0.3507 0.0229
[34]:
st_comparison = st_pred.idxmax(axis=1).to_frame(name="prediction")
st_comparison.head()
[34]:
prediction
#SampleID
SRR1175007 Homo_sapiens
SRR061236 Homo_sapiens
SRR063471 Homo_sapiens
SRR1930132 Soil
SRR1930133 Homo_sapiens

Let’s compare the SourceTracker prediction with the true source

[35]:
comparison2 = st_comparison.merge(test_labels, left_index=True, right_index=True)
comparison2
[35]:
prediction labels SourceSink
SRR1175007 Homo_sapiens Homo_sapiens sink
SRR061236 Homo_sapiens Homo_sapiens sink
SRR063471 Homo_sapiens Homo_sapiens sink
SRR1930132 Soil Homo_sapiens sink
SRR1930133 Homo_sapiens Homo_sapiens sink
SRR7658586 Homo_sapiens Homo_sapiens sink
SRR7658645 Homo_sapiens Homo_sapiens sink
SRR7658584 Soil Homo_sapiens sink
SRR7658607 Homo_sapiens Homo_sapiens sink
SRR7658597 Homo_sapiens Homo_sapiens sink
SRR5898944 Homo_sapiens Homo_sapiens sink
ERR1914439 Canis_familiaris Canis_familiaris sink
ERR1915140 Soil Canis_familiaris sink
ERR1914041 Canis_familiaris Canis_familiaris sink
ERR1915022 Canis_familiaris Canis_familiaris sink
ERR1915826 Canis_familiaris Canis_familiaris sink
ERR1913400 Canis_familiaris Canis_familiaris sink
ERR1915765 Canis_familiaris Canis_familiaris sink
ERR1915225 Canis_familiaris Canis_familiaris sink
mgm4477874_3 Soil Soil sink
ERR1939166 Soil Soil sink
ERR1939165 Soil Soil sink

Computing the accuracy

[36]:
cm2 = pdml.ConfusionMatrix(y_true=comparison2["labels"], y_pred=comparison2["prediction"])
cm2.to_dataframe()
[36]:
Predicted Canis_familiaris Homo_sapiens Soil
Actual
Canis_familiaris 7 0 1
Homo_sapiens 0 9 2
Soil 0 0 3
[38]:
acc2 = round(cm2.stats()['overall']['Accuracy'],2)
[38]:
0.86

Here, Sourcetracker only managed to predict 86% of the sink samples origin correctly

Conclusion

On this dataset, we’ve seen that Sourcepredict performs similary or even better than Sourcetracker on predicting accurately the source species