Sourcepredict example2: Estimating source proportions

For this example, we’ll reuse the dog, human, and soil dataset.
But unlike example 1, here we will mix samples from different sources and estimate the mixing proportions with Sourcepredict and Sourcetracker2.

Preparing mixed samples

[1]:
import pandas as pd
from plotnine import *
import numpy as np
[2]:
cnt = pd.read_csv("../data/modern_gut_microbiomes_sources.csv", index_col=0)
labels = pd.read_csv("../data/modern_gut_microbiomes_labels.csv",index_col=0)

As in example 1, we’ll first split the dataset into training (95%) and testing (5%) sets.

[3]:
cnt_train = cnt.sample(frac=0.95, axis=1)
cnt_test = cnt.drop(cnt_train.columns, axis=1)
train_labels = labels.loc[cnt_train.columns,:]
test_labels = labels.loc[cnt_test.columns,:]
[4]:
test_labels['labels'].value_counts()
[4]:
Homo_sapiens        11
Canis_familiaris     9
Soil                 2
Name: labels, dtype: int64
[5]:
cnt_test.head()
[5]:
SRR061456 SRR1175013 SRR059395 SRR1930141 SRR1930247 SRR1761710 SRR1761700 SRR7658605 SRR7658665 SRR7658625 ... ERR1916299 ERR1914213 ERR1915363 ERR1913953 ERR1913947 ERR1916319 ERR1915204 ERR1914750 mgm4477803_3 mgm4477877_3
TAXID
0 14825759.0 4352892.0 13691926.0 27457943.0 1212101.0 24026729.0 18876667.0 7776902.0 34166674.0 15983447.0 ... 1481835.0 3254064.0 2182037.0 2225689.0 2292745.0 1549960.0 760058.0 1750182.0 5492333.0 6004642.0
6 0.0 0.0 107.0 193.0 0.0 87.0 94.0 0.0 215.0 105.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 73.0 216.0
7 0.0 0.0 107.0 193.0 0.0 87.0 94.0 0.0 215.0 105.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 73.0 216.0
9 96.0 101.0 70.0 412.0 0.0 395.0 199.0 299.0 563.0 369.0 ... 62.0 63.0 110.0 59.0 63.0 51.0 0.0 61.0 0.0 0.0
10 249.0 0.0 136.0 614.0 0.0 265.0 267.0 76.0 985.0 350.0 ... 66.0 174.0 0.0 74.0 73.0 62.0 0.0 66.0 0.0 0.0

5 rows × 22 columns

We then create a function to randomly select a sample from each source (dog as \(s_{dog}\) and human as \(s_{human}\)), and combine them into a new sample \(s_{mixed} = p1*s_{dog} + (1-p1)*s_{human}\)

[6]:
def create_mixed_sample(cnt, labels, p1, samp_name):
    rand_dog = labels.query('labels == "Canis_familiaris"').sample(1).index[0]
    rand_human = labels.query('labels == "Homo_sapiens"').sample(1).index[0]
    dog_samp = cnt[rand_dog]*p1
    human_samp = cnt[rand_human]*(1-p1)
    comb = dog_samp + human_samp
    comb = comb.rename(samp_name)
    meta = pd.DataFrame({'human_sample':[rand_human],'dog_sample':[rand_dog], 'human_prop':[(1-p1)], 'dog_prop':[p1]}, index=[samp_name])
    return(comb, meta)
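To make the mixing formula concrete, here is a minimal sketch of the same convex combination on a toy count table (the taxa, sample names, and counts are made up for illustration):

```python
import pandas as pd

# Toy count table: two taxa (rows) x one sample per source (columns); names are made up
cnt_toy = pd.DataFrame({'dog1': [100, 0], 'hum1': [0, 200]}, index=[9, 10])
p1 = 0.3  # dog proportion

# s_mixed = p1 * s_dog + (1 - p1) * s_human
mixed = cnt_toy['dog1'] * p1 + cnt_toy['hum1'] * (1 - p1)
print(mixed.tolist())  # [30.0, 140.0]
```

The dog counts are scaled by p1 and the human counts by (1-p1), so the mixed sample is a weighted average of the two source profiles, taxon by taxon.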

We run this function for a range of dog proportions (10% to 90%, in steps of 10%), 3 times for each proportion, giving 27 mixed samples

[7]:
mixed_samp = []
mixed_meta = []
nb = 1
for i in range(3):
    for p1 in np.arange(0.1,1,0.1):
        s = create_mixed_sample(cnt=cnt_test, labels=test_labels, p1=p1, samp_name=f"mixed_sample_{nb}")
        mixed_samp.append(s[0])
        mixed_meta.append(s[1])
        nb += 1
[8]:
mixed_samples = pd.concat(mixed_samp, axis=1, keys=[s.name for s in mixed_samp]).astype(int)
mixed_samples.head()
[8]:
mixed_sample_1 mixed_sample_2 mixed_sample_3 mixed_sample_4 mixed_sample_5 mixed_sample_6 mixed_sample_7 mixed_sample_8 mixed_sample_9 mixed_sample_10 ... mixed_sample_18 mixed_sample_19 mixed_sample_20 mixed_sample_21 mixed_sample_22 mixed_sample_23 mixed_sample_24 mixed_sample_25 mixed_sample_26 mixed_sample_27
TAXID
0 1320165 27791888 14189886 14720060 18710369 1414816 2910789 4518936 2282396 7217415 ... 5331330 24788154 2020394 5968886 3231719 10313424 8480642 5192549 5175478 4809264
6 0 172 65 52 107 0 0 21 10 0 ... 8 173 0 0 0 47 37 32 18 19
7 0 172 65 52 107 0 0 21 10 0 ... 8 173 0 0 0 47 37 32 18 19
9 6 463 158 237 313 30 74 61 36 280 ... 96 370 132 227 81 130 110 56 88 97
10 7 802 239 159 579 37 51 86 34 68 ... 183 552 113 73 24 166 144 84 106 127

5 rows × 27 columns

[9]:
mixed_metadata = pd.concat(mixed_meta)
mixed_metadata.head()
[9]:
human_sample dog_sample human_prop dog_prop
mixed_sample_1 SRR1930247 ERR1913947 0.9 0.1
mixed_sample_2 SRR7658665 ERR1913947 0.8 0.2
mixed_sample_3 SRR1761700 ERR1914213 0.7 0.3
mixed_sample_4 SRR1761710 ERR1915204 0.6 0.4
mixed_sample_5 SRR7658665 ERR1914213 0.5 0.5

Now we can export the new “test” (sink) table to csv for sourcepredict

[10]:
mixed_samples.to_csv('mixed_samples_cnt.csv')

As well as the count and labels tables for the sources

[11]:
train_labels.to_csv('train_labels.csv')
cnt_train.to_csv('sources_cnt.csv')

Sourcepredict

For running Sourcepredict, we’ll change three parameters from their default values:

- -me: the embedding method. The default method used by Sourcepredict is T-SNE, which is a non-linear type of embedding, i.e. the distance between points doesn’t reflect their actual distance in the original dimensions, but which achieves a better clustering, which is good for source prediction. Because here we’re more interested in source proportion estimation than in source prediction, we’ll instead choose Multi Dimensional Scaling (MDS), which is a type of linear embedding where the distances between points in the lower dimensions better match the distances in the original dimensions, which is better for source proportion estimation.
- -kne: the number of neighbors in the KNN algorithm. We use a greater number of neighbors (50) so that the proportion estimation reflects the more global contribution of samples, instead of only the immediate neighbors. This will negatively affect source prediction, but gives better source proportion estimates.
- -kw: the weight function in the KNN algorithm. By default, a distance-based weight function is applied to give more weight to closer samples. However, because here again we’re more interested in source proportion estimation than in source prediction, we’ll disregard the distance-based weight function and give the same weight to all neighboring samples, regardless of their distance, with the uniform weight function.
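The effect of the uniform weight function can be seen directly in scikit-learn’s KNN classifier. The sketch below is not Sourcepredict’s actual implementation, just an illustration of the underlying idea on made-up 1-D data: with weights='uniform', predict_proba returns the raw class fractions among the k neighbors, which is what makes it behave like a proportion estimate.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
# Two 1-D clusters standing in for two sources (purely illustrative data)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]).reshape(-1, 1)
y = np.array(['dog'] * 50 + ['human'] * 50)

# weights='uniform': each of the 50 neighbors counts equally, so
# predict_proba is simply the class fraction among the neighbors
knn = KNeighborsClassifier(n_neighbors=50, weights='uniform').fit(X, y)
proba = knn.predict_proba([[2.5]])[0]  # a point between the two clusters
print(dict(zip(knn.classes_, proba.round(2))))
```

With weights='distance' instead, the nearest neighbors would dominate and the probabilities would be pushed toward 0 or 1, which is what we want to avoid for proportion estimation.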

[12]:
%%time
!python ../sourcepredict -s sources_cnt.csv \
               -l train_labels.csv \
               -n GMPR \
               -kne 50\
               -kw uniform \
               -me MDS \
               -e mixed_embedding.csv \
               -t 6 \
               mixed_samples_cnt.csv
Step 1: Checking for unknown proportion
  == Sample: mixed_sample_1 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_1
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_2 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.99
        ----------------------
        - Sample: mixed_sample_2
                 known:99.73%
                 unknown:0.27%
  == Sample: mixed_sample_3 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_3
                 known:98.79%
                 unknown:1.21%
  == Sample: mixed_sample_4 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_4
                 known:98.8%
                 unknown:1.2%
  == Sample: mixed_sample_5 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_5
                 known:98.82%
                 unknown:1.18%
  == Sample: mixed_sample_6 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_6
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_7 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_7
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_8 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_8
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_9 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_9
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_10 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_10
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_11 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.99
        ----------------------
        - Sample: mixed_sample_11
                 known:99.16%
                 unknown:0.84%
  == Sample: mixed_sample_12 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_12
                 known:98.5%
                 unknown:1.5%
  == Sample: mixed_sample_13 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_13
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_14 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_14
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_15 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_15
                 known:98.49%
                 unknown:1.51%
  == Sample: mixed_sample_16 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_16
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_17 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_17
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_18 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_18
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_19 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.9
        ----------------------
        - Sample: mixed_sample_19
                 known:99.79%
                 unknown:0.21%
  == Sample: mixed_sample_20 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_20
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_21 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_21
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_22 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_22
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_23 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_23
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_24 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_24
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_25 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_25
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_26 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_26
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_27 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_27
                 known:98.48%
                 unknown:1.52%
Step 2: Checking for source proportion
        Computing weighted_unifrac distance on species rank
        MDS embedding in 2 dimensions
        KNN machine learning
        Trained KNN classifier with 50 neighbors
        -> Testing Accuracy: 0.88
        ----------------------
        - Sample: mixed_sample_1
                 Canis_familiaris:87.65%
                 Homo_sapiens:11.14%
                 Soil:1.21%
        - Sample: mixed_sample_2
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_3
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_4
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_5
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_6
                 Canis_familiaris:92.78%
                 Homo_sapiens:6.02%
                 Soil:1.2%
        - Sample: mixed_sample_7
                 Canis_familiaris:54.39%
                 Homo_sapiens:44.26%
                 Soil:1.35%
        - Sample: mixed_sample_8
                 Canis_familiaris:36.94%
                 Homo_sapiens:61.66%
                 Soil:1.4%
        - Sample: mixed_sample_9
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_10
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_11
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_12
                 Canis_familiaris:30.32%
                 Homo_sapiens:68.27%
                 Soil:1.41%
        - Sample: mixed_sample_13
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_14
                 Canis_familiaris:30.32%
                 Homo_sapiens:68.27%
                 Soil:1.41%
        - Sample: mixed_sample_15
                 Canis_familiaris:24.34%
                 Homo_sapiens:74.24%
                 Soil:1.41%
        - Sample: mixed_sample_16
                 Canis_familiaris:30.32%
                 Homo_sapiens:68.27%
                 Soil:1.41%
        - Sample: mixed_sample_17
                 Canis_familiaris:27.24%
                 Homo_sapiens:71.35%
                 Soil:1.41%
        - Sample: mixed_sample_18
                 Canis_familiaris:83.76%
                 Homo_sapiens:15.02%
                 Soil:1.22%
        - Sample: mixed_sample_19
                 Canis_familiaris:24.34%
                 Homo_sapiens:74.24%
                 Soil:1.41%
        - Sample: mixed_sample_20
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_21
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_22
                 Canis_familiaris:40.4%
                 Homo_sapiens:58.2%
                 Soil:1.4%
        - Sample: mixed_sample_23
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_24
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_25
                 Canis_familiaris:30.32%
                 Homo_sapiens:68.27%
                 Soil:1.41%
        - Sample: mixed_sample_26
                 Canis_familiaris:21.65%
                 Homo_sapiens:76.94%
                 Soil:1.41%
        - Sample: mixed_sample_27
                 Canis_familiaris:85.18%
                 Homo_sapiens:13.6%
                 Soil:1.22%
Sourcepredict result written to mixed_samples_cnt.sourcepredict.csv
Embedding coordinates written to mixed_embedding.csv
CPU times: user 4.84 s, sys: 1.13 s, total: 5.97 s
Wall time: 5min 14s

Reading Sourcepredict results

[13]:
sp_ebd = pd.read_csv("mixed_embedding.csv", index_col=0)
[14]:
sp_ebd.head()
[14]:
PC1 PC2 labels name
SRR1761712 -3.713176 -0.344326 Homo_sapiens SRR1761712
ERR1913614 2.586560 -0.498098 Canis_familiaris ERR1913614
ERR1914349 -1.595982 -3.363762 Canis_familiaris ERR1914349
SRR1930255 2.174966 -0.728862 Homo_sapiens SRR1930255
SRR1646027 -3.329213 -0.214682 Homo_sapiens SRR1646027
[15]:
import warnings
warnings.filterwarnings('ignore')
[16]:
ggplot(data = sp_ebd, mapping = aes(x='PC1',y='PC2')) + geom_point(aes(color='labels')) + theme_classic()
_images/mixed_prop_26_0.png
[16]:
<ggplot: (-9223372029299469603)>
[17]:
sp_pred = pd.read_csv("mixed_samples_cnt.sourcepredict.csv", index_col=0)
[18]:
sp_pred.T.head()
[18]:
Canis_familiaris Homo_sapiens Soil unknown
mixed_sample_1 0.863266 0.109678 0.011905 0.015152
mixed_sample_2 0.036542 0.947680 0.013046 0.002733
mixed_sample_3 0.036198 0.938776 0.012923 0.012103
mixed_sample_4 0.055810 0.919077 0.013163 0.011951
mixed_sample_5 0.036211 0.939098 0.012927 0.011764
[19]:
mixed_metadata.head()
[19]:
human_sample dog_sample human_prop dog_prop
mixed_sample_1 SRR1930247 ERR1913947 0.9 0.1
mixed_sample_2 SRR7658665 ERR1913947 0.8 0.2
mixed_sample_3 SRR1761700 ERR1914213 0.7 0.3
mixed_sample_4 SRR1761710 ERR1915204 0.6 0.4
mixed_sample_5 SRR7658665 ERR1914213 0.5 0.5
[20]:
sp_res = sp_pred.T.merge(mixed_metadata, left_index=True, right_index=True)
[21]:
sp_res.head()
[21]:
Canis_familiaris Homo_sapiens Soil unknown human_sample dog_sample human_prop dog_prop
mixed_sample_1 0.863266 0.109678 0.011905 0.015152 SRR1930247 ERR1913947 0.9 0.1
mixed_sample_2 0.036542 0.947680 0.013046 0.002733 SRR7658665 ERR1913947 0.8 0.2
mixed_sample_3 0.036198 0.938776 0.012923 0.012103 SRR1761700 ERR1914213 0.7 0.3
mixed_sample_4 0.055810 0.919077 0.013163 0.011951 SRR1761710 ERR1915204 0.6 0.4
mixed_sample_5 0.036211 0.939098 0.012927 0.011764 SRR7658665 ERR1914213 0.5 0.5
[22]:
from sklearn.metrics import r2_score, mean_squared_error
[23]:
mse_sp = round(mean_squared_error(y_pred=sp_res['Homo_sapiens'], y_true=sp_res['human_prop']),2)
[24]:
p = ggplot(data = sp_res, mapping=aes(x='human_prop',y='Homo_sapiens')) + geom_point()
p += labs(title = f"Homo sapiens proportions predicted by Sourcepredict - $MSE = {mse_sp}$", x='actual', y='predicted')
p += theme_classic()
p += coord_cartesian(xlim=[0,1], ylim=[0,1])
p += geom_abline(intercept=0, slope=1, color = "red", alpha=0.2, linetype = 'dashed')
p
_images/mixed_prop_34_0.png
[24]:
<ggplot: (-9223372036555534967)>

On this plot, the dashed red line represents what a perfect proportion estimation would give, with a Mean Squared Error (MSE) of 0.
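The MSE is just the average squared difference between predicted and actual proportions; a minimal sketch with hypothetical values (not the actual results above) shows that mean_squared_error computes exactly this:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([0.9, 0.8, 0.7])        # hypothetical human_prop values
predicted = np.array([0.11, 0.95, 0.94])  # hypothetical predicted proportions

# MSE = mean of the squared residuals
mse_manual = np.mean((predicted - actual) ** 2)
assert np.isclose(mse_manual, mean_squared_error(y_true=actual, y_pred=predicted))
print(round(mse_manual, 3))  # 0.235
```

Because the residuals are squared, a few badly estimated samples (like the first one here) dominate the score, which is why a scatter far from the diagonal inflates the MSE quickly.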

[25]:
sp_res_hist = (sp_res['human_prop'].append(sp_res['Homo_sapiens']).to_frame(name='Homo_sapiens_prop'))
sp_res_hist['source'] = (['actual']*sp_res.shape[0]+['predicted']*sp_res.shape[0])
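Note that Series.append, used in the cell above, was removed in pandas 2.0; on recent pandas the same long-format frame can be built with pd.concat (toy stand-in Series below, since the real sp_res columns are not reproduced here):

```python
import pandas as pd

actual = pd.Series([0.9, 0.8], name='x')       # stands in for sp_res['human_prop']
predicted = pd.Series([0.11, 0.95], name='x')  # stands in for sp_res['Homo_sapiens']

# pd.concat replaces the removed Series.append
sp_res_hist = pd.concat([actual, predicted]).to_frame(name='Homo_sapiens_prop')
sp_res_hist['source'] = ['actual'] * len(actual) + ['predicted'] * len(predicted)
print(sp_res_hist)
```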
[47]:
p = ggplot(data = sp_res_hist, mapping=aes(x='Homo_sapiens_prop')) + geom_density(aes(fill='source'), alpha=0.3)
p += labs(title = 'Distribution of Homo sapiens predicted proportions by Sourcepredict')
p += scale_fill_discrete(name="Homo sapiens proportion")
p += theme_classic()
p
_images/mixed_prop_37_0.png
[47]:
<ggplot: (-9223372029298930965)>

This plot shows the distributions of the actual Human proportions and of the proportions predicted by Sourcepredict. What we are interested in is the overlap between the two colors: the higher it is, the more accurate the estimated Human proportion.
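The visual overlap can also be quantified. One simple option (an illustrative choice, not something Sourcepredict computes) is the histogram intersection of the two samples, which is 1 for identical distributions and 0 for completely disjoint ones:

```python
import numpy as np

def hist_overlap(a, b, bins=10, value_range=(0, 1)):
    """Shared fraction of two samples' histograms (histogram intersection)."""
    ha, _ = np.histogram(a, bins=bins, range=value_range)
    hb, _ = np.histogram(b, bins=bins, range=value_range)
    ha = ha / ha.sum()  # normalize counts to fractions
    hb = hb / hb.sum()
    return np.minimum(ha, hb).sum()  # 1.0 = identical, 0.0 = disjoint

# A sample overlaps fully with itself; disjoint samples don't overlap at all
x = np.array([0.1, 0.5, 0.9])
print(hist_overlap(x, x))
print(hist_overlap(np.array([0.1]), np.array([0.9])))
```

Applied to the actual and predicted proportion columns, this would give a single number to compare Sourcepredict and Sourcetracker2 on, instead of eyeballing the density plots.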

Sourcetracker2

Preparing count table

[27]:
cnt_train.merge(mixed_samples, right_index=True, left_index=True).to_csv("st_mixed_count.csv" , sep="\t", index_label="TAXID")
[28]:
!biom convert -i st_mixed_count.csv -o st_mixed_count.biom --table-type="Taxon table" --to-json

Preparing metadata

[29]:
train_labels['SourceSink'] = ['source']*train_labels.shape[0]
[30]:
mixed_metadata['labels'] = ['-']*mixed_metadata.shape[0]
mixed_metadata['SourceSink'] = ['sink']*mixed_metadata.shape[0]
[31]:
st_labels = train_labels.append(mixed_metadata[['labels', 'SourceSink']])
[32]:
st_labels = st_labels.rename(columns={'labels':'Env'})[['SourceSink','Env']]
[33]:
st_labels.to_csv("st_mixed_labels.csv", sep="\t", index_label='#SampleID')
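The mapping file written above has one row per sample, with the SourceSink and Env columns that Sourcetracker2 expects; a toy sketch of the layout (the sample IDs below are made up):

```python
import pandas as pd

# Toy version of the Sourcetracker2 mapping table built above (IDs are made up)
st_toy = pd.DataFrame(
    {'SourceSink': ['source', 'source', 'sink'],
     'Env': ['Homo_sapiens', 'Canis_familiaris', '-']},
    index=pd.Index(['SRR000001', 'ERR000001', 'mixed_sample_1'], name='#SampleID'))
print(st_toy.to_csv(sep='\t'))
```

Sources carry their true label in Env, while sinks (the mixed samples) get a placeholder '-' since their composition is what we want to estimate.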
Running Sourcetracker2

sourcetracker2 gibbs -i st_mixed_count.biom -m st_mixed_labels.csv -o mixed_prop --jobs 6

(Sourcetracker2 was run on a Linux remote server because of issues running it on MacOS)

Sourcetracker2 results

[40]:
st_pred = pd.read_csv("mixed_prop/mixing_proportions.txt", sep="\t", index_col=0)
[41]:
st_res = st_pred.merge(mixed_metadata, left_index=True, right_index=True)
[42]:
mse_st = round(mean_squared_error(y_pred=st_res['Homo_sapiens'], y_true=st_res['human_prop']),2)
[43]:
p = ggplot(data = st_res, mapping=aes(x='human_prop',y='Homo_sapiens')) + geom_point()
p += labs(title = f"Homo sapiens proportions predicted by Sourcetracker2 - $MSE = {mse_st}$", x='actual', y='predicted')
p += theme_classic()
p += coord_cartesian(xlim=[0,1], ylim=[0,1])
p += geom_abline(intercept=0, slope=1, color = "red", alpha=0.2, linetype = 'dashed')
p
_images/mixed_prop_54_0.png
[43]:
<ggplot: (-9223372029300899661)>

On this plot, the dashed red line represents what a perfect proportion estimation would give, with a Mean Squared Error (MSE) of 0. Regarding the MSE, Sourcepredict and Sourcetracker2 perform similarly, with an MSE of 0.13 for Sourcepredict and 0.12 for Sourcetracker2.

[44]:
st_res_hist = (st_res['human_prop'].append(st_res['Homo_sapiens']).to_frame(name='Homo_sapiens_prop'))
st_res_hist['source'] = (['actual']*st_res.shape[0]+['predicted']*st_res.shape[0])
[46]:
p = ggplot(data = st_res_hist, mapping=aes(x='Homo_sapiens_prop')) + geom_density(aes(fill='source'), alpha=0.4)
p += labs(title = 'Distribution of Homo sapiens predicted proportions by Sourcetracker2')
p += scale_fill_discrete(name="Homo sapiens proportion")
p += theme_classic()
p
_images/mixed_prop_57_0.png
[46]:
<ggplot: (7555587890)>
This plot shows the distributions of the actual Human proportions and of the proportions predicted by Sourcetracker2. What we are interested in is the overlap between the two colors: the higher it is, the more accurate the estimated Human proportion.
Here, there is a bigger overlap between actual and predicted, suggesting a slightly better source proportion estimation than with Sourcepredict.

Conclusion

For source proportion estimation in samples of mixed sources, we’ve seen that Sourcepredict, with adapted parameters, can perform similarly to Sourcetracker2.
However, because Sourcepredict was designed with source prediction in mind, as opposed to source proportion estimation, it requires parameter tweaking to achieve the same results as Sourcetracker2.
Therefore, for source proportion estimation, we still recommend using Sourcetracker2, even if Sourcepredict can perform similarly.