Sourcepredict example2: Estimating source proportions¶

For this example, we’ll reuse the dog, human, and soil dataset.

But unlike example 1, here we will mix samples from different sources and estimate the mixing proportions with Sourcepredict and Sourcetracker2.

Preparing mixed samples¶

[1]:

import pandas as pd
from plotnine import *
import numpy as np

[2]:

cnt = pd.read_csv("../data/modern_gut_microbiomes_sources.csv", index_col=0)
labels = pd.read_csv("../data/modern_gut_microbiomes_labels.csv",index_col=0)

As in example 1, we’ll first split the dataset into training (95%) and testing (5%) sets.

[3]:

cnt_train = cnt.sample(frac=0.95, axis=1)
cnt_test = cnt.drop(cnt_train.columns, axis=1)
train_labels = labels.loc[cnt_train.columns,:]
test_labels = labels.loc[cnt_test.columns,:]

[4]:

test_labels['labels'].value_counts()

[4]:

Homo_sapiens        11
Canis_familiaris     9
Soil                 2
Name: labels, dtype: int64

[5]:

cnt_test.head()

[5]:

	SRR061456	SRR1175013	SRR059395	SRR1930141	SRR1930247	SRR1761710	SRR1761700	SRR7658605	SRR7658665	SRR7658625	...	ERR1916299	ERR1914213	ERR1915363	ERR1913953	ERR1913947	ERR1916319	ERR1915204	ERR1914750	mgm4477803_3	mgm4477877_3
TAXID
0	14825759.0	4352892.0	13691926.0	27457943.0	1212101.0	24026729.0	18876667.0	7776902.0	34166674.0	15983447.0	...	1481835.0	3254064.0	2182037.0	2225689.0	2292745.0	1549960.0	760058.0	1750182.0	5492333.0	6004642.0
6	0.0	0.0	107.0	193.0	0.0	87.0	94.0	0.0	215.0	105.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	73.0	216.0
7	0.0	0.0	107.0	193.0	0.0	87.0	94.0	0.0	215.0	105.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	73.0	216.0
9	96.0	101.0	70.0	412.0	0.0	395.0	199.0	299.0	563.0	369.0	...	62.0	63.0	110.0	59.0	63.0	51.0	0.0	61.0	0.0	0.0
10	249.0	0.0	136.0	614.0	0.0	265.0	267.0	76.0	985.0	350.0	...	66.0	174.0	0.0	74.0	73.0	62.0	0.0	66.0	0.0	0.0

5 rows × 22 columns

We then create a function that randomly selects one sample from each source (dog as \(s_{dog}\) and human as \(s_{human}\)) and combines them into a new sample \(s_{mixed} = p_1 \cdot s_{dog} + (1 - p_1) \cdot s_{human}\), where \(p_1\) corresponds to the p1 argument of the function.

[6]:

def create_mixed_sample(cnt, labels, p1, samp_name):
    rand_dog = labels.query('labels == "Canis_familiaris"').sample(1).index[0]
    rand_human = labels.query('labels == "Homo_sapiens"').sample(1).index[0]
    dog_samp = cnt[rand_dog]*p1
    human_samp = cnt[rand_human]*(1-p1)
    comb = dog_samp + human_samp
    comb = comb.rename(samp_name)
    meta = pd.DataFrame({'human_sample':[rand_human],'dog_sample':[rand_dog], 'human_prop':[(1-p1)], 'dog_prop':[p1]}, index=[samp_name])
    return(comb, meta)
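
To make the weighted sum concrete, here is a minimal sketch in plain Python that runs without the data files. The taxon names, the counts, and the helper mix_counts are hypothetical and only illustrate the formula; they are not part of Sourcepredict or of the dataset.

```python
# Toy counts per taxon (hypothetical values, not taken from the dataset).
dog_counts = {"taxon_a": 100, "taxon_b": 0, "taxon_c": 50}
human_counts = {"taxon_a": 0, "taxon_b": 200, "taxon_c": 50}


def mix_counts(dog, human, p1):
    """Return p1 * dog + (1 - p1) * human for every taxon."""
    taxa = sorted(set(dog) | set(human))
    return {t: p1 * dog.get(t, 0) + (1 - p1) * human.get(t, 0) for t in taxa}


mixed = mix_counts(dog_counts, human_counts, p1=0.25)
# Truncate to integers, as the notebook later does with .astype(int).
mixed_int = {taxon: int(value) for taxon, value in mixed.items()}
print(mixed_int)  # {'taxon_a': 25, 'taxon_b': 150, 'taxon_c': 50}
```

With p1 = 0.25, a quarter of each dog count and three quarters of each human count end up in the mixed sample.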

We run this function for a range of mixing proportions (10% to 90%, in steps of 10%), 3 times for each proportion.

[7]:

mixed_samp = []
mixed_meta = []
nb = 1
for i in range(3):
    for p1 in np.arange(0.1,1,0.1):
        s = create_mixed_sample(cnt=cnt_test, labels=test_labels, p1=p1, samp_name=f"mixed_sample_{nb}")
        mixed_samp.append(s[0])
        mixed_meta.append(s[1])
        nb += 1

[8]:

mixed_samples = pd.concat(mixed_samp, axis=1, keys=[s.name for s in mixed_samp]).astype(int)
mixed_samples.head()

[8]:

	mixed_sample_1	mixed_sample_2	mixed_sample_3	mixed_sample_4	mixed_sample_5	mixed_sample_6	mixed_sample_7	mixed_sample_8	mixed_sample_9	mixed_sample_10	...	mixed_sample_18	mixed_sample_19	mixed_sample_20	mixed_sample_21	mixed_sample_22	mixed_sample_23	mixed_sample_24	mixed_sample_25	mixed_sample_26	mixed_sample_27
TAXID
0	1320165	27791888	14189886	14720060	18710369	1414816	2910789	4518936	2282396	7217415	...	5331330	24788154	2020394	5968886	3231719	10313424	8480642	5192549	5175478	4809264
6	0	172	65	52	107	0	0	21	10	0	...	8	173	0	0	0	47	37	32	18	19
7	0	172	65	52	107	0	0	21	10	0	...	8	173	0	0	0	47	37	32	18	19
9	6	463	158	237	313	30	74	61	36	280	...	96	370	132	227	81	130	110	56	88	97
10	7	802	239	159	579	37	51	86	34	68	...	183	552	113	73	24	166	144	84	106	127

5 rows × 27 columns

[9]:

mixed_metadata = pd.concat(mixed_meta)
mixed_metadata.head()

[9]:

	human_sample	dog_sample	human_prop	dog_prop
mixed_sample_1	SRR1930247	ERR1913947	0.9	0.1
mixed_sample_2	SRR7658665	ERR1913947	0.8	0.2
mixed_sample_3	SRR1761700	ERR1914213	0.7	0.3
mixed_sample_4	SRR1761710	ERR1915204	0.6	0.4
mixed_sample_5	SRR7658665	ERR1914213	0.5	0.5

Now we can export the new “test” (sink) table to CSV for Sourcepredict.

[10]:

mixed_samples.to_csv('mixed_samples_cnt.csv')

We also export the count and labels tables for the sources.

[11]:

train_labels.to_csv('train_labels.csv')
cnt_train.to_csv('sources_cnt.csv')

Sourcepredict¶

For running Sourcepredict, we’ll change three parameters from their default values:

- -me: the embedding method. The default method used by Sourcepredict is t-SNE, a non-linear type of embedding, i.e. the distance between points in the embedding doesn’t reflect their actual distance in the original dimensions. This achieves a better clustering, which is good for source prediction. Because here we’re more interested in source proportion estimation than in source prediction, we’ll choose Multidimensional Scaling (MDS), a type of linear embedding in which the distances between points in the lower dimension more closely match the distances in the original dimensions, which is better for source proportion estimation.
- -kne: the number of neighbors in the KNN algorithm. We use a greater number of neighbors (50) to reflect a more global contribution of samples to the proportion estimation, instead of only the immediate neighbors. This will negatively affect the source prediction, but give better source proportion estimations.
- -kw: the weight function in the KNN algorithm. By default, a distance-based weight function is applied to give more weight to closer samples. However, since here we’re more interested in source proportion estimation than in source prediction, we’ll disregard the distance-based weight function and give the same weight to all neighboring samples, regardless of their distance, with the uniform weight function.
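
The effect of the weight function can be sketched with a small, self-contained KNN vote. This is only an illustration under the assumption that the proportions come from a standard (optionally distance-weighted) neighbor vote; the coordinates, the helper knn_proportions, and its inverse-distance weighting are hypothetical and are not Sourcepredict’s actual code.

```python
import math
from collections import defaultdict

# Hypothetical 2-D embedding: (x, y, label). Not taken from the dataset.
sources = [
    (0.0, 0.0, "Homo_sapiens"),
    (0.2, 0.1, "Homo_sapiens"),
    (1.0, 1.0, "Canis_familiaris"),
    (1.1, 0.9, "Canis_familiaris"),
    (0.9, 1.2, "Canis_familiaris"),
]


def knn_proportions(point, sources, k, weights="uniform"):
    """Estimate class proportions from the k nearest labelled points."""
    by_distance = sorted(
        (math.dist(point, (x, y)), label) for x, y, label in sources
    )
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        # "uniform": every neighbor counts equally.
        # "distance": closer neighbors count more (inverse distance).
        votes[label] += 1.0 if weights == "uniform" else 1.0 / max(distance, 1e-12)
    total = sum(votes.values())
    return {label: vote / total for label, vote in votes.items()}


sink = (0.3, 0.3)  # hypothetical sink, closer to the human cluster
print(knn_proportions(sink, sources, k=5, weights="uniform"))
print(knn_proportions(sink, sources, k=5, weights="distance"))
```

With uniform weights the estimate only reflects how many neighbors of each class are counted, whereas distance weights pull the estimate towards the closest cluster, which suits prediction more than proportion estimation.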

[12]:

%%time
!python ../sourcepredict -s sources_cnt.csv \
               -l train_labels.csv \
               -n GMPR \
               -kne 50 \
               -kw uniform \
               -me MDS \
               -e mixed_embedding.csv \
               -t 6 \
               mixed_samples_cnt.csv

Step 1: Checking for unknown proportion
  == Sample: mixed_sample_1 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_1
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_2 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.99
        ----------------------
        - Sample: mixed_sample_2
                 known:99.73%
                 unknown:0.27%
  == Sample: mixed_sample_3 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_3
                 known:98.79%
                 unknown:1.21%
  == Sample: mixed_sample_4 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_4
                 known:98.8%
                 unknown:1.2%
  == Sample: mixed_sample_5 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_5
                 known:98.82%
                 unknown:1.18%
  == Sample: mixed_sample_6 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_6
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_7 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_7
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_8 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_8
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_9 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_9
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_10 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_10
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_11 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.99
        ----------------------
        - Sample: mixed_sample_11
                 known:99.16%
                 unknown:0.84%
  == Sample: mixed_sample_12 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_12
                 known:98.5%
                 unknown:1.5%
  == Sample: mixed_sample_13 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_13
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_14 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_14
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_15 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_15
                 known:98.49%
                 unknown:1.51%
  == Sample: mixed_sample_16 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_16
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_17 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_17
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_18 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_18
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_19 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 0.9
        ----------------------
        - Sample: mixed_sample_19
                 known:99.79%
                 unknown:0.21%
  == Sample: mixed_sample_20 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_20
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_21 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_21
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_22 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_22
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_23 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_23
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_24 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_24
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_25 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_25
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_26 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_26
                 known:98.48%
                 unknown:1.52%
  == Sample: mixed_sample_27 ==
        Adding unknown
        Normalizing (GMPR)
        Computing Bray-Curtis distance
        Performing MDS embedding in 2 dimensions
        KNN machine learning
        Training KNN classifier on 6 cores...
        -> Testing Accuracy: 1.0
        ----------------------
        - Sample: mixed_sample_27
                 known:98.48%
                 unknown:1.52%
Step 2: Checking for source proportion
        Computing weighted_unifrac distance on species rank
        MDS embedding in 2 dimensions
        KNN machine learning
        Trained KNN classifier with 50 neighbors
        -> Testing Accuracy: 0.88
        ----------------------
        - Sample: mixed_sample_1
                 Canis_familiaris:87.65%
                 Homo_sapiens:11.14%
                 Soil:1.21%
        - Sample: mixed_sample_2
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_3
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_4
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_5
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_6
                 Canis_familiaris:92.78%
                 Homo_sapiens:6.02%
                 Soil:1.2%
        - Sample: mixed_sample_7
                 Canis_familiaris:54.39%
                 Homo_sapiens:44.26%
                 Soil:1.35%
        - Sample: mixed_sample_8
                 Canis_familiaris:36.94%
                 Homo_sapiens:61.66%
                 Soil:1.4%
        - Sample: mixed_sample_9
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_10
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_11
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_12
                 Canis_familiaris:30.32%
                 Homo_sapiens:68.27%
                 Soil:1.41%
        - Sample: mixed_sample_13
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_14
                 Canis_familiaris:30.32%
                 Homo_sapiens:68.27%
                 Soil:1.41%
        - Sample: mixed_sample_15
                 Canis_familiaris:24.34%
                 Homo_sapiens:74.24%
                 Soil:1.41%
        - Sample: mixed_sample_16
                 Canis_familiaris:30.32%
                 Homo_sapiens:68.27%
                 Soil:1.41%
        - Sample: mixed_sample_17
                 Canis_familiaris:27.24%
                 Homo_sapiens:71.35%
                 Soil:1.41%
        - Sample: mixed_sample_18
                 Canis_familiaris:83.76%
                 Homo_sapiens:15.02%
                 Soil:1.22%
        - Sample: mixed_sample_19
                 Canis_familiaris:24.34%
                 Homo_sapiens:74.24%
                 Soil:1.41%
        - Sample: mixed_sample_20
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_21
                 Canis_familiaris:5.65%
                 Homo_sapiens:93.02%
                 Soil:1.33%
        - Sample: mixed_sample_22
                 Canis_familiaris:40.4%
                 Homo_sapiens:58.2%
                 Soil:1.4%
        - Sample: mixed_sample_23
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_24
                 Canis_familiaris:3.66%
                 Homo_sapiens:95.03%
                 Soil:1.31%
        - Sample: mixed_sample_25
                 Canis_familiaris:30.32%
                 Homo_sapiens:68.27%
                 Soil:1.41%
        - Sample: mixed_sample_26
                 Canis_familiaris:21.65%
                 Homo_sapiens:76.94%
                 Soil:1.41%
        - Sample: mixed_sample_27
                 Canis_familiaris:85.18%
                 Homo_sapiens:13.6%
                 Soil:1.22%
Sourcepredict result written to mixed_samples_cnt.sourcepredict.csv
Embedding coordinates written to mixed_embedding.csv
CPU times: user 4.84 s, sys: 1.13 s, total: 5.97 s
Wall time: 5min 14s

Reading Sourcepredict results

[13]:

sp_ebd = pd.read_csv("mixed_embedding.csv", index_col=0)

[14]:

sp_ebd.head()

[14]:

	PC1	PC2	labels	name
SRR1761712	-3.713176	-0.344326	Homo_sapiens	SRR1761712
ERR1913614	2.586560	-0.498098	Canis_familiaris	ERR1913614
ERR1914349	-1.595982	-3.363762	Canis_familiaris	ERR1914349
SRR1930255	2.174966	-0.728862	Homo_sapiens	SRR1930255
SRR1646027	-3.329213	-0.214682	Homo_sapiens	SRR1646027

[15]:

import warnings
warnings.filterwarnings('ignore')

[16]:

ggplot(data = sp_ebd, mapping = aes(x='PC1',y='PC2')) + geom_point(aes(color='labels')) + theme_classic()

[16]:

[Figure: scatter plot of the embedding (PC1 vs PC2), points colored by label]

[17]:

sp_pred = pd.read_csv("mixed_samples_cnt.sourcepredict.csv", index_col=0)

[18]:

sp_pred.T.head()

[18]:

	Canis_familiaris	Homo_sapiens	Soil	unknown
mixed_sample_1	0.863266	0.109678	0.011905	0.015152
mixed_sample_2	0.036542	0.947680	0.013046	0.002733
mixed_sample_3	0.036198	0.938776	0.012923	0.012103
mixed_sample_4	0.055810	0.919077	0.013163	0.011951
mixed_sample_5	0.036211	0.939098	0.012927	0.011764

[19]:

mixed_metadata.head()

[19]:

	human_sample	dog_sample	human_prop	dog_prop
mixed_sample_1	SRR1930247	ERR1913947	0.9	0.1
mixed_sample_2	SRR7658665	ERR1913947	0.8	0.2
mixed_sample_3	SRR1761700	ERR1914213	0.7	0.3
mixed_sample_4	SRR1761710	ERR1915204	0.6	0.4
mixed_sample_5	SRR7658665	ERR1914213	0.5	0.5

[20]:

sp_res = sp_pred.T.merge(mixed_metadata, left_index=True, right_index=True)

[21]:

sp_res.head()

[21]:

	Canis_familiaris	Homo_sapiens	Soil	unknown	human_sample	dog_sample	human_prop	dog_prop
mixed_sample_1	0.863266	0.109678	0.011905	0.015152	SRR1930247	ERR1913947	0.9	0.1
mixed_sample_2	0.036542	0.947680	0.013046	0.002733	SRR7658665	ERR1913947	0.8	0.2
mixed_sample_3	0.036198	0.938776	0.012923	0.012103	SRR1761700	ERR1914213	0.7	0.3
mixed_sample_4	0.055810	0.919077	0.013163	0.011951	SRR1761710	ERR1915204	0.6	0.4
mixed_sample_5	0.036211	0.939098	0.012927	0.011764	SRR7658665	ERR1914213	0.5	0.5

[22]:

from sklearn.metrics import r2_score, mean_squared_error

[23]:

mse_sp = round(mean_squared_error(y_pred=sp_res['Homo_sapiens'], y_true=sp_res['human_prop']),2)

[24]:

p = ggplot(data = sp_res, mapping=aes(x='human_prop',y='Homo_sapiens')) + geom_point()
p += labs(title = f"Homo sapiens proportions predicted by Sourcepredict - $MSE = {mse_sp}$", x='actual', y='predicted')
p += theme_classic()
p += coord_cartesian(xlim=[0,1], ylim=[0,1])
p += geom_abline(intercept=0, slope=1, color = "red", alpha=0.2, linetype = 'dashed')
p

[24]:

[Figure: predicted vs actual Homo sapiens proportions (Sourcepredict), with a dashed red identity line]

On this plot, the dashed red line represents what a perfect proportion estimation would give, with a Mean Squared Error (MSE) = 0.
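
As a reminder of what the MSE measures, here is a minimal plain-Python sketch. It uses only the five rows displayed in the table above, so its value is not the MSE of the full set of 27 mixed samples; the helper name mean_squared_error_plain is hypothetical and simply mirrors what sklearn’s mean_squared_error computes.

```python
# First five rows shown above: actual human_prop and predicted Homo_sapiens.
actual = [0.9, 0.8, 0.7, 0.6, 0.5]
predicted = [0.109678, 0.947680, 0.938776, 0.919077, 0.939098]


def mean_squared_error_plain(y_true, y_pred):
    """Average of the squared differences between actual and predicted."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


print(mean_squared_error_plain(actual, actual))     # perfect estimation
print(mean_squared_error_plain(actual, predicted))  # five-row subset only
```

A perfect estimation gives 0, and every point that moves away from the identity line adds its squared vertical distance to the average.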

[25]:

# Series.append was removed in pandas 2.0; pd.concat stacks the two columns the same way
sp_res_hist = pd.concat([sp_res['human_prop'], sp_res['Homo_sapiens']]).to_frame(name='Homo_sapiens_prop')
sp_res_hist['source'] = (['actual']*sp_res.shape[0]+['predicted']*sp_res.shape[0])

[47]:

p = ggplot(data = sp_res_hist, mapping=aes(x='Homo_sapiens_prop')) + geom_density(aes(fill='source'), alpha=0.3)
p += labs(title = 'Distribution of Homo sapiens predicted proportions by Sourcepredict')
p += scale_fill_discrete(name="Homo sapiens proportion")
p += theme_classic()
p

[47]:

[Figure: density plot of actual and predicted Homo sapiens proportions (Sourcepredict)]

This plot shows the distributions of the actual human proportions and of those predicted by Sourcepredict. What we are interested in is the overlap between the two colors: the higher it is, the more accurate the estimated human proportion.
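
The overlap is judged visually in this notebook. If a number is wanted, one possible stand-in is a histogram intersection, sketched here in plain Python. This is our own assumption, not what geom_density computes: the helper names, the choice of 10 bins, and the use of the five rows displayed earlier are illustrative only.

```python
def histogram(values, bins=10):
    """Fraction of values falling in each of `bins` equal-width bins on [0, 1]."""
    counts = [0] * bins
    for v in values:
        index = min(int(v * bins), bins - 1)  # put 1.0 in the last bin
        counts[index] += 1
    return [c / len(values) for c in counts]


def overlap(a, b, bins=10):
    """Histogram intersection: close to 1 for identical histograms, 0 for disjoint ones."""
    return sum(min(x, y) for x, y in zip(histogram(a, bins), histogram(b, bins)))


# Five rows displayed earlier (actual human_prop, predicted Homo_sapiens).
actual = [0.9, 0.8, 0.7, 0.6, 0.5]
predicted = [0.109678, 0.947680, 0.938776, 0.919077, 0.939098]
print(overlap(actual, actual))
print(overlap(actual, predicted))
```

Note that such an overlap compares the two distributions as a whole, not sample by sample, so it complements the MSE rather than replacing it.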

Sourcetracker2¶

Preparing count table

[27]:

cnt_train.merge(mixed_samples, right_index=True, left_index=True).to_csv("st_mixed_count.csv" , sep="\t", index_label="TAXID")

[28]:

!biom convert -i st_mixed_count.csv -o st_mixed_count.biom --table-type="Taxon table" --to-json

Preparing metadata

[29]:

train_labels['SourceSink'] = ['source']*train_labels.shape[0]

[30]:

mixed_metadata['labels'] = ['-']*mixed_metadata.shape[0]
mixed_metadata['SourceSink'] = ['sink']*mixed_metadata.shape[0]

[31]:

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same stacked table
st_labels = pd.concat([train_labels, mixed_metadata[['labels', 'SourceSink']]])

[32]:

st_labels = st_labels.rename(columns={'labels':'Env'})[['SourceSink','Env']]

[33]:

st_labels.to_csv("st_mixed_labels.csv", sep="\t", index_label='#SampleID')

Running Sourcetracker2

sourcetracker2 gibbs -i st_mixed_count.biom -m st_mixed_labels.csv -o mixed_prop --jobs 6

(Sourcetracker2 was run on a remote Linux server because of issues running it on macOS.)

Sourcetracker2 results

[40]:

st_pred = pd.read_csv("mixed_prop/mixing_proportions.txt", sep="\t", index_col=0)

[41]:

st_res = st_pred.merge(mixed_metadata, left_index=True, right_index=True)

[42]:

mse_st = round(mean_squared_error(y_pred=st_res['Homo_sapiens'], y_true=st_res['human_prop']),2)

[43]:

p = ggplot(data = st_res, mapping=aes(x='human_prop',y='Homo_sapiens')) + geom_point()
p += labs(title = f"Homo sapiens proportions predicted by Sourcetracker2 - $MSE = {mse_st}$", x='actual', y='predicted')
p += theme_classic()
p += coord_cartesian(xlim=[0,1], ylim=[0,1])
p += geom_abline(intercept=0, slope=1, color = "red", alpha=0.2, linetype = 'dashed')
p

[43]:

[Figure: predicted vs actual Homo sapiens proportions (Sourcetracker2), with a dashed red identity line]

On this plot, the dashed red line represents what a perfect proportion estimation would give, with a Mean Squared Error (MSE) = 0. Regarding the MSE, Sourcepredict and Sourcetracker perform similarly, with an MSE of 0.13 for Sourcepredict and 0.12 for Sourcetracker.

[44]:

# Series.append was removed in pandas 2.0; pd.concat stacks the two columns the same way
st_res_hist = pd.concat([st_res['human_prop'], st_res['Homo_sapiens']]).to_frame(name='Homo_sapiens_prop')
st_res_hist['source'] = (['actual']*st_res.shape[0]+['predicted']*st_res.shape[0])

[46]:

p = ggplot(data = st_res_hist, mapping=aes(x='Homo_sapiens_prop')) + geom_density(aes(fill='source'), alpha=0.4)
p += labs(title = 'Distribution of Homo sapiens predicted proportions by Sourcetracker2')
p += scale_fill_discrete(name="Homo sapiens proportion")
p += theme_classic()
p

[46]:

[Figure: density plot of actual and predicted Homo sapiens proportions (Sourcetracker2)]

This plot shows the distributions of the actual human proportions and of those predicted by Sourcetracker2. What we are interested in is the overlap between the two colors: the higher it is, the more accurate the estimated human proportion.

Here, there is a bigger overlap between actual and predicted, suggesting a slightly better source proportion estimation than with Sourcepredict.

Conclusion¶

For source proportion estimation in samples of mixed sources, we’ve seen that Sourcepredict, with adapted parameters, can perform similarly to Sourcetracker.
However, because Sourcepredict was designed with source prediction in mind, as opposed to source proportion estimation, it requires parameter tweaking to achieve the same results as Sourcetracker.
Therefore, for source proportion estimation, we still recommend using Sourcetracker, even if Sourcepredict can perform similarly.