Part 2 : Problem Formulation / Patient Similarity Network Generation #

In this section we will formulate a task to predict if an individual in the Generation Scotland dataset is a smoker or not. We will formulate the task using the Patient Similarity Networks generated earlier. We will also highlight that not every network is created equal.

Networks are flexible and we can leverage/incorporate additional informative information. We have used the EWAS Catalog to identify CpG sites affected by smoking. We will compare the effectiveness of GNN’s when incorporating external informative versus doing a direct correlation network.

Import Functions and Packages #

import pickle
import pandas as pd
import numpy as np
import networkx as nx
from palettable import wesanderson
import sys
sys.path.insert(0 , '/tutorial/')
from functions import *

Import and Process Generation Scotland Data#

First we will load the data from Generation Scotland and create labels for individuals who have smoked before using the ‘pack_years’ feature

'''
#################################################
               YOUR CODE HERE
#################################################

Extract 'DNAm_w1' (Generation Scotland Wave 1) from loaded_data pkl file, dropping any na's from the dataset

Extract 'Phenotypes' from loaded_data pkl file, set the index as `Sample_SentrixID` and subset to wave 1
'''

with open('/data/raw/ISMB_GS_Data.pkl' , 'rb') as file : 
    loaded_data = pd.read_pickle(file)
w1 = loaded_data['DNAm_w1'].T.dropna() # we are only working with Wave 1 patients for this task
phenotypes = loaded_data['Phenotypes'].set_index('Sample_SentrixID').loc[w1.index] # get the phenotypes for the wave 1 patients
# Create a new column in the phenotypes dataframe that categorizes patients as either smokers or non-smokers
def smoking_cat(row) : 
    if row['pack_years'] == 0 :
        return 'Non-Smoker'
    else :
        return 'Smoker'

phenotypes['Smoking'] = phenotypes.apply(smoking_cat , axis = 1) # apply the function to the dataframe
# For efficiency of graph training and learning we will downsample the dataset size to 1000 patients
never_smoked = phenotypes[phenotypes['Smoking'] == 'Non-Smoker'].sample(500) # sample 500 non-smokers
smoked = phenotypes[phenotypes['Smoking'] == 'Smoker'].sample(500) # sample 500 smokers

phenotypes = pd.concat([never_smoked , smoked]) # combine the two dataframes
w1 = w1.loc[phenotypes.index] # filter the methylation data to only include the patients in the downsampled phenotypes dataframe
'''
#################################################
               YOUR CODE HERE
#################################################

Plot the distribution of smokers and non-smokers
'''

phenotypes['Smoking'].hist() # plot the distribution of smokers and non-smokers
plt.title('Distribution of Smokers and Non-Smokers')
Text(0.5, 1.0, 'Distribution of Smokers and Non-Smokers')
../_images/2a7798ab4d6877baceea57d7d8f871ba06d2530c67cfd208f6911d68e4c8edb1.png

EWAS Network Generation #

Patient Similarity Network Generation#

'''
#################################################
                YOUR CODE HERE
#################################################

We want to include the most informative CpG sites, so we will only include CpG sites that have been annotated multiple times with smoking

1. Count the number of times a CpG site is present in the smoking.tsv dataset
2. If a CpG is present in the dataset more than 10 times add it to the list common_annotated_cpgs
3. Filter the methylation data to only include the CpG sites in common_annotated_cpgs
4. Call the new dataframe w1_filt

'''
common_annotated_cpgs = list(df['cpg'].value_counts()[df['cpg'].value_counts() > 10].index) 

cpgs = set(common_annotated_cpgs) & set(w1.columns)

w1_filt = w1.loc[ : , list(cpgs)]
'''
#################################################
                YOUR CODE HERE
#################################################

We want to create our patient similarity matrix using the filtered methylation data
We will use biweight midcorrelation to calculate the similarity between patients 
The function for this is given as abs_bicorr

'''
patient_similarity_matrix = abs_bicorr(w1_filt.T)
'''
#################################################
                YOUR CODE HERE
#################################################
We want to plot our Patient Similarity Network using the function plot_knn_network() from functions.py

Specify the node colours by assigning a colour to each phenotype using the .astype(`category`).cat.codes panda functionality

'''
node_colour = phenotypes['Smoking'].astype('category').cat.set_categories(wesanderson.FantasticFox2_5.hex_colors , rename=True) # set the colours for the nodes

print(f"{phenotypes['Smoking'].astype('category').cat.categories[0]} : 0 \n{phenotypes['Smoking'].astype('category').cat.categories[1]} : 1") # print the mapping of the colours to the categories

G = plot_knn_network(patient_similarity_matrix , 25 , phenotypes['Smoking'] , node_colours=node_colour) # plot the network
plt.title('EWAS Smoking Patient Similarity Network')
legend_handles = gen_graph_legend(node_colour , G , 'label') # generate the legend
plt.legend(handles = legend_handles)
plt.show()
Non-Smoker : 0 
Smoker : 1
../_images/9625d2fd8a6c4bdb8d693bc72024692bca6e3d3de24c1c828926533458c926fd.png

Save Network and Data for GNN Learning #

# Put the data in a single dictionary
data = {'Feat' : w1 , 'phenotypes' : phenotypes , 'PSN_raw' : [] , 'PSN_EWAS' : G}

# File path to save the pickle file
file_path = '/data/intermediate/GCN_Data.pkl'

# Save data to a pickle file
with open(file_path, 'wb') as f:
    pickle.dump(data, f)