Much has been said over the last few years about how precision medicine and, more specifically, genetic testing will disrupt the treatment of diseases such as cancer.
But this is still only partially happening, because of the huge amount of manual work still required. In this project we will try to take personalized medicine closer to its full potential. Once sequenced, a cancer tumor can have thousands of genetic mutations. The challenge is to distinguish the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).
Currently, this interpretation of genetic mutations is done manually. It is a very time-consuming task, in which a clinical pathologist has to manually review and classify every genetic mutation based on evidence from text-based clinical literature.
For this project, MSKCC (Memorial Sloan Kettering Cancer Center) is making available an expert-annotated knowledge base, in which world-class researchers and oncologists have manually annotated thousands of mutations.
Dataset: https://www.kaggle.com/c/msk-redefining-cancer-treatment/overview
-- Objectives
- Achieve at least 65% accuracy.
- Log Loss below 1.0 (a short sketch of how both metrics are computed follows below).
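As a reference for these two objectives, below is a minimal sketch (with hypothetical y_true and y_pred_proba arrays, not real model output) of how accuracy and multi-class Log Loss can be computed with scikit-learn.

# Sketch only: measuring the two project targets on hypothetical predictions
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = [1, 4, 7, 2]                      # true classes (labels 1 to 9)
y_pred_proba = np.array([                  # predicted probability for each of the 9 classes
    [0.70, 0.05, 0.05, 0.05, 0.05, 0.03, 0.03, 0.02, 0.02],
    [0.05, 0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.84, 0.02, 0.02],
    [0.10, 0.50, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
])

y_pred = y_pred_proba.argmax(axis = 1) + 1                                       # classes are labeled 1..9
print('Accuracy:', accuracy_score(y_true, y_pred))                               # goal: >= 0.65
print('Log Loss:', log_loss(y_true, y_pred_proba, labels = list(range(1, 10))))  # goal: < 1.0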
import nltk
import spacy
import re
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import xgboost as xgb
import shap
from pathlib import Path
from warnings import simplefilter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, log_loss, precision_score, recall_score, f1_score
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from scipy import sparse
from os.path import isfile
from scipy import stats
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
# Versions of the packages used in this Jupyter notebook
%reload_ext watermark
%watermark -a "Herikc Brecher" --iversions
Author: Herikc Brecher

numpy     : 1.19.5
re        : 2.2.1
pandas    : 1.2.4
xgboost   : 1.4.2
nltk      : 3.6.1
seaborn   : 0.11.1
scipy     : 1.7.0
matplotlib: 3.3.4
spacy     : 3.1.0
shap      : 0.39.0
simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
sns.set_theme()
seed_ = 194
np.random.seed(seed_)
# Loading the training datatable with the variants
variant = pd.read_csv('data/training_variants')
# Loading the training datatable with the case texts
text_data = pd.read_csv('data/training_text', sep = '\|\|', engine = 'python', names = ['ID', 'Text'], skiprows = 1)
variant.head()
| | ID | Gene | Variation | Class |
|---|---|---|---|---|
0 | 0 | FAM58A | Truncating Mutations | 1 |
1 | 1 | CBL | W802* | 2 |
2 | 2 | CBL | Q249E | 2 |
3 | 3 | CBL | N454D | 3 |
4 | 4 | CBL | L399V | 4 |
text_data.head()
| | ID | Text |
|---|---|---|
0 | 0 | Cyclin-dependent kinases (CDKs) regulate a var... |
1 | 1 | Abstract Background Non-small cell lung canc... |
2 | 2 | Abstract Background Non-small cell lung canc... |
3 | 3 | Recent evidence has demonstrated that acquired... |
4 | 4 | Oncogenic mutations in the monomeric Casitas B... |
# Merging the two datatables on the 'ID' column
train_data = pd.merge(variant, text_data, on = 'ID', how = 'left')
train_data.head()
| | ID | Gene | Variation | Class | Text |
|---|---|---|---|---|---|
0 | 0 | FAM58A | Truncating Mutations | 1 | Cyclin-dependent kinases (CDKs) regulate a var... |
1 | 1 | CBL | W802* | 2 | Abstract Background Non-small cell lung canc... |
2 | 2 | CBL | Q249E | 2 | Abstract Background Non-small cell lung canc... |
3 | 3 | CBL | N454D | 3 | Recent evidence has demonstrated that acquired... |
4 | 4 | CBL | L399V | 4 | Oncogenic mutations in the monomeric Casitas B... |
train_data.describe(include = 'all')
| | ID | Gene | Variation | Class | Text |
|---|---|---|---|---|---|
count | 3321.000000 | 3321 | 3321 | 3321.000000 | 3316 |
unique | NaN | 264 | 2996 | NaN | 1920 |
top | NaN | BRCA1 | Truncating Mutations | NaN | The PTEN (phosphatase and tensin homolog) phos... |
freq | NaN | 264 | 93 | NaN | 53 |
mean | 1660.000000 | NaN | NaN | 4.365854 | NaN |
std | 958.834449 | NaN | NaN | 2.309781 | NaN |
min | 0.000000 | NaN | NaN | 1.000000 | NaN |
25% | 830.000000 | NaN | NaN | 2.000000 | NaN |
50% | 1660.000000 | NaN | NaN | 4.000000 | NaN |
75% | 2490.000000 | NaN | NaN | 7.000000 | NaN |
max | 3320.000000 | NaN | NaN | 9.000000 | NaN |
# We have 3321 observations for training
train_data.shape
(3321, 5)
# Checking whether the column types are correct
train_data.dtypes
ID            int64
Gene         object
Variation    object
Class         int64
Text         object
dtype: object
We can see that the 'Text' column has missing values. For better generalization, and to avoid discarding those rows, the missing values will be filled with the concatenation of 'Gene' + 'Variation'.
# Checking for missing values
print(train_data.isna().sum())
ID           0
Gene         0
Variation    0
Class        0
Text         5
dtype: int64
It is interesting to note that there are 264 distinct 'Gene' values and 2996 distinct 'Variation' values, as well as 1920 distinct 'Text' values. Considering that we only have 3321 observations, the number of unique 'Variation' values is very high, which may hurt the model.
# Checking unique values
print(train_data.nunique())
ID           3321
Gene          264
Variation    2996
Class           9
Text         1920
dtype: int64
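To make this cardinality concrete, the sketch below (using the OneHotEncoder imported above; not necessarily the encoding used by the final model) shows how many sparse columns a plain one-hot encoding of these two features would create.

# Sketch only: one-hot encoding 'Gene' and 'Variation' just to inspect the resulting dimensionality
onehot_sketch = OneHotEncoder(handle_unknown = 'ignore')
qualitativas_onehot = onehot_sketch.fit_transform(train_data[['Gene', 'Variation']])

# Roughly 264 + 2996 sparse columns for only 3321 observations
print(qualitativas_onehot.shape)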
# Checking for duplicated rows
print(sum(train_data.duplicated()))
0
Some classes, such as 3, 9 and 8, have very few observations, which can hinder the model's learning and induce certain biases. Balancing the classes may help (a SMOTE sketch follows the class-distribution plot below).
# Creating a color vector
colors = ['r', 'g', 'b', 'y', 'k']
train_data['Class'].value_counts().plot(kind = 'bar', color = colors)
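Since SMOTE is imported from imblearn above, the sketch below shows one way the classes could be rebalanced. The TF-IDF matrix built here (tfidf_sketch / X_sketch) is hypothetical and only for illustration; the feature representation and balancing step actually used for the model may differ.

# Sketch only: oversampling the minority classes with SMOTE over a simple TF-IDF representation of 'Text'
tfidf_sketch = TfidfVectorizer(max_features = 1000)
X_sketch = tfidf_sketch.fit_transform(train_data['Text'].astype(str))

smote = SMOTE(random_state = seed_)
X_balanceado, y_balanceado = smote.fit_resample(X_sketch, train_data['Class'])

# After oversampling, every class has the same number of observations
print(pd.Series(y_balanceado).value_counts())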
# Functions for univariate analysis
def distribuicao_column(data, column, distribuicao_nao_cumulativa = True, distribuicao_cumulativa = True):
    # Computing the distribution of the column
    valores = data[column].value_counts()
    distribuicao_valores = valores / sum(valores.values)
    if distribuicao_nao_cumulativa:
        plt.plot(distribuicao_valores)
        plt.xlabel(column)
        plt.ylabel('Observation Rate')
        plt.show()
    if distribuicao_cumulativa:
        plt.plot(np.cumsum(distribuicao_valores))
        plt.xlabel(column)
        plt.ylabel('Observation Rate')
        plt.show()

def relevancia_classe(data, column, column_target, top = 10):
    # Share of the observations covered by the 'top' most frequent values of 'column'
    top_significancia = round((sum(data.groupby(by = column).count().sort_values(by = 'ID',
                                ascending = False).head(top)[column_target]) / data.shape[0]) * 100, 2)
    # Share of the unique values of 'column' that these 'top' values represent
    representacao_column = round((top / data.nunique()[column]) * 100, 2)
    print(representacao_column, '% of column', column, 'represents', top_significancia, '% of column', column_target)

def top_frequencias(data, column, column_target, ID, top = 10, colors = ['r', 'g', 'b', 'y', 'k']):
    # Bar plot of the 'top' most frequent values of 'column'
    data.groupby(by = column).count().sort_values(by = ID, ascending = False).head(top).plot(
        kind = 'bar', ylabel = 'Frequency', xlabel = column, y = column_target,
        color = colors)
To better understand the data, it is worth analyzing how the number of genes accumulates along the distribution, both non-cumulatively and cumulatively.
In the first plot it is noticeable that a few 'Genes' concentrate a much higher occurrence rate, which drops quickly for the remaining ones. In the second plot, once again, the curve grows faster at the beginning, tending towards an exponential-like shape.
distribuicao_column(train_data, 'Gene')
Analyzing the output below, the 10 most frequent Genes represent 36% of the class labels, so they carry the greatest relevance for our model. In other words, only 3.8% of our genes account for 36% of the classes.
relevancia_classe(train_data, 'Gene', 'Class')
3.79 % of column Gene represents 36.4 % of column Class
top_frequencias(train_data, 'Gene', 'Class', 'ID', colors = colors)
After analyzing the genes, it is necessary to analyze 'Variation', which we have already identified as having more unique values.
The 10 most relevant variations, i.e. 0.33% of the 'Variation' types, represent a total of 8.85% of the target classes.
relevancia_classe(train_data, 'Variation', 'Class')
0.33 % of column Variation represents 8.85 % of column Class
top_frequencias(train_data, 'Variation', 'Class', 'ID', colors = colors)
It is noticeable that a few 'Variations' have a much higher relevance than the others, while the rest have similar relevance. Looking at the cumulative distribution we see the same behavior: a jump at the beginning, followed by growth that becomes nearly constant.
distribuicao_column(train_data, 'Variation')
qualitativas = ['Gene', 'Variation']
def crosstab_column(data, col, target, percentage = True):
res = pd.crosstab(data[col], data[target], margins = True)
if percentage:
res = pd.crosstab(data[col], data[target], margins = True, normalize = 'index').round(4) * 100
return res
Looking at the data in a general way, it is noticeable that certain 'Gene' values are heavily concentrated in a single class; the same holds for 'Variation', which is even more specific.
For example, the 'Gene' ABL1 is 92.31% in class 2 and the remaining 7.69% in class 7, while the 'Variation' value '1_2009trunc' is 100% in class 1.
for col in qualitativas:
print(crosstab_column(train_data, col, 'Class'), end = '\n\n\n')
Class       1      2      3      4      5       6      7      8      9
Gene
ABL1      0.0  92.31   0.00   0.00   0.00    0.00   7.69   0.00   0.00
ACVR1     0.0  33.33   0.00   0.00   0.00    0.00  66.67   0.00   0.00
AGO2     80.0  20.00   0.00   0.00   0.00    0.00   0.00   0.00   0.00
AKT1      0.0  10.71  10.71   0.00  10.71    0.00  60.71   7.14   0.00
AKT2      0.0   9.09   0.00   0.00   0.00    0.00  72.73   0.00  18.18
...       ...    ...    ...    ...    ...     ...    ...    ...    ...
WHSC1L1   0.0   0.00   0.00   0.00   0.00  100.00   0.00   0.00   0.00
XPO1      0.0  50.00   0.00   0.00   0.00   50.00   0.00   0.00   0.00
XRCC2   100.0   0.00   0.00   0.00   0.00    0.00   0.00   0.00   0.00
YAP1      0.0  75.00   0.00   0.00   0.00    0.00  25.00   0.00   0.00
All      17.1  13.61   2.68  20.66   7.29    8.28  28.70   0.57   1.11

[265 rows x 9 columns]

Class                    1       2     3       4     5     6      7       8     9
Variation
1_2009trunc          100.0    0.00  0.00    0.00  0.00  0.00    0.0    0.00  0.00
2010_2471trunc         0.0  100.00  0.00    0.00  0.00  0.00    0.0    0.00  0.00
256_286trunc           0.0    0.00  0.00    0.00  0.00  0.00  100.0    0.00  0.00
3' Deletion            0.0    0.00  0.00  100.00  0.00  0.00    0.0    0.00  0.00
385_418del             0.0    0.00  0.00  100.00  0.00  0.00    0.0    0.00  0.00
...                    ...     ...   ...     ...   ...   ...    ...     ...   ...
YAP1-MAMLD1 Fusion     0.0  100.00  0.00    0.00  0.00  0.00    0.0    0.00  0.00
ZC3H7B-BCOR Fusion     0.0    0.00  0.00    0.00  0.00  0.00    0.0  100.00  0.00
ZNF198-FGFR1 Fusion    0.0    0.00  0.00    0.00  0.00  0.00  100.0    0.00  0.00
p61BRAF                0.0    0.00  0.00    0.00  0.00  0.00  100.0    0.00  0.00
All                   17.1   13.61  2.68   20.66  7.29  8.28   28.7    0.57  1.11

[2997 rows x 9 columns]
def qui2(data, col, target, alpha = 0.05):
    # Chi-square test of independence between each categorical column and the target
    for c in col:
        cross = pd.crosstab(data[c], data[target])
        chi2, p, dof, exp = stats.chi2_contingency(cross)
        print("Chi-square between variable", target, "and categorical variable", c, ": {:0.4}".format(chi2))
        print("With a p-value of: {:0.4}".format(p))
        if p < alpha:
            print('Variable', c, 'has a direct relationship with variable', target, end = '\n\n')
        else:
            print('Variable', c, 'does not have a direct relationship with variable', target, end = '\n\n')
The chi-square test indicates, with a high confidence level, that both qualitative variables are related to the target variable.
qui2(train_data, qualitativas, 'Class')
Chi-square between variable Class and categorical variable Gene : 1.004e+04
With a p-value of: 0.0
Variable Gene has a direct relationship with variable Class

Chi-square between variable Class and categorical variable Variation : 2.559e+04
With a p-value of: 1.856e-13
Variable Variation has a direct relationship with variable Class
Since we have no quantitative features, the exploratory analysis ends here. We conclude that both qualitative features have many unique values, and that a large share of the classes is concentrated in a few 'Genes' and 'Variations'.
Both qualitative variables are associated with the target variable, with a high confidence level.
The 'Text' column has NA values; to handle them we will replace each missing text with 'Gene' + 'Variation'.
# Checking the NA values
train_data[train_data.isnull().any(axis = 1)]
| | ID | Gene | Variation | Class | Text |
|---|---|---|---|---|---|
1109 | 1109 | FANCA | S1088F | 1 | NaN |
1277 | 1277 | ARID5B | Truncating Mutations | 1 | NaN |
1407 | 1407 | FGFR3 | K508M | 6 | NaN |
1639 | 1639 | FLT1 | Amplification | 6 | NaN |
2755 | 2755 | BRAF | G596C | 7 | NaN |
train_data.loc[train_data['Text'].isnull(), 'Text'] = train_data['Gene'] + train_data['Variation']
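A quick sanity check (an added step, not in the original flow) to confirm that the fill above removed all missing values:

# 'Text' should no longer contain missing values after the fill
print(train_data['Text'].isnull().sum())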
def limpar_texto(text):
    # Converting to str
    text = str(text)
    # Removing non-ASCII characters
    text = ''.join(caracter for caracter in text if ord(caracter) < 128)
    # Converting to lower case
    text = text.lower()
    # Removing punctuation with a regular expression
    regex = re.compile('[' + re.escape(string.punctuation) + '\\r\\t\\n]')
    text = regex.sub(' ', str(text))
    # Loading English stopwords
    english_stops = set(stopwords.words('english'))
    # Removing English stopwords
    # Keeping only words that are not considered stopwords
    text = ' '.join(palavra for palavra in text.split() if palavra not in english_stops)
    # Creating the WordNet-based structure for lemmatization
    wordnet_lemmatizer = WordNetLemmatizer()
    # Applying lemmatization
    text = ' '.join(wordnet_lemmatizer.lemmatize(palavra) for palavra in text.split())
    return text
def carap_data(path, data = [], column = 'x'):
    # If a processed file already exists, load it; otherwise clean the text column and cache the result
    if isfile(path):
        print('Loading dataset...')
        data = pd.read_csv(path, sep = ',')
    else:
        print('Processing text...')
        data[column] = data[column].map(limpar_texto)
        print('Saving data...')
        data.to_csv(path, sep = ',')
    return data
We will run a series of operations to clean the text and make it more general for the algorithms: removal of non-ASCII characters, lower-casing, removal of punctuation, removal of English stopwords and, finally, lemmatization (a short code illustration follows below).
Lemmatization reduces each inflected word to its lemma, for example:
Lemma: organize
Inflected form: organized
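For a concrete illustration of what the lemmatizer does, and how it differs from the PorterStemmer that is also imported but not used in the cleaning function, here is a small hypothetical example.

# Illustration: lemmatization keeps real dictionary words, while stemming may truncate them
lemmatizer_exemplo = WordNetLemmatizer()
stemmer_exemplo = PorterStemmer()

print(lemmatizer_exemplo.lemmatize('mutations'))   # -> mutation
print(lemmatizer_exemplo.lemmatize('kinases'))     # -> kinase
print(stemmer_exemplo.stem('mutations'))           # -> mutat
print(stemmer_exemplo.stem('kinases'))             # -> kinas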
%%time
train_data_processado = carap_data('data/treino_processado.csv', train_data, 'Text')
Loading dataset...
Wall time: 1.15 s
Below is a comparison between the text of the first row before processing and after processing.
train_data['Text'][0]
"Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells. The precise mechanisms by which CDK10 modulates ETS2 activity, and more generally the functions of CDK10, remain elusive. Here we demonstrate that CDK10 is a cyclin-dependent kinase by identifying cyclin M as an activating cyclin. Cyclin M, an orphan cyclin, is the product of FAM58A, whose mutations cause STAR syndrome, a human developmental anomaly whose features include toe syndactyly, telecanthus, and anogenital and renal malformations. We show that STAR syndrome-associated cyclin M mutants are unable to interact with CDK10. Cyclin M silencing phenocopies CDK10 silencing in increasing c-Raf and in conferring tamoxifen resistance to breast cancer cells. CDK10/cyclin M phosphorylates ETS2 in vitro, and in cells it positively controls ETS2 degradation by the proteasome. ETS2 protein levels are increased in cells derived from a STAR patient, and this increase is attributable to decreased cyclin M levels. Altogether, our results reveal an additional regulatory mechanism for ETS2, which plays key roles in cancer and development. They also shed light on the molecular mechanisms underlying STAR syndrome.Cyclin-dependent kinases (CDKs) play a pivotal role in the control of a number of fundamental cellular processes (1). The human genome contains 21 genes encoding proteins that can be considered as members of the CDK family owing to their sequence similarity with bona fide CDKs, those known to be activated by cyclins (2). Although discovered almost 20 y ago (3, 4), CDK10 remains one of the two CDKs without an identified cyclin partner. This knowledge gap has largely impeded the exploration of its biological functions. CDK10 can act as a positive cell cycle regulator in some cells (5, 6) or as a tumor suppressor in others (7, 8). CDK10 interacts with the ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2) transcription factor and inhibits its transcriptional activity through an unknown mechanism (9). CDK10 knockdown derepresses ETS2, which increases the expression of the c-Raf protein kinase, activates the MAPK pathway, and induces resistance of MCF7 cells to tamoxifen (6).Here, we deorphanize CDK10 by identifying cyclin M, the product of FAM58A, as a binding partner. Mutations in this gene that predict absence or truncation of cyclin M are associated with STAR syndrome, whose features include toe syndactyly, telecanthus, and anogenital and renal malformations in heterozygous females (10). However, both the functions of cyclin M and the pathogenesis of STAR syndrome remain unknown. We show that a recombinant CDK10/cyclin M heterodimer is an active protein kinase that phosphorylates ETS2 in vitro. Cyclin M silencing phenocopies CDK10 silencing in increasing c-Raf and phospho-ERK expression levels and in inducing tamoxifen resistance in estrogen receptor (ER)+ breast cancer cells. We show that CDK10/cyclin M positively controls ETS2 degradation by the proteasome, through the phosphorylation of two neighboring serines. 
Finally, we detect an increased ETS2 expression level in cells derived from a STAR patient, and we demonstrate that it is attributable to the decreased cyclin M expression level observed in these cells.Previous SectionNext SectionResultsA yeast two-hybrid (Y2H) screen unveiled an interaction signal between CDK10 and a mouse protein whose C-terminal half presents a strong sequence homology with the human FAM58A gene product [whose proposed name is cyclin M (11)]. We thus performed Y2H mating assays to determine whether human CDK10 interacts with human cyclin M (Fig. 1 A–C). The longest CDK10 isoform (P1) expressed as a bait protein produced a strong interaction phenotype with full-length cyclin M (expressed as a prey protein) but no detectable phenotype with cyclin D1, p21 (CIP1), and Cdi1 (KAP), which are known binding partners of other CDKs (Fig. 1B). CDK1 and CDK3 also produced Y2H signals with cyclin M, albeit notably weaker than that observed with CDK10 (Fig. 1B). An interaction phenotype was also observed between full-length cyclin M and CDK10 proteins expressed as bait and prey, respectively (Fig. S1A). We then tested different isoforms of CDK10 and cyclin M originating from alternative gene splicing, and two truncated cyclin M proteins corresponding to the hypothetical products of two mutated FAM58A genes found in STAR syndrome patients (10). None of these shorter isoforms produced interaction phenotypes (Fig. 1 A and C and Fig. S1A).Fig. 1.In a new window Download PPTFig. 1.CDK10 and cyclin M form an interaction complex. (A) Schematic representation of the different protein isoforms analyzed by Y2H assays. Amino acid numbers are indicated. Black boxes indicate internal deletions. The red box indicates a differing amino acid sequence compared with CDK10 P1. (B) Y2H assay between a set of CDK proteins expressed as baits (in fusion to the LexA DNA binding domain) and CDK interacting proteins expressed as preys (in fusion to the B42 transcriptional activator). pEG202 and pJG4-5 are the empty bait and prey plasmids expressing LexA and B42, respectively. lacZ was used as a reporter gene, and blue yeast are indicative of a Y2H interaction phenotype. (C) Y2H assay between the different CDK10 and cyclin M isoforms. The amino-terminal region of ETS2, known to interact with CDK10 (9), was also assayed. (D) Western blot analysis of Myc-CDK10 (wt or kd) and CycM-V5-6His expression levels in transfected HEK293 cells. (E) Western blot analysis of Myc-CDK10 (wt or kd) immunoprecipitates obtained using the anti-Myc antibody. “Inputs” correspond to 10 μg total lysates obtained from HEK293 cells coexpressing Myc-CDK10 (wt or kd) and CycM-V5-6His. (F) Western blot analysis of immunoprecipitates obtained using the anti-CDK10 antibody or a control goat antibody, from human breast cancer MCF7 cells. “Input” corresponds to 30 μg MCF7 total cell lysates. The lower band of the doublet observed on the upper panel comigrates with the exogenously expressed untagged CDK10 and thus corresponds to endogenous CDK10. The upper band of the doublet corresponds to a nonspecific signal, as demonstrated by it insensitivity to either overexpression of CDK10 (as seen on the left lane) or silencing of CDK10 (Fig. S2B). Another experiment with a longer gel migration is shown in Fig. S1D.Next we examined the ability of CDK10 and cyclin M to interact when expressed in human cells (Fig. 1 D and E). 
We tested wild-type CDK10 (wt) and a kinase dead (kd) mutant bearing a D181A amino acid substitution that abolishes ATP binding (12). We expressed cyclin M-V5-6His and/or Myc-CDK10 (wt or kd) in a human embryonic kidney cell line (HEK293). The expression level of cyclin M-V5-6His was significantly increased upon coexpression with Myc-CDK10 (wt or kd) and, to a lesser extent, that of Myc-CDK10 (wt or kd) was increased upon coexpression with cyclin M-V5-6His (Fig. 1D). We then immunoprecipitated Myc-CDK10 proteins and detected the presence of cyclin M in the CDK10 (wt) and (kd) immunoprecipitates only when these proteins were coexpressed pair-wise (Fig. 1E). We confirmed these observations by detecting the presence of Myc-CDK10 in cyclin M-V5-6His immunoprecipitates (Fig. S1B). These experiments confirmed the lack of robust interaction between the CDK10.P2 isoform and cyclin M (Fig. S1C). To detect the interaction between endogenous proteins, we performed immunoprecipitations on nontransfected MCF7 cells derived from a human breast cancer. CDK10 and cyclin M antibodies detected their cognate endogenous proteins by Western blotting. We readily detected cyclin M in immunoprecipitates obtained with the CDK10 antibody but not with a control antibody (Fig. 1F). These results confirm the physical interaction between CDK10 and cyclin M in human cells.To unveil a hypothesized CDK10/cyclin M protein kinase activity, we produced GST-CDK10 and StrepII-cyclin M fusion proteins in insect cells, either individually or in combination. We observed that GST-CDK10 and StrepII-cyclin M copurified, thus confirming their interaction in yet another cellular model (Fig. 2A). We then performed in vitro kinase assays with purified proteins, using histone H1 as a generic substrate. Histone H1 phosphorylation was detected only from lysates of cells coexpressing GST-CDK10 and StrepII-cyclin M. No phosphorylation was detected when GST-CDK10 or StrepII-cyclin M were expressed alone, or when StrepII-cyclin M was coexpressed with GST-CDK10(kd) (Fig. 2A). Next we investigated whether ETS2, which is known to interact with CDK10 (9) (Fig. 1C), is a phosphorylation substrate of CDK10/cyclin M. We detected strong phosphorylation of ETS2 by the GST-CDK10/StrepII-cyclin M purified heterodimer, whereas no phosphorylation was detected using GST-CDK10 alone or GST-CDK10(kd)/StrepII-cyclin M heterodimer (Fig. 2B).Fig. 2.In a new window Download PPTFig. 2.CDK10 is a cyclin M-dependent protein kinase. (A) In vitro protein kinase assay on histone H1. Lysates from insect cells expressing different proteins were purified on a glutathione Sepharose matrix to capture GST-CDK10(wt or kd) fusion proteins alone, or in complex with STR-CycM fusion protein. Purified protein expression levels were analyzed by Western blots (Top and Upper Middle). The kinase activity was determined by autoradiography of histone H1, whose added amounts were visualized by Coomassie staining (Lower Middle and Bottom). (B) Same as in A, using purified recombinant 6His-ETS2 as a substrate.CDK10 silencing has been shown to increase ETS2-driven c-RAF transcription and to activate the MAPK pathway (6). We investigated whether cyclin M is also involved in this regulatory pathway. To aim at a highly specific silencing, we used siRNA pools (mix of four different siRNAs) at low final concentration (10 nM). Both CDK10 and cyclin M siRNA pools silenced the expression of their cognate targets (Fig. 3 A and C and Fig. 
S2) and, interestingly, the cyclin M siRNA pool also caused a marked decrease in CDK10 protein level (Fig. 3A and Fig. S2B). These results, and those shown in Fig. 1D, suggest that cyclin M binding stabilizes CDK10. Cyclin M silencing induced an increase in c-Raf protein and mRNA levels (Fig. 3 B and C) and in phosphorylated ERK1 and ERK2 protein levels (Fig. S3B), similarly to CDK10 silencing. As expected from these effects (6), CDK10 and cyclin M silencing both decreased the sensitivity of ER+ MCF7 cells to tamoxifen, to a similar extent. The combined silencing of both genes did not result in a higher resistance to the drug (Fig. S3C). Altogether, these observations demonstrate a functional interaction between cyclin M and CDK10, which negatively controls ETS2.Fig. 3.In a new window Download PPTFig. 3.Cyclin M silencing up-regulates c-Raf expression. (A) Western blot analysis of endogenous CDK10 and cyclin M expression levels in MCF7 cells, in response to siRNA-mediated gene silencing. (B) Western blot analysis of endogenous c-Raf expression levels in MCF7 cells, in response to CDK10 or cyclin M silencing. A quantification is shown in Fig. S3A. (C) Quantitative RT-PCR analysis of CDK10, cyclin M, and c-Raf mRNA levels, in response to CDK10 (Upper) or cyclin M (Lower) silencing. **P ≤ 0.01; ***P ≤ 0.001.We then wished to explore the mechanism by which CDK10/cyclin M controls ETS2. ETS2 is a short-lived protein degraded by the proteasome (13). A straightforward hypothesis is that CDK10/cyclin M positively controls ETS2 degradation. We thus examined the impact of CDK10 or cyclin M silencing on ETS2 expression levels. The silencing of CDK10 and that of cyclin M caused an increase in the expression levels of an exogenously expressed Flag-ETS2 protein (Fig. S4A), as well as of the endogenous ETS2 protein (Fig. 4A). This increase is not attributable to increased ETS2 mRNA levels, which marginally fluctuated in response to CDK10 or cyclin M silencing (Fig. S4B). We then examined the expression levels of the Flag-tagged ETS2 protein when expressed alone or in combination with Myc-CDK10 or -CDK10(kd), with or without cyclin M-V5-6His. Flag-ETS2 was readily detected when expressed alone or, to a lesser extent, when coexpressed with CDK10(kd). However, its expression level was dramatically decreased when coexpressed with CDK10 alone, or with CDK10 and cyclin M (Fig. 4B). These observations suggest that endogenous cyclin M levels are in excess compared with those of CDK10 in MCF7 cells, and they show that the major decrease in ETS2 levels observed upon CDK10 coexpression involves CDK10 kinase activity. Treatment of cells coexpressing Flag-ETS2, CDK10, and cyclin M with the proteasome inhibitor MG132 largely rescued Flag-ETS2 expression levels (Fig. 4B).Fig. 4.In a new window Download PPTFig. 4.CDK10/cyclin M controls ETS2 stability in human cancer derived cells. (A) Western blot analysis of endogenous ETS2 expression levels in MCF7 cells, in response to siRNA-mediated CDK10 and/or cyclin M silencing. A quantification is shown in Fig. S4B. (B) Western blot analysis of exogenously expressed Flag-ETS2 protein levels in MCF7 cells cotransfected with empty vectors or coexpressing Myc-CDK10 (wt or kd), or Myc-CDK10/CycM-V5-6His. The latter cells were treated for 16 h with the MG132 proteasome inhibitor. Proper expression of CDK10 and cyclin M tagged proteins was verified by Western blot analysis. 
(C and D) Western blot analysis of expression levels of exogenously expressed Flag-ETS2 wild-type or mutant proteins in MCF7 cells, in the absence of (C) or in response to (D) Myc-CDK10/CycM-V5-6His expression. Quantifications are shown in Fig. S4 C and D.A mass spectrometry analysis of recombinant ETS2 phosphorylated by CDK10/cyclin M in vitro revealed the existence of multiple phosphorylated residues, among which are two neighboring phospho-serines (at positions 220 and 225) that may form a phosphodegron (14) (Figs. S5–S8). To confirm this finding, we compared the phosphorylation level of recombinant ETS2wt with that of ETS2SASA protein, a mutant bearing alanine substitutions of these two serines. As expected from the existence of multiple phosphorylation sites, we detected a small but reproducible, significant decrease of phosphorylation level of ETS2SASA compared with ETS2wt (Fig. S9), thus confirming that Ser220/Ser225 are phosphorylated by CDK10/cyclin M. To establish a direct link between ETS2 phosphorylation by CDK10/cyclin M and degradation, we examined the expression levels of Flag-ETS2SASA. In the absence of CDK10/cyclin M coexpression, it did not differ significantly from that of Flag-ETS2. This is contrary to that of Flag-ETS2DBM, bearing a deletion of the N-terminal destruction (D-) box that was previously shown to be involved in APC-Cdh1–mediated degradation of ETS2 (13) (Fig. 4C). However, contrary to Flag-ETS2 wild type, the expression level of Flag-ETS2SASA remained insensitive to CDK10/cyclin M coexpression (Fig. 4D). Altogether, these results suggest that CDK10/cyclin M directly controls ETS2 degradation through the phosphorylation of these two serines.Finally, we studied a lymphoblastoid cell line derived from a patient with STAR syndrome, bearing FAM58A mutation c.555+1G>A, predicted to result in aberrant splicing (10). In accordance with incomplete skewing of X chromosome inactivation previously found in this patient, we detected a decreased expression level of cyclin M protein in the STAR cell line, compared with a control lymphoblastoid cell line. In line with our preceding observations, we detected an increased expression level of ETS2 protein in the STAR cell line compared with the control (Fig. 5A and Fig. S10A). We then examined by quantitative RT-PCR the mRNA expression levels of the corresponding genes. The STAR cell line showed a decreased expression level of cyclin M mRNA but an expression level of ETS2 mRNA similar to that of the control cell line (Fig. 5B). To demonstrate that the increase in ETS2 protein expression is indeed a result of the decreased cyclin M expression observed in the STAR patient-derived cell line, we expressed cyclin M-V5-6His in this cell line. This expression caused a decrease in ETS2 protein levels (Fig. 5C).Fig. 5.In a new window Download PPTFig. 5.Decreased cyclin M expression in STAR patient-derived cells results in increased ETS2 protein level. (A) Western blot analysis of cyclin M and ETS2 protein levels in a STAR patient-derived lymphoblastoid cell line and in a control lymphoblastoid cell line, derived from a healthy individual. A quantification is shown in Fig. S10A. (B) Quantitative RT-PCR analysis of cyclin M and ETS2 mRNA levels in the same cells. ***P ≤ 0.001. (C) Western blot analysis of ETS2 protein levels in the STAR patient-derived lymphoblastoid cell line transfected with an empty vector or a vector directing the expression of cyclin M-V5-6His. 
Another Western blot revealing endogenously and exogenously expressed cyclin M levels is shown in Fig. S10B. A quantification of ETS2 protein levels is shown in Fig. S10C.Previous SectionNext SectionDiscussionIn this work, we unveil the interaction between CDK10, the last orphan CDK discovered in the pregenomic era (2), and cyclin M, the only cyclin associated with a human genetic disease so far, and whose functions remain unknown (10). The closest paralogs of CDK10 within the CDK family are the CDK11 proteins, which interact with L-type cyclins (15). Interestingly, the closest paralog of these cyclins within the cyclin family is cyclin M (Fig. S11). The fact that none of the shorter CDK10 isoforms interact robustly with cyclin M suggests that alternative splicing of the CDK10 gene (16, 17) plays an important role in regulating CDK10 functions.The functional relevance of the interaction between CDK10 and cyclin M is supported by different observations. Both proteins seem to enhance each other’s stability, as judged from their increased expression levels when their partner is exogenously coexpressed (Fig. 1D) and from the much reduced endogenous CDK10 expression level observed in response to cyclin M silencing (Fig. 3A and Fig. S2B). CDK10 is subject to ubiquitin-mediated degradation (18). Our observations suggest that cyclin M protects CDK10 from such degradation and that it is the only cyclin partner of CDK10, at least in MCF7 cells. They also suggest that cyclin M stability is enhanced upon binding to CDK10, independently from its kinase activity, as seen for cyclin C and CDK8 (19). We uncover a cyclin M-dependent CDK10 protein kinase activity in vitro, thus demonstrating that this protein, which was named a CDK on the sole basis of its amino acid sequence, is indeed a genuine cyclin-dependent kinase. Our Y2H assays reveal that truncated cyclin M proteins corresponding to the hypothetical products of two STAR syndrome-associated FAM58A mutations do not produce an interaction phenotype with CDK10. Hence, regardless of whether these mutated mRNAs undergo nonsense-mediated decay (as suggested from the decreased cyclin M mRNA levels in STAR cells, shown in Fig. 5B) or give rise to truncated cyclin M proteins, females affected by the STAR syndrome must exhibit compromised CDK10/cyclin M kinase activity at least in some tissues and during specific developmental stages.We show that ETS2, a known interactor of CDK10, is a phosphorylation substrate of CDK10/cyclin M in vitro and that CDK10/cyclin M kinase activity positively controls ETS2 degradation by the proteasome. This control seems to be exerted through a very fine mechanism, as judged from the sensitivity of ETS2 levels to partially decreased CDK10 and cyclin M levels, achieved in MCF7 cells and observed in STAR cells, respectively. These findings offer a straightforward explanation for the already reported up-regulation of ETS2-driven transcription of c-RAF in response to CDK10 silencing (6). We bring evidence that CDK10/cyclin M directly controls ETS2 degradation through the phosphorylation of two neighboring serines, which may form a noncanonical β-TRCP phosphodegron (DSMCPAS) (14). Because none of these two serines precede a proline, they do not conform to usual CDK phosphorylation sites. However, multiple so-called transcriptional CDKs (CDK7, -8, -9, and -11) (to which CDK10 may belong; Fig. 
S11) have been shown to phosphorylate a variety of motifs in a non–proline-directed fashion, especially in the context of molecular docking with the substrate (20). Here, it can be hypothesized that the high-affinity interaction between CDK10 and the Pointed domain of ETS2 (6, 9) (Fig. 1C) would allow docking-mediated phosphorylation of atypical sites. The control of ETS2 degradation involves a number of players, including APC-Cdh1 (13) and the cullin-RING ligase CRL4 (21). The formal identification of the ubiquitin ligase involved in the CDK10/cyclin M pathway and the elucidation of its concerted action with the other ubiquitin ligases to regulate ETS2 degradation will require further studies.Our results present a number of significant biological and medical implications. First, they shed light on the regulation of ETS2, which plays an important role in development (22) and is frequently deregulated in many cancers (23). Second, our results contribute to the understanding of the molecular mechanisms causing tamoxifen resistance associated with reduced CDK10 expression levels, and they suggest that, like CDK10 (6), cyclin M could also be a predictive clinical marker of hormone therapy response of ERα-positive breast cancer patients. Third, our findings offer an interesting hypothesis on the molecular mechanisms underlying STAR syndrome. Ets2 transgenic mice showing a less than twofold overexpression of Ets2 present severe cranial abnormalities (24), and those observed in STAR patients could thus be caused at least in part by increased ETS2 protein levels. Another expected consequence of enhanced ETS2 expression levels would be a decreased risk to develop certain types of cancers and an increased risk to develop others. Studies on various mouse models (including models of Down syndrome, in which three copies of ETS2 exist) have revealed that ETS2 dosage can repress or promote tumor growth and, hence, that ETS2 exerts noncell autonomous functions in cancer (25). Intringuingly, one of the very few STAR patients identified so far has been diagnosed with a nephroblastoma (26). Finally, our findings will facilitate the general exploration of the biological functions of CDK10 and, in particular, its role in the control of cell division. Previous studies have suggested either a positive role in cell cycle control (5, 6) or a tumor-suppressive activity in some cancers (7, 8). The severe growth retardation exhibited by STAR patients strongly suggests that CDK10/cyclin M plays an important role in the control of cell proliferation.Previous SectionNext SectionMaterials and MethodsCloning of CDK10 and cyclin M cDNAs, plasmid constructions, tamoxifen response analysis, quantitative RT-PCR, mass spectrometry experiments, and antibody production are detailed in SI Materials and Methods.Yeast Two-Hybrid Interaction Assays. We performed yeast interaction mating assays as previously described (27).Mammalian Cell Cultures and Transfections. We grew human HEK293 and MCF7 cells in DMEM supplemented with 10% (vol/vol) FBS (Invitrogen), and we grew lymphoblastoid cells in RPMI 1640 GlutaMAX supplemented with 15% (vol/vol) FBS. We transfected HEK293 and MCF7 cells using Lipofectamine 2000 (Invitrogen) for plasmids, Lipofectamine RNAiMAX (Invitrogen) for siRNAs, and Jetprime (Polyplus) for plasmids/siRNAs combinations according to the manufacturers’ instructions. We transfected lymphoblastoid cells by electroporation (Neon, Invitrogen). 
For ETS2 stability studies we treated MCF7 cells 32 h after transfection with 10 μM MG132 (Fisher Scientific) for 16 h.Coimmunoprecipitation and Western Blot Experiments. We collected cells by scraping in PBS (or centrifugation for lymphoblastoid cells) and lysed them by sonication in a lysis buffer containing 60 mM β-glycerophosphate, 15 mM p-nitrophenylphosphate, 25 mM 3-(N-morpholino)propanesulfonic acid (Mops) (pH 7.2), 15 mM EGTA, 15 mM MgCl2, 1 mM Na vanadate, 1 mM NaF, 1mM phenylphosphate, 0.1% Nonidet P-40, and a protease inhibitor mixture (Roche). We spun the lysates 15 min at 20,000 × g at 4 °C, collected the supernatants, and determined the protein content using a Bradford assay. We performed the immunoprecipitation experiments on 500 μg of total proteins, in lysis buffer. We precleared the lysates with 20 μL of protein A or G-agarose beads, incubated 1 h 4 °C on a rotating wheel. We added 5 μg of antibody to the supernatants, incubated 1 h 4 °C on a rotating wheel, added 20 μL of protein A or G-agarose beads, and incubated 1 h 4 °C on a rotating wheel. We collected the beads by centrifugation 30 s at 18,000 × g at 4 °C and washed three times in a bead buffer containing 50 mM Tris (pH 7.4), 5 mM NaF, 250 mM NaCl, 5 mM EDTA, 5 mM EGTA, 0.1% Nonidet P-40, and a protease inhibitor coktail (Roche). We directly added sample buffer to the washed pellets, heat-denatured the proteins, and ran the samples on 10% Bis-Tris SDS/PAGE. We transferred the proteins onto Hybond nitrocellulose membranes and processed the blots according to standard procedures. For Western blot experiments, we used the following primary antibodies: anti-Myc (Abcam ab9106, 1:2,000), anti-V5 (Invitrogen R960, 1:5,000), anti-tubulin (Santa Cruz Biotechnology B-7, 1:500), anti-CDK10 (Covalab pab0847p, 1:500 or Santa Cruz Biotechnology C-19, 1:500), anti-CycM (home-made, dilution 1:500 or Covalab pab0882-P, dilution 1:500), anti-Raf1 (Santa Cruz Biotechnology C-20, 1:1,000), anti-ETS2 (Santa Cruz Biotechnology C-20, 1:1,000), anti-Flag (Sigma F7425, 1:1,000), and anti-actin (Sigma A5060, 1:5,000). We used HRP-coupled anti-goat (Santa Cruz Biotechnology SC-2033, dilution 1:2,000), anti-mouse (Bio-Rad 170–6516, dilution 1:3,000) or anti-rabbit (Bio-Rad 172–1019, 1:5,000) as secondary antibodies. We revealed the blots by enhanced chemiluminescence (SuperSignal West Femto, Thermo Scientific).Production and Purification of Recombinant Proteins.GST-CDK10(kd)/StrepII-CycM. We generated recombinant bacmids in DH10Bac Escherichia coli and baculoviruses in Sf9 cells using the Bac-to-Bac system, as described by the provider (Invitrogen). We infected Sf9 cells with GST-CDK10- (or GST-CDK10kd)-producing viruses, or coinfected the cells with StrepII-CycM–producing viruses, and we collected the cells 72 h after infection. To purify GST-fusion proteins, we spun 250 mL cells and resuspended the pellet in 40 mL lysis buffer (PBS, 250 mM NaCl, 0.5% Nonidet P-40, 50 mM NaF, 10 mM β-glycerophosphate, and 0.3 mM Na-vanadate) containing a protease inhibitor mixture (Roche). We lysed the cells by sonication, spun the lysate 30 min at 15,000 × g, collected the soluble fraction, and added it to a 1-mL glutathione-Sepharose matrix. We incubated 1 h at 4 °C, washed four times with lysis buffer, one time with kinase buffer A (see below), and finally resuspended the beads in 100 μL kinase buffer A containing 10% (vol/vol) glycerol for storage.6His-ETS2. We transformed Origami2 DE3 (Novagen) with the 6His-ETS2 expression vector. 
We induced expression with 0.2 mM isopropyl-β-d-1-thiogalactopyranoside for 3 h at 22 °C. To purify 6His-ETS2, we spun 50 mL cells and resuspended the pellet in 2 mL lysis buffer (PBS, 300 mM NaCl, 10 mM Imidazole, 1 mM DTT, and 0.1% Nonidet P-40) containing a protease inhibitor mixture without EDTA (Roche). We lysed the cells at 1.6 bar using a cell disruptor and spun the lysate 10 min at 20,000 × g. We collected the soluble fraction and added it to 200 μL Cobalt beads (Thermo Scientific). After 1 h incubation at 4 °C on a rotating wheel, we washed four times with lysis buffer. To elute, we incubated beads 30 min with elution buffer (PBS, 250 mM imidazole, pH 7.6) containing the protease inhibitor mixture, spun 30 s at 10,000 × g, and collected the eluted protein.Protein Kinase Assays. We mixed glutathione-Sepharose beads (harboring GST-CDK10 wt or kd, either monomeric or complexed with StrepII-CycM), 22.7 μM BSA, 15 mM DTT, 100 μM ATP, 5 μCi ATP[γ-32P], 7.75 μM histone H1, or 1 μM 6His-ETS2 and added kinase buffer A (25 mM Tris·HCl, 10 mM MgCl2, 1 mM EGTA, 1 mM DTT, and 3.7 μM heparin, pH 7.5) up to a total volume of 30 μL. We incubated the reactions 30 min at 30 °C, added Laemli sample buffer, heat-denatured the samples, and ran 10% Bis-Tris SDS/PAGE. We cut gel slices to detect GST-CDK10 and StrepII-CycM by Western blotting. We stained the gel slices containing the substrate with Coomassie (R-250, Bio-Rad), dried them, and detected the incorporated radioactivity by autoradiography. We identified four unrelated girls with anogenital and renal malformations, dysmorphic facial features, normal intellect and syndactyly of toes. A similar combination of features had been reported previously in a mother–daughter pair1 (Table 1 and Supplementary Note online). These authors noted clinical overlap with Townes-Brocks syndrome but suggested that the phenotype represented a separate autosomal dominant entity (MIM601446). Here we define the cardinal features of this syndrome as a characteristic facial appearance with apparent telecanthus and broad tripartite nasal tip, variable syndactyly of toes 2–5, hypoplastic labia, anal atresia and urogenital malformations (Fig. 1a–h). We also observed a variety of other features (Table 1). Figure 1: Clinical and molecular characterization of STAR syndrome. Figure 1 : Clinical and molecular characterization of STAR syndrome. (a–f) Facial appearances of cases 1–3 (apparent telecanthus, dysplastic ears and thin upper lips; a,c,e), and toe syndactyly 2–5, 3–5 or 4–5 (b,d,f) in these cases illustrate recognizable features of STAR syndrome (specific parental consent has been obtained for publication of these photographs). Anal atresia and hypoplastic labia are not shown. (g,h) X-ray films of the feet of case 2 showing only four rays on the left and delta-shaped 4th and 5th metatarsals on the right (h; compare to clinical picture in d). (i) Array-CGH data. Log2 ratio represents copy number loss of six probes spanning between 37.9 and 50.7 kb, with one probe positioned within FAM58A. The deletion does not remove parts of other functional genes. (j) Schematic structure of FAM58A and position of the mutations. FAM58A has five coding exons (boxes). The cyclin domain (green) is encoded by exons 2–4. The horizontal arrow indicates the deletion extending 5' in case 1, which includes exons 1 and 2, whereas the horizontal line below exon 5 indicates the deletion found in case 3, which removes exon 5 and some 3' sequence. 
The pink horizontal bars above the boxes indicate the amplicons used for qPCR and sequencing (one alternative exon 5 amplicon is not indicated because of space constraints). The mutation 201dupT (case 4) results in an immediate stop codon, and the 555+1G>A and 555-1G>A splice mutations in cases 2, 5 and 6 are predicted to be deleterious because they alter the conserved splice donor and acceptor site of intron 4, respectively. Full size image (97 KB) Table 1: Clinical features in STAR syndrome cases Table 1 - Clinical features in STAR syndrome cases Full table On the basis of the phenotypic overlap with Townes-Brocks, Okihiro and Feingold syndromes, we analyzed SALL1 (ref. 2), SALL4 (ref. 3) and MYCN4 but found no mutations in any of these genes (Supplementary Methods online). Next, we carried out genome-wide high-resolution oligonucleotide array comparative genomic hybridization (CGH)5 analysis (Supplementary Methods) of genomic DNA from the most severely affected individual (case 1, with lower lid coloboma, epilepsy and syringomyelia) and identified a heterozygous deletion of 37.9–50.7 kb on Xq28, which removed exons 1 and 2 of FAM58A (Fig. 1i,j). Using real-time PCR, we confirmed the deletion in the child and excluded it in her unaffected parents (Supplementary Fig. 1a online, Supplementary Methods and Supplementary Table 1 online). Through CGH with a customized oligonucleotide array enriched in probes for Xq28, followed by breakpoint cloning, we defined the exact deletion size as 40,068 bp (g.152,514,164_152,554,231del(chromosome X, NCBI Build 36.2); Fig. 1j and Supplementary Figs. 2,3 online). The deletion removes the coding regions of exons 1 and 2 as well as intron 1 (2,774 bp), 492 bp of intron 2, and 36,608 bp of 5' sequence, including the 5' UTR and the entire KRT18P48 pseudogene (NCBI gene ID 340598). Paternity was proven using routine methods. We did not find deletions overlapping FAM58A in the available copy number variation (CNV) databases. Subsequently, we carried out qPCR analysis of the three other affected individuals (cases 2, 3 and 4) and the mother-daughter pair from the literature (cases 5 and 6). In case 3, we detected a de novo heterozygous deletion of 1.1–10.3 kb overlapping exon 5 (Supplementary Fig. 1b online). Using Xq28-targeted array CGH and breakpoint cloning, we identified a deletion of 4,249 bp (g.152,504,123_152,508,371del(chromosome X, NCBI Build 36.2); Fig. 1j and Supplementary Figs. 2,3), which removed 1,265 bp of intron 4, all of exon 5, including the 3' UTR, and 2,454 bp of 3' sequence. We found heterozygous FAM58A point mutations in the remaining cases (Fig. 1j, Supplementary Fig. 2, Supplementary Methods and Supplementary Table 1). In case 2, we identified the mutation 555+1G>A, affecting the splice donor site of intron 4. In case 4, we identified the frameshift mutation 201dupT, which immediately results in a premature stop codon N68XfsX1. In cases 5 and 6, we detected the mutation 556-1G>A, which alters the splice acceptor site of intron 4. We validated the point mutations and deletions by independent rounds of PCR and sequencing or by qPCR. We confirmed paternity and de novo status of the point mutations and deletions in all sporadic cases. None of the mutations were seen in the DNA of 60 unaffected female controls, and no larger deletions involving FAM58A were found in 93 unrelated array-CGH investigations. By analyzing X-chromosome inactivation (Supplementary Methods and Supplementary Fig. 
4 online), we found complete skewing of X inactivation in cases 1 and 3–6 and almost complete skewing in case 2, suggesting that cells carrying the mutation on the active X chromosome have a growth disadvantage during fetal development. Using RT-PCR on RNA from lymphoblastoid cells of case 2 (Supplementary Fig. 2), we did not find any aberrant splice products as additional evidence that the mutated allele is inactivated. Furthermore, FAM58A is subjected to X inactivation6. In cases 1 and 3, the parental origin of the deletions could not be determined, as a result of lack of informative SNPs. Case 5, the mother of case 6, gave birth to two boys, both clinically unaffected (samples not available). We cannot exclude that the condition is lethal in males. No fetal losses were reported from any of the families. The function of FAM58A is unknown. The gene consists of five coding exons, and the 642-bp coding region encodes a protein of 214 amino acids. GenBank lists a mRNA length of 1,257 bp for the reference sequence (NM_152274.2). Expression of the gene (by EST data) was found in 27 of 48 adult tissues including kidney, colon, cervix and uterus, but not heart (NCBI expression viewer, UniGene Hs.496943). Expression was also noted in 24 of 26 listed tumor tissues as well as in embryo and fetus. Genes homologous to FAM58A (NCBI HomoloGene: 13362) are found on the X chromosome in the chimpanzee and the dog. The zebrafish has a similar gene on chromosome 23. However, in the mouse and rat, there are no true homologs. These species have similar but intronless genes on chromosomes 11 (mouse) and 10 (rat), most likely arising from a retrotransposon insertion event. On the murine X chromosome, the flanking genes Atp2b3 and Dusp9 are conserved, but only remnants of the FAM58A sequence can be detected. FAM58A contains a cyclin-box-fold domain, a protein-binding domain found in cyclins with a role in cell cycle and transcription control. No human phenotype resulting from a cyclin gene mutation has yet been reported. Homozygous knockout mice for Ccnd1 (encoding cyclin D1) are viable but small and have reduced lifespan. They also have dystrophic changes of the retina, likely as a result of decreased cell proliferation and degeneration of photoreceptor cells during embryogenesis7, 8. Cyclin D1 colocalizes with SALL4 in the nucleus, and both proteins cooperatively mediate transcriptional repression9. As the phenotype of our cases overlaps considerably with that of Townes-Brocks syndrome caused by SALL1 mutations1, we carried out co-immunoprecipitation to find out if SALL1 or SALL4 would interact with FAM58A in a manner similar to that observed for SALL4 and cyclin D1. We found that FAM58A interacts with SALL1 but not with SALL4 (Supplementary Fig. 5 online), supporting the hypothesis that FAM58A and SALL1 participate in the same developmental pathway. How do FAM58A mutations lead to STAR syndrome? Growth retardation (all cases; Table 1) and retinal abnormalities (three cases) are reminiscent of the reduced body size and retinal anomalies in cyclin D1 knockout mice7, 8. Therefore, a proliferation defect might be partly responsible for STAR syndrome. To address this question, we carried out a knockdown of FAM58A mRNA followed by a proliferation assay. Transfection of HEK293 cells with three different FAM58A-specific RNAi oligonucleotides resulted in a significant reduction of both FAM58A mRNA expression and proliferation of transfected cells (Supplementary Methods and Supplementary Fig. 
6 online), supporting the link between FAM58A and cell proliferation. We found that loss-of-function mutations of FAM58A result in a rather homogeneous clinical phenotype. The additional anomalies in case 1 are likely to result from an effect of the 40-kb deletion on expression of a neighboring gene, possibly ATP2B3 or DUSP9. However, we cannot exclude that the homogeneous phenotype results from an ascertainment bias and that FAM58A mutations, including missense changes, could result in a broader spectrum of malformations. The genes causing the overlapping phenotypes of STAR syndrome and Townes-Brocks syndrome seem to act in the same pathway. Of note, MYCN, a gene mutated in Feingold syndrome, is a direct regulator of cyclin D2 (refs. 10,11); thus, it is worth exploring whether the phenotypic similarities between Feingold and STAR syndrome might be explained by direct regulation of FAM58A by MYCN. FAM58A is located approximately 0.56 Mb centromeric to MECP2 on Xq28. Duplications overlapping both MECP2 and FAM58A have been described and are not associated with a clinical phenotype in females12, but no deletions overlapping both MECP2 and FAM58A have been observed to date13. Although other genes between FAM58A and MECP2 have been implicated in brain development, FAM58A and MECP2 are the only genes in this region known to result in X-linked dominant phenotypes; thus, deletion of both genes on the same allele might be lethal in both males and females."
train_data_processado['Text'][0]
'cyclin dependent kinase cdks regulate variety fundamental cellular process cdk10 stand one last orphan cdks activating cyclin identified kinase activity revealed previous work shown cdk10 silencing increase ets2 v ets erythroblastosis virus e26 oncogene homolog 2 driven activation mapk pathway confers tamoxifen resistance breast cancer cell precise mechanism cdk10 modulates ets2 activity generally function cdk10 remain elusive demonstrate cdk10 cyclin dependent kinase identifying cyclin activating cyclin cyclin orphan cyclin product fam58a whose mutation cause star syndrome human developmental anomaly whose feature include toe syndactyly telecanthus anogenital renal malformation show star syndrome associated cyclin mutant unable interact cdk10 cyclin silencing phenocopies cdk10 silencing increasing c raf conferring tamoxifen resistance breast cancer cell cdk10 cyclin phosphorylates ets2 vitro cell positively control ets2 degradation proteasome ets2 protein level increased cell derived star patient increase attributable decreased cyclin level altogether result reveal additional regulatory mechanism ets2 play key role cancer development also shed light molecular mechanism underlying star syndrome cyclin dependent kinase cdks play pivotal role control number fundamental cellular process 1 human genome contains 21 gene encoding protein considered member cdk family owing sequence similarity bona fide cdks known activated cyclins 2 although discovered almost 20 ago 3 4 cdk10 remains one two cdks without identified cyclin partner knowledge gap largely impeded exploration biological function cdk10 act positive cell cycle regulator cell 5 6 tumor suppressor others 7 8 cdk10 interacts ets2 v ets erythroblastosis virus e26 oncogene homolog 2 transcription factor inhibits transcriptional activity unknown mechanism 9 cdk10 knockdown derepresses ets2 increase expression c raf protein kinase activates mapk pathway induces resistance mcf7 cell tamoxifen 6 deorphanize cdk10 identifying cyclin product fam58a binding partner mutation gene predict absence truncation cyclin associated star syndrome whose feature include toe syndactyly telecanthus anogenital renal malformation heterozygous female 10 however function cyclin pathogenesis star syndrome remain unknown show recombinant cdk10 cyclin heterodimer active protein kinase phosphorylates ets2 vitro cyclin silencing phenocopies cdk10 silencing increasing c raf phospho erk expression level inducing tamoxifen resistance estrogen receptor er breast cancer cell show cdk10 cyclin positively control ets2 degradation proteasome phosphorylation two neighboring serine finally detect increased ets2 expression level cell derived star patient demonstrate attributable decreased cyclin expression level observed cell previous sectionnext sectionresultsa yeast two hybrid y2h screen unveiled interaction signal cdk10 mouse protein whose c terminal half present strong sequence homology human fam58a gene product whose proposed name cyclin 11 thus performed y2h mating assay determine whether human cdk10 interacts human cyclin fig 1 ac longest cdk10 isoform p1 expressed bait protein produced strong interaction phenotype full length cyclin expressed prey protein detectable phenotype cyclin d1 p21 cip1 cdi1 kap known binding partner cdks fig 1b cdk1 cdk3 also produced y2h signal cyclin albeit notably weaker observed cdk10 fig 1b interaction phenotype also observed full length cyclin cdk10 protein expressed bait prey respectively fig s1a tested different isoforms cdk10 cyclin 
originating alternative gene splicing two truncated cyclin protein corresponding hypothetical product two mutated fam58a gene found star syndrome patient 10 none shorter isoforms produced interaction phenotype fig 1 c fig s1a fig 1 new window download pptfig 1 cdk10 cyclin form interaction complex schematic representation different protein isoforms analyzed y2h assay amino acid number indicated black box indicate internal deletion red box indicates differing amino acid sequence compared cdk10 p1 b y2h assay set cdk protein expressed bait fusion lexa dna binding domain cdk interacting protein expressed prey fusion b42 transcriptional activator peg202 pjg4 5 empty bait prey plasmid expressing lexa b42 respectively lacz used reporter gene blue yeast indicative y2h interaction phenotype c y2h assay different cdk10 cyclin isoforms amino terminal region ets2 known interact cdk10 9 also assayed western blot analysis myc cdk10 wt kd cycm v5 6his expression level transfected hek293 cell e western blot analysis myc cdk10 wt kd immunoprecipitates obtained using anti myc antibody input correspond 10 g total lysates obtained hek293 cell coexpressing myc cdk10 wt kd cycm v5 6his f western blot analysis immunoprecipitates obtained using anti cdk10 antibody control goat antibody human breast cancer mcf7 cell input corresponds 30 g mcf7 total cell lysates lower band doublet observed upper panel comigrates exogenously expressed untagged cdk10 thus corresponds endogenous cdk10 upper band doublet corresponds nonspecific signal demonstrated insensitivity either overexpression cdk10 seen left lane silencing cdk10 fig s2b another experiment longer gel migration shown fig s1d next examined ability cdk10 cyclin interact expressed human cell fig 1 e tested wild type cdk10 wt kinase dead kd mutant bearing d181a amino acid substitution abolishes atp binding 12 expressed cyclin v5 6his myc cdk10 wt kd human embryonic kidney cell line hek293 expression level cyclin v5 6his significantly increased upon coexpression myc cdk10 wt kd lesser extent myc cdk10 wt kd increased upon coexpression cyclin v5 6his fig 1d immunoprecipitated myc cdk10 protein detected presence cyclin cdk10 wt kd immunoprecipitates protein coexpressed pair wise fig 1e confirmed observation detecting presence myc cdk10 cyclin v5 6his immunoprecipitates fig s1b experiment confirmed lack robust interaction cdk10 p2 isoform cyclin fig s1c detect interaction endogenous protein performed immunoprecipitations nontransfected mcf7 cell derived human breast cancer cdk10 cyclin antibody detected cognate endogenous protein western blotting readily detected cyclin immunoprecipitates obtained cdk10 antibody control antibody fig 1f result confirm physical interaction cdk10 cyclin human cell unveil hypothesized cdk10 cyclin protein kinase activity produced gst cdk10 strepii cyclin fusion protein insect cell either individually combination observed gst cdk10 strepii cyclin copurified thus confirming interaction yet another cellular model fig 2a performed vitro kinase assay purified protein using histone h1 generic substrate histone h1 phosphorylation detected lysates cell coexpressing gst cdk10 strepii cyclin phosphorylation detected gst cdk10 strepii cyclin expressed alone strepii cyclin coexpressed gst cdk10 kd fig 2a next investigated whether ets2 known interact cdk10 9 fig 1c phosphorylation substrate cdk10 cyclin detected strong phosphorylation ets2 gst cdk10 strepii cyclin purified heterodimer whereas phosphorylation detected using gst cdk10 alone gst cdk10 kd 
strepii cyclin heterodimer fig 2b fig 2 new window download pptfig 2 cdk10 cyclin dependent protein kinase vitro protein kinase assay histone h1 lysates insect cell expressing different protein purified glutathione sepharose matrix capture gst cdk10 wt kd fusion protein alone complex str cycm fusion protein purified protein expression level analyzed western blot top upper middle kinase activity determined autoradiography histone h1 whose added amount visualized coomassie staining lower middle bottom b using purified recombinant 6his ets2 substrate cdk10 silencing shown increase ets2 driven c raf transcription activate mapk pathway 6 investigated whether cyclin also involved regulatory pathway aim highly specific silencing used sirna pool mix four different sirnas low final concentration 10 nm cdk10 cyclin sirna pool silenced expression cognate target fig 3 c fig s2 interestingly cyclin sirna pool also caused marked decrease cdk10 protein level fig 3a fig s2b result shown fig 1d suggest cyclin binding stabilizes cdk10 cyclin silencing induced increase c raf protein mrna level fig 3 b c phosphorylated erk1 erk2 protein level fig s3b similarly cdk10 silencing expected effect 6 cdk10 cyclin silencing decreased sensitivity er mcf7 cell tamoxifen similar extent combined silencing gene result higher resistance drug fig s3c altogether observation demonstrate functional interaction cyclin cdk10 negatively control ets2 fig 3 new window download pptfig 3 cyclin silencing regulates c raf expression western blot analysis endogenous cdk10 cyclin expression level mcf7 cell response sirna mediated gene silencing b western blot analysis endogenous c raf expression level mcf7 cell response cdk10 cyclin silencing quantification shown fig s3a c quantitative rt pcr analysis cdk10 cyclin c raf mrna level response cdk10 upper cyclin lower silencing p 0 01 p 0 001 wished explore mechanism cdk10 cyclin control ets2 ets2 short lived protein degraded proteasome 13 straightforward hypothesis cdk10 cyclin positively control ets2 degradation thus examined impact cdk10 cyclin silencing ets2 expression level silencing cdk10 cyclin caused increase expression level exogenously expressed flag ets2 protein fig s4a well endogenous ets2 protein fig 4a increase attributable increased ets2 mrna level marginally fluctuated response cdk10 cyclin silencing fig s4b examined expression level flag tagged ets2 protein expressed alone combination myc cdk10 cdk10 kd without cyclin v5 6his flag ets2 readily detected expressed alone lesser extent coexpressed cdk10 kd however expression level dramatically decreased coexpressed cdk10 alone cdk10 cyclin fig 4b observation suggest endogenous cyclin level excess compared cdk10 mcf7 cell show major decrease ets2 level observed upon cdk10 coexpression involves cdk10 kinase activity treatment cell coexpressing flag ets2 cdk10 cyclin proteasome inhibitor mg132 largely rescued flag ets2 expression level fig 4b fig 4 new window download pptfig 4 cdk10 cyclin control ets2 stability human cancer derived cell western blot analysis endogenous ets2 expression level mcf7 cell response sirna mediated cdk10 cyclin silencing quantification shown fig s4b b western blot analysis exogenously expressed flag ets2 protein level mcf7 cell cotransfected empty vector coexpressing myc cdk10 wt kd myc cdk10 cycm v5 6his latter cell treated 16 h mg132 proteasome inhibitor proper expression cdk10 cyclin tagged protein verified western blot analysis c western blot analysis expression level exogenously expressed flag ets2 
wild type mutant protein mcf7 cell absence c response myc cdk10 cycm v5 6his expression quantification shown fig s4 c mass spectrometry analysis recombinant ets2 phosphorylated cdk10 cyclin vitro revealed existence multiple phosphorylated residue among two neighboring phospho serine position 220 225 may form phosphodegron 14 fig s5s8 confirm finding compared phosphorylation level recombinant ets2wt ets2sasa protein mutant bearing alanine substitution two serine expected existence multiple phosphorylation site detected small reproducible significant decrease phosphorylation level ets2sasa compared ets2wt fig s9 thus confirming ser220 ser225 phosphorylated cdk10 cyclin establish direct link ets2 phosphorylation cdk10 cyclin degradation examined expression level flag ets2sasa absence cdk10 cyclin coexpression differ significantly flag ets2 contrary flag ets2dbm bearing deletion n terminal destruction box previously shown involved apc cdh1mediated degradation ets2 13 fig 4c however contrary flag ets2 wild type expression level flag ets2sasa remained insensitive cdk10 cyclin coexpression fig 4d altogether result suggest cdk10 cyclin directly control ets2 degradation phosphorylation two serine finally studied lymphoblastoid cell line derived patient star syndrome bearing fam58a mutation c 555 1g predicted result aberrant splicing 10 accordance incomplete skewing x chromosome inactivation previously found patient detected decreased expression level cyclin protein star cell line compared control lymphoblastoid cell line line preceding observation detected increased expression level ets2 protein star cell line compared control fig 5a fig s10a examined quantitative rt pcr mrna expression level corresponding gene star cell line showed decreased expression level cyclin mrna expression level ets2 mrna similar control cell line fig 5b demonstrate increase ets2 protein expression indeed result decreased cyclin expression observed star patient derived cell line expressed cyclin v5 6his cell line expression caused decrease ets2 protein level fig 5c fig 5 new window download pptfig 5 decreased cyclin expression star patient derived cell result increased ets2 protein level western blot analysis cyclin ets2 protein level star patient derived lymphoblastoid cell line control lymphoblastoid cell line derived healthy individual quantification shown fig s10a b quantitative rt pcr analysis cyclin ets2 mrna level cell p 0 001 c western blot analysis ets2 protein level star patient derived lymphoblastoid cell line transfected empty vector vector directing expression cyclin v5 6his another western blot revealing endogenously exogenously expressed cyclin level shown fig s10b quantification ets2 protein level shown fig s10c previous sectionnext sectiondiscussionin work unveil interaction cdk10 last orphan cdk discovered pregenomic era 2 cyclin cyclin associated human genetic disease far whose function remain unknown 10 closest paralogs cdk10 within cdk family cdk11 protein interact l type cyclins 15 interestingly closest paralog cyclins within cyclin family cyclin fig s11 fact none shorter cdk10 isoforms interact robustly cyclin suggests alternative splicing cdk10 gene 16 17 play important role regulating cdk10 function functional relevance interaction cdk10 cyclin supported different observation protein seem enhance others stability judged increased expression level partner exogenously coexpressed fig 1d much reduced endogenous cdk10 expression level observed response cyclin silencing fig 3a fig s2b cdk10 subject 
ubiquitin mediated degradation 18 observation suggest cyclin protects cdk10 degradation cyclin partner cdk10 least mcf7 cell also suggest cyclin stability enhanced upon binding cdk10 independently kinase activity seen cyclin c cdk8 19 uncover cyclin dependent cdk10 protein kinase activity vitro thus demonstrating protein named cdk sole basis amino acid sequence indeed genuine cyclin dependent kinase y2h assay reveal truncated cyclin protein corresponding hypothetical product two star syndrome associated fam58a mutation produce interaction phenotype cdk10 hence regardless whether mutated mrna undergo nonsense mediated decay suggested decreased cyclin mrna level star cell shown fig 5b give rise truncated cyclin protein female affected star syndrome must exhibit compromised cdk10 cyclin kinase activity least tissue specific developmental stage show ets2 known interactor cdk10 phosphorylation substrate cdk10 cyclin vitro cdk10 cyclin kinase activity positively control ets2 degradation proteasome control seems exerted fine mechanism judged sensitivity ets2 level partially decreased cdk10 cyclin level achieved mcf7 cell observed star cell respectively finding offer straightforward explanation already reported regulation ets2 driven transcription c raf response cdk10 silencing 6 bring evidence cdk10 cyclin directly control ets2 degradation phosphorylation two neighboring serine may form noncanonical trcp phosphodegron dsmcpas 14 none two serine precede proline conform usual cdk phosphorylation site however multiple called transcriptional cdks cdk7 8 9 11 cdk10 may belong fig s11 shown phosphorylate variety motif nonproline directed fashion especially context molecular docking substrate 20 hypothesized high affinity interaction cdk10 pointed domain ets2 6 9 fig 1c would allow docking mediated phosphorylation atypical site control ets2 degradation involves number player including apc cdh1 13 cullin ring ligase crl4 21 formal identification ubiquitin ligase involved cdk10 cyclin pathway elucidation concerted action ubiquitin ligases regulate ets2 degradation require study result present number significant biological medical implication first shed light regulation ets2 play important role development 22 frequently deregulated many cancer 23 second result contribute understanding molecular mechanism causing tamoxifen resistance associated reduced cdk10 expression level suggest like cdk10 6 cyclin could also predictive clinical marker hormone therapy response er positive breast cancer patient third finding offer interesting hypothesis molecular mechanism underlying star syndrome ets2 transgenic mouse showing le twofold overexpression ets2 present severe cranial abnormality 24 observed star patient could thus caused least part increased ets2 protein level another expected consequence enhanced ets2 expression level would decreased risk develop certain type cancer increased risk develop others study various mouse model including model syndrome three copy ets2 exist revealed ets2 dosage repress promote tumor growth hence ets2 exerts noncell autonomous function cancer 25 intringuingly one star patient identified far diagnosed nephroblastoma 26 finally finding facilitate general exploration biological function cdk10 particular role control cell division previous study suggested either positive role cell cycle control 5 6 tumor suppressive activity cancer 7 8 severe growth retardation exhibited star patient strongly suggests cdk10 cyclin play important role control cell proliferation previous sectionnext 
sectionmaterials methodscloning cdk10 cyclin cdna plasmid construction tamoxifen response analysis quantitative rt pcr mass spectrometry experiment antibody production detailed si material method yeast two hybrid interaction assay performed yeast interaction mating assay previously described 27 mammalian cell culture transfections grew human hek293 mcf7 cell dmem supplemented 10 vol vol fbs invitrogen grew lymphoblastoid cell rpmi 1640 glutamax supplemented 15 vol vol fbs transfected hek293 mcf7 cell using lipofectamine 2000 invitrogen plasmid lipofectamine rnaimax invitrogen sirnas jetprime polyplus plasmid sirnas combination according manufacturer instruction transfected lymphoblastoid cell electroporation neon invitrogen ets2 stability study treated mcf7 cell 32 h transfection 10 mg132 fisher scientific 16 h coimmunoprecipitation western blot experiment collected cell scraping pb centrifugation lymphoblastoid cell lysed sonication lysis buffer containing 60 mm glycerophosphate 15 mm p nitrophenylphosphate 25 mm 3 n morpholino propanesulfonic acid mop ph 7 2 15 mm egta 15 mm mgcl2 1 mm na vanadate 1 mm naf 1mm phenylphosphate 0 1 nonidet p 40 protease inhibitor mixture roche spun lysates 15 min 20 000 g 4 c collected supernatant determined protein content using bradford assay performed immunoprecipitation experiment 500 g total protein lysis buffer precleared lysates 20 l protein g agarose bead incubated 1 h 4 c rotating wheel added 5 g antibody supernatant incubated 1 h 4 c rotating wheel added 20 l protein g agarose bead incubated 1 h 4 c rotating wheel collected bead centrifugation 30 18 000 g 4 c washed three time bead buffer containing 50 mm tris ph 7 4 5 mm naf 250 mm nacl 5 mm edta 5 mm egta 0 1 nonidet p 40 protease inhibitor coktail roche directly added sample buffer washed pellet heat denatured protein ran sample 10 bi tris sd page transferred protein onto hybond nitrocellulose membrane processed blot according standard procedure western blot experiment used following primary antibody anti myc abcam ab9106 1 2 000 anti v5 invitrogen r960 1 5 000 anti tubulin santa cruz biotechnology b 7 1 500 anti cdk10 covalab pab0847p 1 500 santa cruz biotechnology c 19 1 500 anti cycm home made dilution 1 500 covalab pab0882 p dilution 1 500 anti raf1 santa cruz biotechnology c 20 1 1 000 anti ets2 santa cruz biotechnology c 20 1 1 000 anti flag sigma f7425 1 1 000 anti actin sigma a5060 1 5 000 used hrp coupled anti goat santa cruz biotechnology sc 2033 dilution 1 2 000 anti mouse bio rad 1706516 dilution 1 3 000 anti rabbit bio rad 1721019 1 5 000 secondary antibody revealed blot enhanced chemiluminescence supersignal west femto thermo scientific production purification recombinant protein gst cdk10 kd strepii cycm generated recombinant bacmids dh10bac escherichia coli baculoviruses sf9 cell using bac bac system described provider invitrogen infected sf9 cell gst cdk10 gst cdk10kd producing virus coinfected cell strepii cycmproducing virus collected cell 72 h infection purify gst fusion protein spun 250 ml cell resuspended pellet 40 ml lysis buffer pb 250 mm nacl 0 5 nonidet p 40 50 mm naf 10 mm glycerophosphate 0 3 mm na vanadate containing protease inhibitor mixture roche lysed cell sonication spun lysate 30 min 15 000 g collected soluble fraction added 1 ml glutathione sepharose matrix incubated 1 h 4 c washed four time lysis buffer one time kinase buffer see finally resuspended bead 100 l kinase buffer containing 10 vol vol glycerol storage 6his ets2 transformed origami2 de3 novagen 
6his ets2 expression vector induced expression 0 2 mm isopropyl 1 thiogalactopyranoside 3 h 22 c purify 6his ets2 spun 50 ml cell resuspended pellet 2 ml lysis buffer pb 300 mm nacl 10 mm imidazole 1 mm dtt 0 1 nonidet p 40 containing protease inhibitor mixture without edta roche lysed cell 1 6 bar using cell disruptor spun lysate 10 min 20 000 g collected soluble fraction added 200 l cobalt bead thermo scientific 1 h incubation 4 c rotating wheel washed four time lysis buffer elute incubated bead 30 min elution buffer pb 250 mm imidazole ph 7 6 containing protease inhibitor mixture spun 30 10 000 g collected eluted protein protein kinase assay mixed glutathione sepharose bead harboring gst cdk10 wt kd either monomeric complexed strepii cycm 22 7 bsa 15 mm dtt 100 atp 5 ci atp 32p 7 75 histone h1 1 6his ets2 added kinase buffer 25 mm trishcl 10 mm mgcl2 1 mm egta 1 mm dtt 3 7 heparin ph 7 5 total volume 30 l incubated reaction 30 min 30 c added laemli sample buffer heat denatured sample ran 10 bi tris sd page cut gel slice detect gst cdk10 strepii cycm western blotting stained gel slice containing substrate coomassie r 250 bio rad dried detected incorporated radioactivity autoradiography identified four unrelated girl anogenital renal malformation dysmorphic facial feature normal intellect syndactyly toe similar combination feature reported previously motherdaughter pair1 table 1 supplementary note online author noted clinical overlap townes brocks syndrome suggested phenotype represented separate autosomal dominant entity mim601446 define cardinal feature syndrome characteristic facial appearance apparent telecanthus broad tripartite nasal tip variable syndactyly toe 25 hypoplastic labium anal atresia urogenital malformation fig 1ah also observed variety feature table 1 figure 1 clinical molecular characterization star syndrome figure 1 clinical molecular characterization star syndrome af facial appearance case 13 apparent telecanthus dysplastic ear thin upper lip c e toe syndactyly 25 35 45 b f case illustrate recognizable feature star syndrome specific parental consent obtained publication photograph anal atresia hypoplastic labium shown g h x ray film foot case 2 showing four ray left delta shaped 4th 5th metatarsal right h compare clinical picture array cgh data log2 ratio represents copy number loss six probe spanning 37 9 50 7 kb one probe positioned within fam58a deletion remove part functional gene j schematic structure fam58a position mutation fam58a five coding exon box cyclin domain green encoded exon 24 horizontal arrow indicates deletion extending 5 case 1 includes exon 1 2 whereas horizontal line exon 5 indicates deletion found case 3 remove exon 5 3 sequence pink horizontal bar box indicate amplicons used qpcr sequencing one alternative exon 5 amplicon indicated space constraint mutation 201dupt case 4 result immediate stop codon 555 1g 555 1g splice mutation case 2 5 6 predicted deleterious alter conserved splice donor acceptor site intron 4 respectively full size image 97 kb table 1 clinical feature star syndrome case table 1 clinical feature star syndrome case full table basis phenotypic overlap townes brocks okihiro feingold syndrome analyzed sall1 ref 2 sall4 ref 3 mycn4 found mutation gene supplementary method online next carried genome wide high resolution oligonucleotide array comparative genomic hybridization cgh 5 analysis supplementary method genomic dna severely affected individual case 1 lower lid coloboma epilepsy syringomyelia identified heterozygous deletion 
37 950 7 kb xq28 removed exon 1 2 fam58a fig 1i j using real time pcr confirmed deletion child excluded unaffected parent supplementary fig 1a online supplementary method supplementary table 1 online cgh customized oligonucleotide array enriched probe xq28 followed breakpoint cloning defined exact deletion size 40 068 bp g 152 514 164 152 554 231del chromosome x ncbi build 36 2 fig 1j supplementary fig 2 3 online deletion remove coding region exon 1 2 well intron 1 2 774 bp 492 bp intron 2 36 608 bp 5 sequence including 5 utr entire krt18p48 pseudogene ncbi gene id 340598 paternity proven using routine method find deletion overlapping fam58a available copy number variation cnv database subsequently carried qpcr analysis three affected individual case 2 3 4 mother daughter pair literature case 5 6 case 3 detected de novo heterozygous deletion 1 110 3 kb overlapping exon 5 supplementary fig 1b online using xq28 targeted array cgh breakpoint cloning identified deletion 4 249 bp g 152 504 123 152 508 371del chromosome x ncbi build 36 2 fig 1j supplementary fig 2 3 removed 1 265 bp intron 4 exon 5 including 3 utr 2 454 bp 3 sequence found heterozygous fam58a point mutation remaining case fig 1j supplementary fig 2 supplementary method supplementary table 1 case 2 identified mutation 555 1g affecting splice donor site intron 4 case 4 identified frameshift mutation 201dupt immediately result premature stop codon n68xfsx1 case 5 6 detected mutation 556 1g alters splice acceptor site intron 4 validated point mutation deletion independent round pcr sequencing qpcr confirmed paternity de novo status point mutation deletion sporadic case none mutation seen dna 60 unaffected female control larger deletion involving fam58a found 93 unrelated array cgh investigation analyzing x chromosome inactivation supplementary method supplementary fig 4 online found complete skewing x inactivation case 1 36 almost complete skewing case 2 suggesting cell carrying mutation active x chromosome growth disadvantage fetal development using rt pcr rna lymphoblastoid cell case 2 supplementary fig 2 find aberrant splice product additional evidence mutated allele inactivated furthermore fam58a subjected x inactivation6 case 1 3 parental origin deletion could determined result lack informative snp case 5 mother case 6 gave birth two boy clinically unaffected sample available cannot exclude condition lethal male fetal loss reported family function fam58a unknown gene consists five coding exon 642 bp coding region encodes protein 214 amino acid genbank list mrna length 1 257 bp reference sequence nm 152274 2 expression gene est data found 27 48 adult tissue including kidney colon cervix uterus heart ncbi expression viewer unigene h 496943 expression also noted 24 26 listed tumor tissue well embryo fetus gene homologous fam58a ncbi homologene 13362 found x chromosome chimpanzee dog zebrafish similar gene chromosome 23 however mouse rat true homologs specie similar intronless gene chromosome 11 mouse 10 rat likely arising retrotransposon insertion event murine x chromosome flanking gene atp2b3 dusp9 conserved remnant fam58a sequence detected fam58a contains cyclin box fold domain protein binding domain found cyclins role cell cycle transcription control human phenotype resulting cyclin gene mutation yet reported homozygous knockout mouse ccnd1 encoding cyclin d1 viable small reduced lifespan also dystrophic change retina likely result decreased cell proliferation degeneration photoreceptor cell embryogenesis7 8 cyclin d1 
colocalizes sall4 nucleus protein cooperatively mediate transcriptional repression9 phenotype case overlap considerably townes brocks syndrome caused sall1 mutations1 carried co immunoprecipitation find sall1 sall4 would interact fam58a manner similar observed sall4 cyclin d1 found fam58a interacts sall1 sall4 supplementary fig 5 online supporting hypothesis fam58a sall1 participate developmental pathway fam58a mutation lead star syndrome growth retardation case table 1 retinal abnormality three case reminiscent reduced body size retinal anomaly cyclin d1 knockout mice7 8 therefore proliferation defect might partly responsible star syndrome address question carried knockdown fam58a mrna followed proliferation assay transfection hek293 cell three different fam58a specific rnai oligonucleotides resulted significant reduction fam58a mrna expression proliferation transfected cell supplementary method supplementary fig 6 online supporting link fam58a cell proliferation found loss function mutation fam58a result rather homogeneous clinical phenotype additional anomaly case 1 likely result effect 40 kb deletion expression neighboring gene possibly atp2b3 dusp9 however cannot exclude homogeneous phenotype result ascertainment bias fam58a mutation including missense change could result broader spectrum malformation gene causing overlapping phenotype star syndrome townes brocks syndrome seem act pathway note mycn gene mutated feingold syndrome direct regulator cyclin d2 ref 10 11 thus worth exploring whether phenotypic similarity feingold star syndrome might explained direct regulation fam58a mycn fam58a located approximately 0 56 mb centromeric mecp2 xq28 duplication overlapping mecp2 fam58a described associated clinical phenotype females12 deletion overlapping mecp2 fam58a observed date13 although gene fam58a mecp2 implicated brain development fam58a mecp2 gene region known result x linked dominant phenotype thus deletion gene allele might lethal male female'
# Contar palavras unicas
palavras_unicas = set()
train_data_processado['Text'].str.lower().str.split().apply(palavras_unicas.update)
print(len(palavras_unicas))
160553
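Before setting a minimum document frequency, it can help to see which tokens dominate the corpus and how long the rare-word tail is. This is a small optional sketch, not part of the original pipeline, assuming the same `train_data_processado` dataframe used above.
from collections import Counter
# Count every token across the processed texts
token_counts = Counter()
train_data_processado['Text'].str.lower().str.split().apply(token_counts.update)
# The 20 most frequent tokens; the long tail of rare tokens is what min_df will drop later
print(token_counts.most_common(20))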
We will compute the TF-IDF of the words to capture how often each term occurs in each document relative to the whole corpus. We will require a minimum document frequency (min_df) for a term to be kept, so that words with very few occurrences are discarded, and we will allow unigrams, bigrams, and trigrams for better contextualization. Finally, we will cap the TF-IDF matrix at 100,000 features, keeping only the highest-impact terms.
def gerar_TFIDF(path, data = [], max_features = 1000, ngram_range = (1, 1), min_df = 3):
if isfile(path):
print('Carregando matriz TFIDF...')
tfidf = np.load(path, allow_pickle = False)
else:
print('Gerando matriz TFIDF...')
TFIDF = TfidfVectorizer(min_df = min_df, ngram_range = ngram_range, max_features = max_features)
tfidf = TFIDF.fit_transform(data).toarray()
np.save(path, tfidf, allow_pickle = False)
print('Matriz TFIDF carregada')
return tfidf
tfidf_train = gerar_TFIDF('data/tfidf_treino.npy', train_data_processado['Text'].values, 100000, (1, 3), 3)
Carregando matriz TFIDF... Matriz TFIDF carregada
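One caveat of caching only the transformed matrix is that the fitted TfidfVectorizer itself is discarded, so new documents (for example a held-out test file) could not be projected onto the same vocabulary later. Below is a minimal sketch of how the vectorizer could also be persisted with pickle, reusing the imports loaded at the top of the notebook; the function name and the vectorizer path are illustrative, not part of the original pipeline.
def gerar_TFIDF_persistindo_vetorizador(path_vectorizer, data, max_features = 100000, ngram_range = (1, 3), min_df = 3):
    # Reuse the fitted vectorizer from a previous run if it exists
    if isfile(path_vectorizer):
        with open(path_vectorizer, 'rb') as f:
            vectorizer = pickle.load(f)
        matrix = vectorizer.transform(data).toarray()
    else:
        vectorizer = TfidfVectorizer(min_df = min_df, ngram_range = ngram_range, max_features = max_features)
        matrix = vectorizer.fit_transform(data).toarray()
        with open(path_vectorizer, 'wb') as f:
            pickle.dump(vectorizer, f)
    return vectorizer, matrix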
tfidf_train_dt = pd.DataFrame(tfidf_train, index = train_data_processado.index)
tfidf_train_dt.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 | 99996 | 99997 | 99998 | 99999 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.011535 | 0.0 | 0.0 | 0.0 |
5 rows × 100000 columns
We will use TruncatedSVD to reduce our data to a smaller set of components. Its advantage over PCA is that it handles sparse data well: unlike PCA, it does not center the data before the decomposition, so it can operate directly on a sparse matrix. This makes it a better fit for our TF-IDF matrix.
def gerar_componentes_TruncatedSVD(path, n_components, data, n_iter = 50, seed = 120):
if isfile(path):
print('Carregando componentes...')
svd_componentes = np.load(path, allow_pickle = False)
else:
print('Gerando componentes...')
svd = TruncatedSVD(n_components = n_components, n_iter = n_iter, random_state = seed)
svd_componentes = svd.fit_transform(data)
np.save(path, svd_componentes, allow_pickle = False)
print('Componentes carregados')
return svd_componentes
n_components = 1000
truncated_train = gerar_componentes_TruncatedSVD('data/matriz_esparsa_treino.npy', n_components,\
tfidf_train_dt, 50, seed_)
Carregando componentes... Componentes carregados
truncated_train_dt = pd.DataFrame(truncated_train)
truncated_train_dt.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.209456 | -0.082272 | -0.019928 | -0.072945 | 0.035063 | 0.015121 | -0.017422 | -0.006298 | -0.007785 | -0.018207 | ... | -0.000667 | -0.013015 | -0.007143 | -0.004988 | 0.002724 | 0.010163 | -0.007411 | -0.020190 | 0.001096 | 0.009950 |
1 | 0.230858 | -0.117900 | -0.055719 | 0.066605 | -0.050562 | 0.010161 | -0.069204 | 0.020518 | -0.002667 | 0.000851 | ... | 0.016981 | 0.009618 | 0.023742 | 0.015233 | -0.023636 | -0.002040 | 0.001046 | 0.020775 | 0.006279 | 0.020099 |
2 | 0.230858 | -0.117900 | -0.055719 | 0.066605 | -0.050562 | 0.010161 | -0.069204 | 0.020518 | -0.002667 | 0.000851 | ... | 0.016981 | 0.009618 | 0.023742 | 0.015233 | -0.023636 | -0.002040 | 0.001046 | 0.020775 | 0.006279 | 0.020099 |
3 | 0.201979 | -0.079566 | -0.032508 | 0.007221 | 0.001920 | 0.006085 | 0.026844 | 0.011929 | -0.014232 | 0.004337 | ... | -0.000602 | 0.000908 | 0.003219 | 0.003609 | -0.002592 | -0.001493 | -0.004328 | -0.000660 | -0.000304 | 0.003618 |
4 | 0.215967 | -0.065170 | -0.015736 | 0.002577 | -0.034887 | 0.008384 | -0.038474 | 0.023668 | -0.005299 | -0.008350 | ... | -0.002130 | 0.000202 | 0.000001 | 0.000555 | -0.000001 | -0.001294 | -0.000886 | -0.002356 | -0.000172 | -0.001923 |
5 rows × 1000 columns
# Calcular taxa de variancia por componente
# Executar só se o grafico da variancia não estiver exibindo no output
svd = TruncatedSVD(n_components = 1000, n_iter = 50, random_state = seed_)
svd_componentes = svd.fit_transform(tfidf_train_dt)
variancia = svd.explained_variance_ratio_
variancia_acumulada = np.cumsum(variancia * 100)
For our case, 1,000 components are enough for the model to cover the data, so we go from 100,000 features down to 1,000 components, greatly reducing dimensionality.
plt.ylabel('Variancia acumulada (%)')
plt.xlabel('Componentes')
plt.title('Variancia explicada acumulada (TruncatedSVD)')
plt.ylim(10, 100)
plt.xlim(0, n_components)
plt.plot(variancia_acumulada)
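To sanity-check that 1,000 components are enough, the cumulative variance curve computed above can also be queried directly. This is a minimal sketch assuming the `variancia_acumulada` array from the previous cell is available; the 80% threshold is only an illustrative value.
# Smallest number of components whose cumulative explained variance reaches a given threshold (in %)
threshold_pct = 80
if variancia_acumulada[-1] >= threshold_pct:
    n_needed = int(np.argmax(variancia_acumulada >= threshold_pct)) + 1
    print(n_needed, 'components explain at least', threshold_pct, '% of the variance')
else:
    print('All', len(variancia_acumulada), 'components together explain only', round(variancia_acumulada[-1], 2), '%')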
# Renomeando colunas
truncated_train_dt.columns = [('Componente ', i) for i in range(1, n_components + 1)]
We will use one-hot encoding to convert the categorical features 'Gene' and 'Variation' into numeric ones. Since these categories have no inherent order, one-hot encoding is preferred over label encoding.
# Carregando somente as colunas de 'Gene' e 'Variation'
train_data_one_hot = train_data_processado[['Gene', 'Variation']]
# Removendo as colunas de 'Gene' e 'Variation'
train_data_temp = train_data_processado.drop(['Gene', 'Variation'], axis = 1)
# Aplicando OneHotEncoder
onehot = OneHotEncoder(dtype = int)
train_data_one_hot = onehot.fit_transform(train_data_one_hot)
# Convertendo o resultado para dataframe
train_data_one_hot_dt = pd.DataFrame(train_data_one_hot.toarray())
# Realizando join nas colunas extras
train_data_one_hot_dt = train_data_temp.join(train_data_one_hot_dt)
train_data_one_hot_dt.head()
Unnamed: 0 | ID | Class | Text | 0 | 1 | 2 | 3 | 4 | 5 | ... | 3250 | 3251 | 3252 | 3253 | 3254 | 3255 | 3256 | 3257 | 3258 | 3259 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | cyclin dependent kinase cdks regulate variety ... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 2 | abstract background non small cell lung cancer... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2 | 2 | 2 | abstract background non small cell lung cancer... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 3 | 3 | 3 | recent evidence demonstrated acquired uniparen... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 4 | 4 | 4 | oncogenic mutation monomeric casitas b lineage... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 3264 columns
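One detail to keep in mind: the encoder above is fitted only on the training variants, so a gene or variation that appears only in new data would raise an error at transform time. The sketch below shows how the standard handle_unknown='ignore' option of OneHotEncoder maps unseen categories to an all-zeros block instead; the tiny frames are illustrative only, not part of the original pipeline.
# Fit on a couple of known gene/variation pairs, then transform a row with an unseen gene
known_pairs = pd.DataFrame({'Gene': ['CBL', 'FAM58A'], 'Variation': ['W802*', 'Truncating Mutations']})
unseen_pair = pd.DataFrame({'Gene': ['BRCA1'], 'Variation': ['W802*']})  # 'BRCA1' was not seen at fit time
demo_encoder = OneHotEncoder(dtype = int, handle_unknown = 'ignore')
demo_encoder.fit(known_pairs)
# The unseen gene becomes all zeros in the 'Gene' block instead of raising an error
print(demo_encoder.transform(unseen_pair).toarray())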
Now we will concatenate the TruncatedSVD components, generated from the TF-IDF matrix, into the final datatable.
# Removendo coluna Text pois o TFIDF irá sobrepor
train_data_final = train_data_one_hot_dt.drop('Text', axis = 1)
# Concatenando componentes com onehot
train_data_final = pd.concat([train_data_final, truncated_train_dt], axis = 1)
train_data_final.head()
Unnamed: 0 | ID | Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | (Componente , 991) | (Componente , 992) | (Componente , 993) | (Componente , 994) | (Componente , 995) | (Componente , 996) | (Componente , 997) | (Componente , 998) | (Componente , 999) | (Componente , 1000) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.000667 | -0.013015 | -0.007143 | -0.004988 | 0.002724 | 0.010163 | -0.007411 | -0.020190 | 0.001096 | 0.009950 |
1 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.016981 | 0.009618 | 0.023742 | 0.015233 | -0.023636 | -0.002040 | 0.001046 | 0.020775 | 0.006279 | 0.020099 |
2 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.016981 | 0.009618 | 0.023742 | 0.015233 | -0.023636 | -0.002040 | 0.001046 | 0.020775 | 0.006279 | 0.020099 |
3 | 3 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.000602 | 0.000908 | 0.003219 | 0.003609 | -0.002592 | -0.001493 | -0.004328 | -0.000660 | -0.000304 | 0.003618 |
4 | 4 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.002130 | 0.000202 | 0.000001 | 0.000555 | -0.000001 | -0.001294 | -0.000886 | -0.002356 | -0.000172 | -0.001923 |
5 rows × 4263 columns
train_data_final = train_data_final.drop('ID', axis = 1)
Before starting the training phase, it is worth saving the data to a '.csv' file so that future runs do not need to repeat the previous steps.
path = 'data/treino_final.csv'
if isfile(path):
print('Carregando dataset...')
train_data_final = pd.read_csv(path, sep = ',')
else:
print('Salvando dataset...')
train_data_final.to_csv(path, sep = ',')
Carregando dataset...
train_data_final = train_data_final.drop('Unnamed: 0', axis = 1)
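The 'Unnamed: 0' column dropped above is simply the dataframe index that to_csv writes by default; saving with index=False avoids creating it in the first place. A small optional sketch, using a separate illustrative path so it does not clash with the file already saved above.
# Alternative save that does not produce the 'Unnamed: 0' column on reload
path_no_index = 'data/treino_final_sem_indice.csv'
if not isfile(path_no_index):
    train_data_final.to_csv(path_no_index, sep = ',', index = False)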
train_data_final.head()
Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | ('Componente ', 991) | ('Componente ', 992) | ('Componente ', 993) | ('Componente ', 994) | ('Componente ', 995) | ('Componente ', 996) | ('Componente ', 997) | ('Componente ', 998) | ('Componente ', 999) | ('Componente ', 1000) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.000667 | -0.013015 | -0.007143 | -0.004988 | 0.002724 | 0.010163 | -0.007411 | -0.020190 | 0.001096 | 0.009950 |
1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.016981 | 0.009618 | 0.023742 | 0.015233 | -0.023636 | -0.002040 | 0.001046 | 0.020775 | 0.006279 | 0.020099 |
2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.016981 | 0.009618 | 0.023742 | 0.015233 | -0.023636 | -0.002040 | 0.001046 | 0.020775 | 0.006279 | 0.020099 |
3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.000602 | 0.000908 | 0.003219 | 0.003609 | -0.002592 | -0.001493 | -0.004328 | -0.000660 | -0.000304 | 0.003618 |
4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -0.002130 | 0.000202 | 0.000001 | 0.000555 | -0.000001 | -0.001294 | -0.000886 | -0.002356 | -0.000172 | -0.001923 |
5 rows × 4261 columns
We will split our data into training, test, and validation sets. The validation data is carved out of the training data and used for optimization (such as calibration) during training, while the test data is held out so that it does not influence the final result.
X = train_data_final.drop('Class', axis = 1)
y = train_data_final['Class'].values
# Separando o conjunto principal em treino e teste
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.3, random_state = seed_)
# Separando o conjunto de treino em treino e validação
X_train_, X_validacao, y_train_, y_validacao = train_test_split(X_train, y_train, stratify = y_train, test_size = 0.2, random_state = seed_)
# Balanceando o conjunto de treino original
oversample = SMOTE(random_state = seed_)
X_train_resample, y_train_resample = oversample.fit_resample(X_train, y_train)
# Balanceando o conjunto de treino que foi separado em validacao
oversample = SMOTE(random_state = seed_)
X_train_resample_, y_train_resample_ = oversample.fit_resample(X_train_, y_train_)
print('Observaçoes em treino:', X_train_.shape[0])
print('Observaçoes em treino balanceado:', X_train_resample.shape[0])
print('Observaçoes em teste:', X_test.shape[0])
print('Observaçoes em validação:', X_validacao.shape[0])
Observaçoes em treino: 1859 Observaçoes em treino balanceado: 6003 Observaçoes em teste: 997 Observaçoes em validação: 465
# Convertendo dataframe para matriz esparsa
X_test_original = X_test.copy()
X_train = sparse.csr_matrix(X_train.values)
X_train_ = sparse.csr_matrix(X_train_.values)
X_train_resample = sparse.csr_matrix(X_train_resample.values)
X_train_resample_ = sparse.csr_matrix(X_train_resample_.values)
X_test = sparse.csr_matrix(X_test.values)
X_validacao = sparse.csr_matrix(X_validacao.values)
def distribuicao(data, colors = ['r', 'g', 'b', 'y', 'k'], verbose = False):
data2 = data.value_counts().sort_index()
data2.plot(kind = 'bar', color = colors, stacked = True)
plt.xlabel('Class')
plt.ylabel('Ocorrencias')
plt.show()
if verbose:
sorted_class = np.argsort(-data2.values)
for i in sorted_class:
print('Observações na classe', i + 1, ':',\
data2.values[i], '(',\
np.round((data2.values[i]/data.shape[0]*100), 3), '%)')
distribuicao(pd.DataFrame(y_test), verbose = True)
Observações na classe 7 : 286 ( 28.686 %) Observações na classe 4 : 206 ( 20.662 %) Observações na classe 1 : 170 ( 17.051 %) Observações na classe 2 : 136 ( 13.641 %) Observações na classe 6 : 82 ( 8.225 %) Observações na classe 5 : 73 ( 7.322 %) Observações na classe 3 : 27 ( 2.708 %) Observações na classe 9 : 11 ( 1.103 %) Observações na classe 8 : 6 ( 0.602 %)
distribuicao(pd.DataFrame(y_train), verbose = True)
Observações na classe 7 : 667 ( 28.701 %) Observações na classe 4 : 480 ( 20.654 %) Observações na classe 1 : 398 ( 17.126 %) Observações na classe 2 : 316 ( 13.597 %) Observações na classe 6 : 193 ( 8.305 %) Observações na classe 5 : 169 ( 7.272 %) Observações na classe 3 : 62 ( 2.668 %) Observações na classe 9 : 26 ( 1.119 %) Observações na classe 8 : 13 ( 0.559 %)
distribuicao(pd.DataFrame(y_train_resample), verbose = True)
Observações na classe 1 : 667 ( 11.111 %) Observações na classe 2 : 667 ( 11.111 %) Observações na classe 3 : 667 ( 11.111 %) Observações na classe 4 : 667 ( 11.111 %) Observações na classe 5 : 667 ( 11.111 %) Observações na classe 6 : 667 ( 11.111 %) Observações na classe 7 : 667 ( 11.111 %) Observações na classe 8 : 667 ( 11.111 %) Observações na classe 9 : 667 ( 11.111 %)
distribuicao(pd.DataFrame(y_validacao), verbose = True)
Observações na classe 7 : 133 ( 28.602 %) Observações na classe 4 : 96 ( 20.645 %) Observações na classe 1 : 80 ( 17.204 %) Observações na classe 2 : 63 ( 13.548 %) Observações na classe 6 : 39 ( 8.387 %) Observações na classe 5 : 34 ( 7.312 %) Observações na classe 3 : 12 ( 2.581 %) Observações na classe 9 : 5 ( 1.075 %) Observações na classe 8 : 3 ( 0.645 %)
First we will create baseline models to understand which configuration works best for our predictive models. The baselines will be trained with the following settings:
- Balancing of the training set = True or False
- Calibration = True or False
- Calibration set: training or validation
For the configurations above, we will use the following algorithms: Logistic Regression, Linear SVM, Random Forest, XGBoost, and KNN.
configuracoes = []
def executaModelo(modelo, treino, teste, validacao, calibration = False):
try:
# Treina o modelo
modelo.fit(treino[0], treino[1])
if calibration:
# Instancia a calibração
calibration = CalibratedClassifierCV(base_estimator = modelo, method = 'sigmoid', cv = 3)
# Aplica a calibração
calibration.fit(validacao[0], validacao[1])
# Realiza as previsões
pred = calibration.predict_proba(teste[0])
else:
# Realiza as previsões de acordo com o tipo do modelo probabilistico ou não
try:
pred = modelo.predict_proba(teste[0])
except:
pred = modelo.predict(teste[0])
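# Note: estimators without predict_proba (e.g., the hinge-loss SGDClassifier used for Linear SVM) return hard labels here, so log_loss below raises and the run is reported as 'Treino ignorado'.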
# Calcula a loss
loss = log_loss(teste[1], pred)
return loss
except:
print('Treino ignorado')
return -1
def executaModelos(modelo, data_treino_balanceado, data_treino_desbalanceado, data_validacao,\
data_treino_validacao_balanceado, data_treino_validacao_desbalanceado, data_teste, algoritmo):
global configuracoes
# Balanceamento em treino, calibracao, conjunto de calibracao em treino
print('Iniciando treino 1...')
loss = executaModelo(modelo, data_treino_balanceado, data_teste, data_treino_balanceado, True)
if loss != -1:
configuracoes.append({'Algoritmo': algoritmo, 'Balanceamento': True,\
'Calibracao': True, 'Conjunto Calibracao': 'Treino', 'Loss': loss})
# Balanceamento em treino, calibracao, conjunto de calibracao em validacao
print('Iniciando treino 2...')
loss = executaModelo(modelo, data_treino_validacao_balanceado, data_teste, data_validacao, True)
if loss != -1:
configuracoes.append({'Algoritmo': algoritmo, 'Balanceamento': True,\
'Calibracao': True, 'Conjunto Calibracao': 'Validacao', 'Loss': loss})
# Balanceamento em treino, sem calibracao
print('Iniciando treino 3...')
loss = executaModelo(modelo, data_treino_balanceado, data_teste, [], False)
if loss != -1:
configuracoes.append({'Algoritmo': algoritmo, 'Balanceamento': True,\
'Calibracao': False, 'Conjunto Calibracao': None, 'Loss': loss})
# Desbalanceamento em treino, calibracao, conjunto de calibracao em treino
print('Iniciando treino 4...')
loss = executaModelo(modelo, data_treino_desbalanceado, data_teste, data_treino_desbalanceado, True)
if loss != -1:
configuracoes.append({'Algoritmo': algoritmo, 'Balanceamento': False,\
'Calibracao': True, 'Conjunto Calibracao': 'Treino', 'Loss': loss})
# Desbalanceamento em treino, calibracao, conjunto de calibracao em validacao
print('Iniciando treino 5...')
loss = executaModelo(modelo, data_treino_validacao_desbalanceado, data_teste, data_validacao, True)
if loss != -1:
configuracoes.append({'Algoritmo': algoritmo, 'Balanceamento': False,\
'Calibracao': True, 'Conjunto Calibracao': 'Validacao', 'Loss': loss})
# Desbalanceamento em treino, sem calibracao
print('Iniciando treino 6...')
loss = executaModelo(modelo, data_treino_desbalanceado, data_teste, [], False)
if loss != -1:
configuracoes.append({'Algoritmo': algoritmo, 'Balanceamento': False,\
'Calibracao': False, 'Conjunto Calibracao': None, 'Loss': loss})
# Executando todas as configurações citadas para o algoritmo de Regressão Logistica
executaModelos(SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_),\
[X_train_resample, y_train_resample], [X_train, y_train], [X_validacao, y_validacao],\
[X_train_resample_, y_train_resample_], [X_train_, y_train_], [X_test, y_test], 'Regressão Logistica')
Iniciando treino 1... Iniciando treino 2... Iniciando treino 3... Iniciando treino 4... Iniciando treino 5... Iniciando treino 6...
# Executando todas as configurações citadas para o algoritmo de Linear SVM
executaModelos(SGDClassifier(loss = 'hinge', class_weight = 'balanced', random_state = seed_),\
[X_train_resample, y_train_resample], [X_train, y_train], [X_validacao, y_validacao],\
[X_train_resample_, y_train_resample_], [X_train_, y_train_], [X_test, y_test], 'Linear SVM')
Iniciando treino 1... Iniciando treino 2... Iniciando treino 3... Treino ignorado Iniciando treino 4... Iniciando treino 5... Iniciando treino 6... Treino ignorado
# Executando todas as configurações citadas para o algoritmo de KNN
executaModelos(KNeighborsClassifier(),\
[X_train_resample, y_train_resample], [X_train, y_train], [X_validacao, y_validacao],\
[X_train_resample_, y_train_resample_], [X_train_, y_train_], [X_test, y_test], 'KNN')
Iniciando treino 1... Iniciando treino 2... Iniciando treino 3... Iniciando treino 4... Iniciando treino 5... Iniciando treino 6...
# Executando todas as configurações citadas para o algoritmo de Random Forest
executaModelos(RandomForestClassifier(random_state = seed_),\
[X_train_resample, y_train_resample], [X_train, y_train], [X_validacao, y_validacao],\
[X_train_resample_, y_train_resample_], [X_train_, y_train_], [X_test, y_test], 'Random Forest')
Iniciando treino 1... Iniciando treino 2... Iniciando treino 3... Iniciando treino 4... Iniciando treino 5... Iniciando treino 6...
# Ordena os modelos de acordo com a menor Loss
sorted_configuracoes = sorted(configuracoes, key = lambda k: k['Loss'])
Sorting by log loss, we can see below that the two best configurations come from the same algorithm, Logistic Regression, both trained without balancing; the only difference between them is whether calibration on the training set was applied.
Configuration of the best models:
sorted_configuracoes_dt = pd.DataFrame(sorted_configuracoes)
sorted_configuracoes_dt.head()
Algoritmo | Balanceamento | Calibracao | Conjunto Calibracao | Loss | |
---|---|---|---|---|---|
0 | Regressão Logistica | False | False | None | 0.982431 |
1 | Regressão Logistica | False | True | Treino | 0.987791 |
2 | Linear SVM | False | True | Treino | 1.044961 |
3 | Regressão Logistica | True | False | None | 1.051835 |
4 | Regressão Logistica | True | True | Treino | 1.070837 |
def save_model(modelo):
shortFileName = '000'
fileName = 'models/0001.model'
fileObj = Path(fileName)
index = 1
while fileObj.exists():
index += 1
fileName = 'models/' + shortFileName + str(index) + '.model'
fileObj = Path(fileName)
# Salvar modelo
pickle.dump(modelo, open(fileName, 'wb'))
return fileName
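For completeness, a model saved by save_model can be restored later with pickle.load and reused for predictions. A minimal sketch; the file name is simply whatever save_model returned.
def load_model(fileName):
    # Restore a model previously saved with pickle.dump
    with open(fileName, 'rb') as f:
        return pickle.load(f)
# Example usage (assuming 'models/0001.model' was created by save_model):
# modelo_carregado = load_model('models/0001.model')
# pred = modelo_carregado.predict(X_test)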
def plot_general_report(modelo, y_true, y_pred, save = False):
# Calculando Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
# Calculando Precision Matrix
precision = (cm/cm.sum(axis=0))
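# Note: classes that never appear among the predictions make cm.sum(axis=0) zero here, which is what produces the 'invalid value encountered in true_divide' warnings shown in the outputs below.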
# Calculando Recall Matrix
recall = (((cm.T)/(cm.sum(axis=1))).T)
labels = range(1, 10)
# Plot da Confusion Matrix
print("-"*20, "Confusion matrix", "-"*20)
plt.figure(figsize = (20,7))
sns.heatmap(cm, annot = True, cmap = "YlGnBu", fmt = ".3f", xticklabels = labels, yticklabels = labels)
plt.xlabel('Classe Prevista')
plt.ylabel('Classe Real')
plt.show()
# Plot da Precision Matrix
print("-"*20, "Precision matrix", "-"*20)
plt.figure(figsize = (20,7))
sns.heatmap(precision, annot = True, cmap = "YlGnBu", fmt = ".3f", xticklabels = labels, yticklabels = labels)
plt.xlabel('Classe Prevista')
plt.ylabel('Classe Real')
plt.show()
# Plot da Recall Matrix
print("-"*20, "Recall matrix", "-"*20)
plt.figure(figsize = (20,7))
sns.heatmap(recall, annot = True, cmap = "YlGnBu", fmt = ".3f", xticklabels = labels, yticklabels = labels)
plt.xlabel('Classe Prevista')
plt.ylabel('Classe Real')
plt.show()
# Relatorio Macro/Micro
recall = round( recall_score(y_true, y_pred, average = 'macro', zero_division = 0), 4)
precision = round( precision_score(y_true, y_pred, average = 'macro', zero_division = 0), 4)
f1_score_ = round( f1_score(y_true, y_pred, average = 'macro', zero_division = 0), 4)
print('Macro Precision:', precision)
print('Macro Recall:', recall)
print('F1-Score:', f1_score_)
# Salvando modelo sem sobreescrever arquivos existentes
if save:
fileName = save_model(modelo)
return fileName
# Modelo
modelo = RandomForestClassifier(random_state = seed_)
modelo.fit(X_train, y_train)
# Calibração
calibration = CalibratedClassifierCV(base_estimator = modelo, method = 'sigmoid', cv = 3)
calibration.fit(X_train, y_train)
# Realiza as previsões
pred = calibration.predict(X_test)
pred_prob = calibration.predict_proba(X_test)
# Calcula a loss
loss = log_loss(y_test, pred_prob)
print('Loss:', round(loss, 4))
Loss: 1.2063
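To put this loss in perspective: a classifier that always outputs a uniform distribution over the 9 classes scores ln(9) ≈ 2.197, so the calibrated Random Forest is well below that naive baseline but still above the 1.0 target. A quick sanity check, assuming y_test as defined above:
# Log loss of a 'know-nothing' model that predicts probability 1/9 for every class
uniform_pred = np.full((len(y_test), 9), 1 / 9)
print('Uniform baseline log loss:', round(log_loss(y_test, uniform_pred), 4))  # ~ ln(9) = 2.1972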
plot_general_report(calibration, y_test, pred, save = True)
-------------------- Confusion matrix --------------------
invalid value encountered in true_divide
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.6023 Macro Recall: 0.4003 F1-Score: 0.4376
'models/0001.model'
Random Forest is the algorithm we will take forward, so we will now tune it and try to find its best hyperparameters.
def treina_GridSearchCV(modelo, params_, x_treino, y_treino, x_teste, y_teste,\
n_jobs = 20, cv = 5, refit = True, scoring = None, salvar_resultados = False,\
report_treino = False):
grid = GridSearchCV(modelo, params_, n_jobs = n_jobs, cv = cv, refit = refit, scoring = scoring)
print('Iniciando Treino...')
grid.fit(x_treino, y_treino)
print('Treino finalizado')
print('Realizando predições')
pred = grid.predict(x_teste)
modelo_ = grid.best_estimator_
print('Finalizando predições')
print(grid.best_params_)
target_names = range(1, 10)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo_, y_teste, pred, save = True)
if report_treino:
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = grid.predict(x_treino)
plot_general_report(modelo_, y_treino, pred_treino, save = False)
if salvar_resultados:
resultados_df = pd.DataFrame(grid.cv_results_)
return resultados_df
Checking the metrics of the initial model on both training and test data, it is clear that the model overfits: it scores perfectly on the training set but much lower on the test set, which is likely one of the reasons for the low precision on test data. We will therefore have to constrain (prune) the trees to reach better results.
%%time
# Comparativo modelo base v1
params = {
'random_state': [seed_]
}
resultados = treina_GridSearchCV(RandomForestClassifier(), params, X_train, y_train, X_test, y_test, cv = 3,\
report_treino = True, salvar_resultados = True)
Iniciando Treino... Treino finalizado Realizando predições Finalizando predições {'random_state': 194} -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
invalid value encountered in true_divide
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5745 Macro Recall: 0.4131 F1-Score: 0.4445 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 1.0 Macro Recall: 1.0 F1-Score: 1.0 Wall time: 18.7 s
# Modelo v2
params = {
'n_estimators': [250, 500, 1000],
'criterion': ['gini', 'entropy']
}
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4)
resultados = treina_GridSearchCV(modelo, params, X_train, y_train, X_test, y_test, cv = 3,\
report_treino = True, salvar_resultados = True)
Iniciando Treino... Treino finalizado Realizando predições Finalizando predições {'criterion': 'gini', 'n_estimators': 250} -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
invalid value encountered in true_divide
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5713 Macro Recall: 0.4558 F1-Score: 0.4867 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 1.0 Macro Recall: 1.0 F1-Score: 1.0
# Modelo v3
params = {
'n_estimators': [500, 1000],
'criterion': ['gini'],
'max_depth': [2, 4, 6],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [2, 3]
}
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4)
resultados = treina_GridSearchCV(modelo, params, X_train, y_train, X_test, y_test, cv = 3,\
report_treino = True, salvar_resultados = True)
Iniciando Treino... Treino finalizado Realizando predições Finalizando predições {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 1000} -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
invalid value encountered in true_divide
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.4953 Macro Recall: 0.5382 F1-Score: 0.5017 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.7942 Macro Recall: 0.8827 F1-Score: 0.8194
# Modelo v4
params = {
'n_estimators': [500, 1000],
'criterion': ['gini'],
'max_depth': [4, 6, 8, 10],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [2, 4, 6]
}
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4)
resultados = treina_GridSearchCV(modelo, params, X_train, y_train, X_test, y_test, cv = 3,\
report_treino = True, salvar_resultados = True)
Iniciando Treino... Treino finalizado Realizando predições Finalizando predições {'criterion': 'gini', 'max_depth': 8, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 1000} -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
invalid value encountered in true_divide
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5021 Macro Recall: 0.5099 F1-Score: 0.4957 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.8379 Macro Recall: 0.9153 F1-Score: 0.8656
# Modelo v5
params = {
'n_estimators': [500, 1000],
'criterion': ['gini'],
'max_depth': [6, 8, 10],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [4, 6, 8],
'max_features': ['sqrt', 'log2']
}
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4)
resultados = treina_GridSearchCV(modelo, params, X_train, y_train, X_test, y_test, cv = 3,\
report_treino = True, salvar_resultados = True)
Iniciando Treino... Treino finalizado Realizando predições Finalizando predições {'criterion': 'gini', 'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 4, 'n_estimators': 1000} -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
invalid value encountered in true_divide
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5021 Macro Recall: 0.5099 F1-Score: 0.4957 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.8379 Macro Recall: 0.9153 F1-Score: 0.8656
# Modelo v6
params = {
'n_estimators': [500, 1000],
'criterion': ['gini'],
'max_depth': [6, 8, 10],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [4, 6, 8],
'max_features': ['sqrt', 'log2'],
'max_leaf_nodes': [2, 4, 6, 8]
}
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4)
resultados = treina_GridSearchCV(modelo, params, X_train, y_train, X_test, y_test, cv = 3,\
report_treino = True, salvar_resultados = True)
Iniciando Treino... Treino finalizado Realizando predições Finalizando predições {'criterion': 'gini', 'max_depth': 8, 'max_features': 'sqrt', 'max_leaf_nodes': 8, 'min_samples_leaf': 2, 'min_samples_split': 6, 'n_estimators': 500} -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
invalid value encountered in true_divide
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.3942 Macro Recall: 0.4474 F1-Score: 0.3521 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5727 Macro Recall: 0.6452 F1-Score: 0.5294
# Modelo v7
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4,\
criterion = 'gini', max_depth = 8, max_features = 'sqrt',\
min_samples_leaf = 4, min_samples_split = 6, n_estimators = 1000)
modelo.fit(X_train, y_train)
pred = modelo.predict(X_test)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(X_train)
plot_general_report(modelo, y_train, pred_treino, save = False)
-------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
invalid value encountered in true_divide
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5074 Macro Recall: 0.5305 F1-Score: 0.5095 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.828 Macro Recall: 0.9119 F1-Score: 0.8585
# Modelo v8 - SMOTE em Treino
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4, criterion = 'gini',\
max_depth = 8, max_features = 'auto',\
min_samples_leaf = 4, min_samples_split = 6, n_estimators = 1000)
modelo.fit(X_train_resample, y_train_resample)
pred = modelo.predict(X_test)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(X_train_resample)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
-------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.643 Macro Recall: 0.563 F1-Score: 0.5673 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9349 Macro Recall: 0.934 F1-Score: 0.9341
# Modelo v9 - SMOTE em Treino ----- FINAL -----
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4, criterion = 'entropy',\
max_depth = 8, max_features = 'auto',\
min_samples_leaf = 4, min_samples_split = 6, n_estimators = 1000)
modelo.fit(X_train_resample, y_train_resample)
pred = modelo.predict(X_test)
pred_prob = modelo.predict_proba(X_test)
# Calcula a loss
loss = log_loss(y_test, pred_prob)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
print('Loss:', round(loss, 4))
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(X_train_resample)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
-------------------- Report Para Dados de Teste -------------------- Loss: 1.4352 -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.6436 Macro Recall: 0.5617 F1-Score: 0.5668 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9441 Macro Recall: 0.943 F1-Score: 0.9431
Running several algorithms and tuning rounds showed that, as the Log Loss increased, we obtained higher precision for some classes. In other words, precision improved at the cost of greater uncertainty in the predicted probabilities.
We therefore chose to prioritize precision over a lower Log Loss. One problem observed throughout the process was overfitting, which remained high even though we managed to reduce it somewhat.
Tests were run with balanced and unbalanced data, with and without probability calibration, and with a validation set. The preprocessing parameters were also varied, increasing and decreasing the number of components, and different numbers of features generated by the TF-IDF step were tried.
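One of the tests mentioned above was probability calibration. Below is a minimal sketch of how that can be done with the CalibratedClassifierCV already imported, assuming the same X_train_resample/y_train_resample and X_test/y_test splits used in the cells above; the calibration method and cv value are illustrative, not the exact setup that was tested.
# Sketch: Platt (sigmoid) calibration of the best Random Forest to try to lower the log loss
rf_base = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4,\
                                 criterion = 'entropy', max_depth = 8, max_features = 'auto',\
                                 min_samples_leaf = 4, min_samples_split = 6, n_estimators = 1000)
rf_calibrado = CalibratedClassifierCV(rf_base, method = 'sigmoid', cv = 3)
rf_calibrado.fit(X_train_resample, y_train_resample)
pred_prob_cal = rf_calibrado.predict_proba(X_test)
print('Log loss (calibrated):', round(log_loss(y_test, pred_prob_cal), 4))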
Improvement suggestions:
- Tune other algorithms such as XGBoost and/or SVM.
Hyperparameters of the best model:
- Algorithm: Random Forest
- random_state = 194
- class_weight = 'balanced'
- n_jobs = 4
- criterion = 'entropy'
- max_depth = 8
- max_features = 'auto'
- min_samples_leaf = 4
- min_samples_split = 6
- n_estimators = 1000
#----- FINAL -----
modelo = RandomForestClassifier(random_state = seed_, class_weight = 'balanced', n_jobs = 4, criterion = 'entropy',\
max_depth = 8, max_features = 'auto',\
min_samples_leaf = 4, min_samples_split = 6, n_estimators = 1000)
modelo.fit(X_train_resample, y_train_resample)
pred = modelo.predict(X_test)
pred_prob = modelo.predict_proba(X_test)
# Calcula a loss
loss = log_loss(y_test, pred_prob)
print('Loss:', round(loss, 4))
plot_general_report(modelo, y_test, pred, save = True)
Loss: 1.4352 -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.6436 Macro Recall: 0.5617 F1-Score: 0.5668
'models/00011.model'
The model analysis below shows that the components generated from the TF-IDF step are extremely important. We could therefore try to improve the quality of these components or of the TF-IDF extraction itself; the main obstacle is the computational cost that comes with increasing either of them.
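As an illustration of that suggestion, here is a minimal, hedged sketch of a richer TF-IDF/SVD extraction. The name corpus stands for the preprocessed 'Text' column built earlier (the actual variable name in the preprocessing step may differ), and the max_features/n_components values are illustrative rather than tuned.
# Sketch (hypothetical values): larger TF-IDF vocabulary and more SVD components
tfidf_v2 = TfidfVectorizer(max_features = 5000, ngram_range = (1, 2), stop_words = 'english')
tfidf_matrix = tfidf_v2.fit_transform(corpus)
svd_v2 = TruncatedSVD(n_components = 300, random_state = seed_)
componentes_v2 = svd_v2.fit_transform(tfidf_matrix)
# Fraction of variance retained by the larger set of components
print('Explained variance:', round(svd_v2.explained_variance_ratio_.sum(), 4))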
# Criando o explainer do modelo
explainer = shap.TreeExplainer(modelo)
# Interpretação da predição 0
shap.initjs()
shap_data = X_test_original.iloc[0]
shap_values = explainer.shap_values(shap_data)
shap.force_plot(explainer.expected_value[1], shap_values[1], shap_data)
# Interpretação da predição 0 a 5
# Executar e após visualizar limpar a celula. Fica muito pesado no notebook
'''shap.initjs()
shap_data = X_test_original.iloc[0:5]
shap_values = explainer.shap_values(shap_data)
shap.force_plot(explainer.expected_value[1], shap_values[1], shap_data)'''
# Interpretação do modelo em relação as 10 primeiras predições
shap.initjs()
shap_data = X_test_original.iloc[0:10]
shap_values = explainer.shap_values(shap_data)
shap.summary_plot(shap_values[1], shap_data)
After verifying that the Random Forest was overfitting heavily and that it was hard to reduce this through hyperparameters alone, we opted to try XGBoost, which has shown better results in similar cases.
# Modelo v1
dtrain = xgb.DMatrix(data = X_train_resample, label = y_train_resample)
dtest = xgb.DMatrix(data = X_test, label = y_test)
params = {
'objective': 'multi:softmax',
'num_class': 10,
'random_state': seed_
}
modelo = xgb.train(params = params, dtrain = dtrain)
pred = modelo.predict(dtest)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(dtrain)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
[18:42:02] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.6208 Macro Recall: 0.555 F1-Score: 0.5536 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9472 Macro Recall: 0.9465 F1-Score: 0.9465
# Modelo v2
dtrain = xgb.DMatrix(data = X_train_resample, label = y_train_resample)
dtest = xgb.DMatrix(data = X_test, label = y_test)
params = {
# Definições de ambiente de treino
'objective': 'multi:softmax',
'num_class': 10,
'random_state': seed_,
'nthread': 2,
# Hiperparametros a serem ajustados
'colsample_bynode': 1,
'colsample_bytree': 1,
'gamma': 0,
'learning_rate': 0.3,
'max_depth': 6,
'min_child_weight': 1,
'subsample': 0.8,
'colsample_bylevel': 1
}
modelo = xgb.train(params = params, dtrain = dtrain)
pred = modelo.predict(dtest)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(dtrain)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
[18:42:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5626 Macro Recall: 0.5512 F1-Score: 0.5462 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9454 Macro Recall: 0.9447 F1-Score: 0.9447
# Modelo v3
dtrain = xgb.DMatrix(data = X_train_resample, label = y_train_resample)
dtest = xgb.DMatrix(data = X_test, label = y_test)
params = {
# Definições de ambiente de treino
'objective': 'multi:softmax',
'num_class': 10,
'random_state': seed_,
'nthread': 2,
# Hiperparametros a serem ajustados
'colsample_bynode': 1,
'colsample_bytree': 1,
'gamma': 0,
'learning_rate': 0.5,
'max_depth': 6,
'min_child_weight': 1,
'subsample': 0.8,
'colsample_bylevel': 1
}
modelo = xgb.train(params = params, dtrain = dtrain)
pred = modelo.predict(dtest)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(dtrain)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
[18:43:14] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.615 Macro Recall: 0.5462 F1-Score: 0.5461 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9521 Macro Recall: 0.9514 F1-Score: 0.9514
# Modelo v4
dtrain = xgb.DMatrix(data = X_train_resample, label = y_train_resample)
dtest = xgb.DMatrix(data = X_test, label = y_test)
params = {
# Definições de ambiente de treino
'objective': 'multi:softmax',
'num_class': 10,
'random_state': seed_,
'nthread': 2,
# Hiperparametros a serem ajustados
'colsample_bynode': 1,
'colsample_bytree': 1,
'gamma': 0.5,
'learning_rate': 0.5,
'max_depth': 6,
'min_child_weight': 1,
'subsample': 0.8,
'colsample_bylevel': 1
}
modelo = xgb.train(params = params, dtrain = dtrain)
pred = modelo.predict(dtest)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(dtrain)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
[18:44:08] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.6213 Macro Recall: 0.5376 F1-Score: 0.5456 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9524 Macro Recall: 0.9515 F1-Score: 0.9516
# Modelo v5
dtrain = xgb.DMatrix(data = X_train_resample, label = y_train_resample)
dtest = xgb.DMatrix(data = X_test, label = y_test)
params = {
# Definições de ambiente de treino
'objective': 'multi:softmax',
'num_class': 10,
'random_state': seed_,
'nthread': 2,
# Hiperparametros a serem ajustados
'colsample_bynode': 1,
'colsample_bytree': 1,
'gamma': 0.5,
'learning_rate': 0.5,
'max_depth': 10,
'min_child_weight': 1,
'subsample': 0.8,
'colsample_bylevel': 1
}
modelo = xgb.train(params = params, dtrain = dtrain)
pred = modelo.predict(dtest)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(dtrain)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
[18:45:06] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5723 Macro Recall: 0.5366 F1-Score: 0.5444 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9561 Macro Recall: 0.9554 F1-Score: 0.9554
# Modelo v6
dtrain = xgb.DMatrix(data = X_train_resample, label = y_train_resample)
dtest = xgb.DMatrix(data = X_test, label = y_test)
params = {
# Definições de ambiente de treino
'objective': 'multi:softmax',
'num_class': 10,
'random_state': seed_,
'nthread': 2,
# Hiperparametros a serem ajustados
'colsample_bynode': 1,
'colsample_bytree': 1,
'gamma': 0.5,
'learning_rate': 0.5,
'max_depth': 8,
'min_child_weight': 1,
'subsample': 0.8,
'colsample_bylevel': 1
}
modelo = xgb.train(params = params, dtrain = dtrain)
pred = modelo.predict(dtest)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(dtrain)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
[18:46:17] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.6196 Macro Recall: 0.522 F1-Score: 0.5365 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9551 Macro Recall: 0.9545 F1-Score: 0.9545
Training the XGBoost algorithm did not improve the model's precision, which is likely due to its high complexity; it may be better to use models such as SVM, which tend to cope better with high-dimensional data.
Improvement suggestions:
- Tune other algorithms such as Linear SVM and Logistic Regression (see the sketch below).
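As a starting point for that suggestion, a minimal sketch of a Linear SVM is shown below, using the already imported SGDClassifier with hinge loss and wrapping it in CalibratedClassifierCV so that predict_proba (and therefore log loss) remains available. The alpha, method and cv values are illustrative, not tuned.
# Sketch: linear SVM (hinge loss) with probability calibration
svm_linear = SGDClassifier(loss = 'hinge', class_weight = 'balanced', alpha = 0.0001,\
                           random_state = seed_, n_jobs = -1)
svm_calibrado = CalibratedClassifierCV(svm_linear, method = 'sigmoid', cv = 3)
svm_calibrado.fit(X_train_resample, y_train_resample)
pred_proba_svm = svm_calibrado.predict_proba(X_test)
print('Log loss:', round(log_loss(y_test, pred_proba_svm), 4))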
Hyperparameters of the best model:
- Algorithm: XGBoost
- 'objective': 'multi:softmax'
- 'num_class': 10
- 'random_state': 194
- 'nthread': 2
- 'colsample_bynode': 1
- 'colsample_bytree': 1
- 'gamma': 0.5
- 'learning_rate': 0.5
- 'max_depth': 6
- 'min_child_weight': 1
- 'subsample': 0.8
- 'colsample_bylevel': 1
dtrain = xgb.DMatrix(data = X_train_resample, label = y_train_resample)
dtest = xgb.DMatrix(data = X_test, label = y_test)
params = {
# Definições de ambiente de treino
'objective': 'multi:softmax',
'num_class': 10,
'random_state': seed_,
'nthread': 2,
# Hiperparametros a serem ajustados
'colsample_bynode': 1,
'colsample_bytree': 1,
'gamma': 0.5,
'learning_rate': 0.5,
'max_depth': 6,
'min_child_weight': 1,
'subsample': 0.8,
'colsample_bylevel': 1
}
modelo = xgb.train(params = params, dtrain = dtrain)
pred = modelo.predict(dtest)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(dtrain)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
[18:47:21] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior. -------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.6213 Macro Recall: 0.5376 F1-Score: 0.5456 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9524 Macro Recall: 0.9515 F1-Score: 0.9516
XGBoost behaves differently: the variables generated by the One Hot Encoding of 'Gene' and 'Variation' stand out a little more. Even so, its performance was still not superior to the Random Forest.
# Criando o explainer do modelo
explainer = shap.TreeExplainer(modelo)
# Interpretação da predição 0
shap.initjs()
shap_data = X[0:1]
shap_values = explainer.shap_values(shap_data)
shap.force_plot(explainer.expected_value[1], shap_values[1], shap_data)
# Interpretação da predição 0 a 5
# Executar e após visualizar limpar a celula. Fica muito pesado no notebook
'''
shap.initjs()
shap_data = X[0:5]
shap_values = explainer.shap_values(shap_data)
shap.force_plot(explainer.expected_value[1], shap_values[1], shap_data)
'''
# Interpretação do modelo em relação as 10 primeiras predições
shap.initjs()
shap_data = X[0:10]
shap_values = explainer.shap_values(shap_data)
shap.summary_plot(shap_values[1], shap_data)
After testing two tree-based models, we found that they do not adapt very well to our data. We will therefore try Logistic Regression, which tends to generalize better with sparse, probability-like features such as those produced by TF-IDF.
# Modelo Base v1
# Dados balanceados
modelo = SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_)
modelo.fit(X_train_resample, y_train_resample)
pred_proba = modelo.predict_proba(X_test)
pred = modelo.predict(X_test)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
print('Log loss: ', log_loss(y_test, pred_proba))
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(X_train_resample)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
-------------------- Report Para Dados de Teste -------------------- Log loss: 1.0518353191685434 -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5056 Macro Recall: 0.5497 F1-Score: 0.5213 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.9175 Macro Recall: 0.916 F1-Score: 0.9161
# Modelo v2
# Dados balanceados
# Hiperparametros
alpha = [10 ** x for x in range(-8, 5)]
for alpha_ in alpha:
modelo = SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_, alpha = alpha_, n_jobs = -1)
modelo.fit(X_train_resample, y_train_resample)
pred_proba = modelo.predict_proba(X_test)
print('Alpha =', alpha_, 'Log loss: ', log_loss(y_test, pred_proba))
Alpha = 1e-08 Log loss: 12.557240686123171 Alpha = 1e-07 Log loss: 9.948364272756928 Alpha = 1e-06 Log loss: 2.5393847350276366 Alpha = 1e-05 Log loss: 1.2293533914833659 Alpha = 0.0001 Log loss: 1.0518353191685434 Alpha = 0.001 Log loss: 1.3362256067301244 Alpha = 0.01 Log loss: 1.873132306957628 Alpha = 0.1 Log loss: 2.152957934375349 Alpha = 1 Log loss: 2.1931293608559512 Alpha = 10 Log loss: 2.1968277104569105 Alpha = 100 Log loss: 2.1971851209375917 Alpha = 1000 Log loss: 2.1972206351288825 Alpha = 10000 Log loss: 2.197224183179714
# Modelo v3
# Dados balanceados
modelo = SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_, alpha = 0.0001, n_jobs = -1,\
early_stopping = True, validation_fraction = 0.2)
modelo.fit(X_train_resample, y_train_resample)
pred_proba = modelo.predict_proba(X_test)
print('Log loss: ', log_loss(y_test, pred_proba))
Log loss: 1.0814224486621251
# Modelo v4
# Dados balanceados
modelo = SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_, alpha = 0.0001, n_jobs = -1,\
early_stopping = True, validation_fraction = 0.1)
modelo.fit(X_train_resample, y_train_resample)
pred_proba = modelo.predict_proba(X_test)
print('Log loss: ', log_loss(y_test, pred_proba))
Log loss: 1.071817021249009
# Modelo v5
# Dados desbalanceados
# Hiperparametros
alpha = [10 ** x for x in range(-8, 5)]
for alpha_ in alpha:
modelo = SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_, alpha = alpha_, n_jobs = -1)
modelo.fit(X_train, y_train)
pred_proba = modelo.predict_proba(X_test)
print('Alpha =', alpha_, 'Log loss: ', log_loss(y_test, pred_proba))
Alpha = 1e-08 Log loss: 10.43141520629411 Alpha = 1e-07 Log loss: 8.725013423752113 Alpha = 1e-06 Log loss: 4.316561702786419 Alpha = 1e-05 Log loss: 1.2704659684069648 Alpha = 0.0001 Log loss: 0.9824312299655253 Alpha = 0.001 Log loss: 1.2574183830138528 Alpha = 0.01 Log loss: 1.8002290019588107 Alpha = 0.1 Log loss: 2.123678064826091 Alpha = 1 Log loss: 2.189328056558808 Alpha = 10 Log loss: 2.196498909773525 Alpha = 100 Log loss: 2.1971955827075913 Alpha = 1000 Log loss: 2.197246372579398 Alpha = 10000 Log loss: 2.1972406385525165
# Modelo v6
# Dados desbalanceados
modelo = SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_, alpha = 0.0001, n_jobs = -1,\
early_stopping = True, validation_fraction = 0.1)
modelo.fit(X_train, y_train)
pred_proba = modelo.predict_proba(X_test)
print('Log loss: ', log_loss(y_test, pred_proba))
Log loss: 0.9759681812888219
# Analise modelo v6
pred = modelo.predict(X_test)
print('-'*20, 'Report Para Dados de Teste', '-'*20)
plot_general_report(modelo, y_test, pred, save = True)
print('-'*20, 'Report Para Dados de Treino', '-'*20)
pred_treino = modelo.predict(X_train_resample)
plot_general_report(modelo, y_train_resample, pred_treino, save = False)
-------------------- Report Para Dados de Teste -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.5532 Macro Recall: 0.52 F1-Score: 0.5263 -------------------- Report Para Dados de Treino -------------------- -------------------- Confusion matrix --------------------
-------------------- Precision matrix --------------------
-------------------- Recall matrix --------------------
Macro Precision: 0.8536 Macro Recall: 0.8248 F1-Score: 0.8241
# Modelo v7
# Dados desbalanceados
modelo = SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_, alpha = 0.0001, n_jobs = -1,\
early_stopping = True, validation_fraction = 0.2)
modelo.fit(X_train, y_train)
pred_proba = modelo.predict_proba(X_test)
print('Log loss: ', log_loss(y_test, pred_proba))
Log loss: 0.9995458291033384
# Modelo Final
modelo = SGDClassifier(loss = 'log', class_weight = 'balanced', random_state = seed_, alpha = 0.0001, n_jobs = -1)
modelo.fit(X_train, y_train)
pred_proba = modelo.predict_proba(X_test)
print('Log loss: ', log_loss(y_test, pred_proba))
Log loss: 0.9824312299655253
Logistic Regression achieved the lowest log loss, which is positive. Compared with the other algorithms, it also clearly gave more weight to the variables generated by the One Hot Encoder, a behavior we had already observed with XGBoost.
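A rough way to check that claim directly is to compare the average absolute coefficients of the two feature groups. The sketch below assumes the One Hot Encoded Gene/Variation columns come first in the feature matrix and uses a placeholder n_onehot for their count; both are assumptions about how X was assembled earlier, not the exact layout.
# Sketch (hypothetical layout): mean |coefficient| per feature group for the fitted SGDClassifier
coef_abs = np.abs(modelo.coef_).mean(axis = 0)   # (n_classes, n_features) -> (n_features,)
n_onehot = 100  # placeholder: actual number of one-hot Gene/Variation columns
print('Mean |coef| one-hot columns:', round(coef_abs[:n_onehot].mean(), 4))
print('Mean |coef| TF-IDF/SVD components:', round(coef_abs[n_onehot:].mean(), 4))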
# Criando o explainer do modelo
explainer = shap.LinearExplainer(modelo, X_train)
# Interpretação da predição 0
shap.initjs()
shap_data = X[0:1]
shap_values = explainer.shap_values(shap_data)
shap.force_plot(explainer.expected_value[1], shap_values[1], shap_data)
# Interpretação da predição 0 a 5
# Executar e após visualizar limpar a celula. Fica muito pesado no notebook
'''
shap.initjs()
shap_data = X[0:5]
shap_values = explainer.shap_values(shap_data)
shap.force_plot(explainer.expected_value[1], shap_values[1], shap_data)
'''
# Interpretação do modelo em relação as 10 primeiras predições
shap.initjs()
shap_data = X[0:10]
shap_values = explainer.shap_values(shap_data)
shap.summary_plot(shap_values[1], shap_data)
After extensive work on exploratory analysis, preprocessing and modeling, we arrived at two best candidate models. The first, built with Random Forest and analyzed earlier, shows high overfitting but delivers the highest precision. Logistic Regression was then used for its better generalization and its fit with probabilistic features, yielding the lowest Log Loss, although its precision was not satisfactory.
Therefore, we could not meet both stated objectives with a single model, only with separate models. It would still be possible to continue this work with new parameters and more intensive preprocessing of the data; unfortunately, the computational limits of the machine used for the project did not allow it.