Personalized Medicine (PM) [1] promises more efficient and safer health care by enabling earlier diagnoses, risk assessments, and optimal treatments for individual patients. However, PM in cancer treatment is developing slowly because large amounts of text-based medical literature and clinical observations must still be analyzed manually. To address this issue and speed up the progress of PM, we propose an efficient text classifier that automatically classifies the effects of genetic variants. We show that contributions from the NLP community and domain specialists can help bring PM into a bright future.
The data comes from a Kaggle competition and was provided by Memorial Sloan Kettering Cancer Center.
Two data files were used in this project, with 3321 data points in total. Gene and Variation are categorical features. Class contains 9 unique values. TEXT contains the biomedical literature describing each class, i.e., each effect of a genetic variant.
- Text file columns: ID, TEXT
- Variants file columns: ID, Gene, Variation, Class

Steps:
(Stop words were removed when building the machine learning classifiers but kept when building the neural networks, since neural networks try to learn semantic meaning and the meaning of a word depends on its context.)
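The stop-word policy above can be sketched with a single preprocessing function. This is a minimal illustration, not the project's actual script: the function name `preprocess` and the use of sklearn's built-in English stop-word list are assumptions.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(text, remove_stop_words=True):
    """Lowercase, keep alphanumeric tokens, and optionally drop stop words.

    remove_stop_words=True for the ML classifiers; False for the neural
    networks, which rely on full context to learn word meaning.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)
```

The same corpus can then be vectorized twice, once per model family, without duplicating the cleaning logic.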
Here is the EDA demo and some interesting findings.
Methods:
One-hot encode the Gene and Variation features
Features for training machine learning models: one-hot Gene and Variation vectors + SVD-truncated count vectors and TF-IDF vectors of the text.
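The ML feature construction above can be sketched with sklearn. The toy DataFrame and the SVD dimensionality are illustrative (the TF-IDF branch is shown; the count-vector branch is analogous):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the merged variants + text data.
df = pd.DataFrame({
    "Gene": ["BRCA1", "TP53", "BRCA1", "EGFR"],
    "Variation": ["Truncating Mutations", "R175H", "C61G", "L858R"],
    "TEXT": ["loss of function truncating variant",
             "missense mutation in the DNA binding domain",
             "ring domain missense variant disrupts binding",
             "activating kinase domain mutation"],
})

# One-hot encode the categorical Gene and Variation columns.
onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(
    df[["Gene", "Variation"]]).toarray()

# TF-IDF vectors of the literature text, reduced by truncated SVD.
tfidf = TfidfVectorizer().fit_transform(df["TEXT"])
svd = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)

# Final feature matrix: one-hot gene/variation + SVD-truncated TF-IDF.
X = np.hstack([onehot, svd])
```

`handle_unknown="ignore"` matters at test time, since unseen genes or variations would otherwise raise an error.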
Features for training neural networks: feature vectors produced by the pubmed2018_w2v_400D pre-trained model (except for BERT, which uses its own embeddings).
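One common way to turn a pre-trained word2vec model into document features is to average the vectors of in-vocabulary tokens; this sketch assumes that approach (the project may feed token-level embeddings directly instead). The `doc_vector` helper is hypothetical, and `embeddings` stands for any word-to-vector mapping, e.g. a gensim `KeyedVectors` loaded from the pubmed2018_w2v_400D binary:

```python
import numpy as np

# Assumed loading step (binary word2vec format):
# kv = gensim.models.KeyedVectors.load_word2vec_format(
#     "pubmed2018_w2v_400D.bin", binary=True)


def doc_vector(text, embeddings, dim=400):
    """Average the word vectors of all in-vocabulary tokens.

    Returns a zero vector when no token is in the vocabulary, so every
    document maps to a fixed-length feature vector.
    """
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

Averaging loses word order, which is one reason the CNN/BiLSTM models below consume token sequences rather than a single pooled vector.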
Here is an evaluation of some text feature extraction methods.
Eight supervised machine learning methods were applied.
Here is an evaluation of model performance.
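The repository's plots report accuracy, F1 score, and log loss (the Kaggle competition metric), so a shared evaluation helper along these lines is plausible; the function name `evaluate` and its return format are assumptions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss


def evaluate(y_true, proba, labels):
    """Multi-class log loss, accuracy, and macro F1 from predicted probabilities.

    proba has one column per class, ordered as in `labels`.
    """
    pred = np.asarray(labels)[np.argmax(proba, axis=1)]
    return {
        "log_loss": log_loss(y_true, proba, labels=labels),
        "accuracy": accuracy_score(y_true, pred),
        "macro_f1": f1_score(y_true, pred, average="macro", labels=labels),
    }
```

Macro-averaged F1 is a reasonable choice here because the 9 classes are imbalanced, so per-class performance should count equally.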
Two neural networks were implemented: CNN [2] and BiLSTM [3]. RCNN, RNN+Attention, and BERT are worth trying in future work.
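A minimal PyTorch sketch of the sentence-classification CNN of [2] (Kim, 2014) is shown below; the class name and hyperparameters are illustrative, not the repository's `models/CNN.py`. The 400-dimensional embedding matches the pre-trained vectors, and 9 output classes match the task:

```python
import torch
import torch.nn as nn


class TextCNN(nn.Module):
    """Sentence-classification CNN in the spirit of Kim (2014):
    parallel convolutions over word embeddings + max-over-time pooling."""

    def __init__(self, vocab_size=5000, embed_dim=400, num_classes=9,
                 num_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                       # x: (batch, seq_len) token ids
        e = self.embed(x).transpose(1, 2)       # (batch, embed_dim, seq_len)
        # Max-over-time pooling per filter size, then concatenate.
        pooled = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # (batch, num_classes) logits
```

In practice the embedding layer would be initialized from the pubmed2018_w2v_400D vectors rather than learned from scratch.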
Pre-trained word embedding: pubmed_w2v_400D
Python 3.8
PyTorch 1.7.0
```
Personalized-Medicine
├── eight-ml-classifiers
│ ├── images
│ │ ├── confusion-matrix
│ │ ├── learning-curve
│ │ ├── Accuracy_allmodel.png
│ │ ├── F1score_allmodels.png
│ │ └── logloss_allmodels.png
│ ├── README.md
│ ├── data-preprocessing_v1.py
│ ├── data-preprocessing_v2.py
│ ├── feature-extraction.py
│ ├── model-evaluation.py
│ ├── performance-of-ml-classifiers.ipynb
│ ├── train-models.py
│ ├── workflow-part1.ipynb
│ └── workflow-part2.ipynb
├── exploratory-data analysis
│ ├── images
│ │ ├── other_distribution
│ │ │ ├── dist_char.png
│ │ │ ├── dist_class.png
│ │ │ ├── dist_gene.png
│ │ │ ├── dist_variation.png
│ │ │ ├── dist_word.png
│ │ │ ├── gene_class.png
│ │ │ └── word_class.png
│ │ ├── uni_bi_trigram_distribution
│ │ │ ├── bi_c1.png
│ │ │ ├── ...
│ │ │ ├── bi_c9.png
│ │ │ ├── tri_c1.png
│ │ │ ├── ...
│ │ │ ├── tri_c9.png
│ │ │ ├── uni_c1.png
│ │ │ ├── ...
│ │ │ └── uni_c9.png
│ │ └── wordcloud_image
│ │   ├── wordCloud_class_1.png
│ │   ├── ...
│ │   ├── wordCloud_class_9.png
│ │   ├── wordCloud_not_strict.png
│ │   └── wordCloud_strict.png
│ ├── eda-demo.ipynb
│ ├── eda-gene-variation.py
│ ├── eda-text.py
│ └── resampling.ipynb
├── neural-nets
│ ├── image
│ │ ├── CE
│ │ ├── acc
│ │ ├── cm
│ │ ├── f1score
│ │ └── logloss
│ ├── models
│ │ ├── __pycache__
│ │ ├── CNN.py
│ │ └── BiLSTM.py
│ ├── LICENSE
│ ├── run.py
│ ├── train_eval.py
│ ├── utils.py
│ └── visualize.py
├── word-embedding-and-bow
│ ├── README.md
│ ├── bioconceptvec-rf.py
│ ├── biosentvec-rf.py
│ ├── biowordvec-rf.py
│ ├── glove-rf.py
│ ├── tfidf-count-rf.py
│ └── word2vec-rf.py
├── LICENSE.txt
└── README.md
```
[1] Personalized Medicine: Part 1: Evolution and Development into Theranostics
[2] Yoon Kim. "Convolutional Neural Networks for Sentence Classification." EMNLP 2014.
[3] Pengfei Liu, Xipeng Qiu, Xuanjing Huang. "Recurrent Neural Network for Text Classification with Multi-Task Learning." IJCAI 2016.