eCTD Document Classification with Python and Machine Learning:
Intro:
Business Problem:
- Automatically classify the eCTD documents used in health-authority submissions into their corresponding section folders.
- In machine learning terminology, this is a supervised multi-class classification problem where the target classes (section numbers) are known in advance and form a finite set.
Background:
To market a drug and obtain approval from the state drug regulator, firms must follow the requisite guidelines and respond to the authority through a submission process involving eCTD (electronic Common Technical Document) files.
This entails a large volume of document exchange between the drug seller and the regulator, typically as eCTD PDF documents. Most of the correspondence relates to rectifiable documentation, data maintenance, production quality, stability issues, microbial contamination, and quality checks.
Due to the large volume of documents, it becomes difficult to manage the process effectively. Hence, an application was developed to help all stakeholders organise the submission process.
Currently, this process is handled manually by publishers who map each document to its section folder in the application. This adds time to uploading files into the right folders and often pushes submissions to the last minute.
Each document belongs to a specific section in the eCTD folder structure.
Automating this process with machine learning would considerably reduce the human effort and time spent in the initial phase, with even greater savings after full implementation.
Data Description:
Input data is in the form of PDF documents, so most of the data is unstructured text.
Since there are many sections, each with section-specific content, the PDF documents contain varied information about the drugs.
This includes, but is not restricted to:
- Structure, Properties, Nomenclature of drug.
- Impurities and Characteristics.
- Manufacturing site, manufacturing process, and process validation procedures.
- Containment of the finished drug.
- Stability data.
- Drug substance (API) and excipient information.
- Quality tests and contamination checks.
The raw data contained a lot of noise, which made classification very challenging.
For example, documents contain chemical terminology such as chemical names, molecular structures, and images.
The level of detail also varied greatly, both within a section and across sections: some documents were only 1–2 pages, while others stretched to more than 100 pages.
Methodology:
- Random Forest classifier used for model building.
- Train/test split for model validation.
- Data randomisation with Python's random module.
- Python data structures: lists, sets, pandas DataFrames, NumPy arrays.
- String functions, regular expressions, and lambda (anonymous) functions for data cleaning / feature engineering.
- Lemmatization (converting words to their root form) for improved feature extraction.
- Matplotlib and Seaborn libraries for visualising the confusion matrix.
- CountVectorizer and its vocabulary used for feature selection.
- Model trained on text documents predicts the section with a 92%+ score across 23 sections.
- Evaluation using the confusion matrix, classification report, and accuracy score.
- Predicted probability for each label shown alongside the predicted label in the output.
Python Libraries Used:
- Scikit-Learn (sklearn)
- PyPDF2
- Pandas
- NumPy
- os
- Pickle
- re (Regular expressions)
- nltk
- Matplotlib
- Seaborn
Data Structures used:
- Lists and List functions
- List Comprehensions
- Pandas Dataframe
- Dictionaries
- Lambda functions
- Regular expressions
- NumPy arrays
Documents used:
- No. of sections (target classes): 23 (for 32S)
- No. of training documents: 120
- No. of test documents: 30
- Total documents: 150
Data Ingestion:
Input data consists of unstructured, text-based PDF documents.
I used the PyPDF2 library to convert these PDFs into text readable by Python.
I decided to extract data from the first 3 pages of each PDF document.
The text from each document was stored as a single row in a pandas DataFrame and subsequently exported to Excel.
Ground truth labels:
Added the correct (true) label for each document in Excel after consulting a functional SME.
Imported this Excel file, containing each document's text along with its true label, into Python as a pandas DataFrame, which served as the initial processed data.
Data Cleaning:
- Numeric data (removed conditionally)
- Stop words
- Geo-specific words
- Common chemical names
- Punctuation
Data Standardization:
- Tokenisation
- Stemming / lemmatization
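A minimal sketch of the cleaning and tokenisation steps above. To keep it self-contained I use a tiny inline stop-word list; the project itself used nltk's stop-word corpus plus domain lists (geo-specific terms, common chemical names) and nltk's WordNetLemmatizer, which require downloaded corpora:

```python
import re

# Tiny stand-in for nltk's stop-word list (illustrative only).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}

def clean_text(text, drop_numbers=True):
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    if drop_numbers:
        text = re.sub(r"\b\d+\b", " ", text)  # conditional numeric removal
    # Tokenise on whitespace and drop stop words.
    return [t for t in text.split() if t not in STOP_WORDS]

print(clean_text("The stability of Batch 101 was tested in 3 sites."))
# → ['stability', 'batch', 'was', 'tested', 'sites']
```

Numeric removal is conditional because some sections (e.g. stability tables) may carry signal in numbers, so the flag can be turned off per experiment.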
Data Pre-processing:
- Feature Creation — Text Length
- Feature Binning — Text length
- Transformation — Feature binning.
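The text-length feature and its binning can be sketched with `pandas.cut`; the bin edges below are illustrative, not the ones used in the project:

```python
import pandas as pd

df = pd.DataFrame({"text": ["short doc",
                            "a much longer stability report " * 20]})

# Feature creation: raw character length of each document.
df["text_length"] = df["text"].str.len()

# Feature binning / transformation: bucket raw length into coarse categories.
df["length_bin"] = pd.cut(df["text_length"],
                          bins=[0, 100, 1_000, 100_000],
                          labels=["short", "medium", "long"])
print(df[["text_length", "length_bin"]])
```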
Feature engineering:
Vectorising the tokenised text before feeding it to the ML model.
The following vectorisation algorithms were tested and implemented:
- Count Vectorizer (Count)
- N-gram Vectorizer
- Tf-Idf Vectorizer (Feature weighing)
The Tf-Idf vectoriser produced marginally better results (1–2%) than the count vectoriser.
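The three vectorisation options above are all available in scikit-learn; a quick comparison on a toy corpus (the documents below are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["stability data for the drug substance",
        "manufacturing process validation",
        "stability testing of the drug product"]

count_vec = CountVectorizer(ngram_range=(1, 2))  # counts, unigrams + bigrams
tfidf_vec = TfidfVectorizer()                    # down-weights ubiquitous terms

X_count = count_vec.fit_transform(docs)
X_tfidf = tfidf_vec.fit_transform(docs)

print(X_count.shape, X_tfidf.shape)  # one row per document
```

Tf-Idf's feature weighing is why it edges out raw counts here: terms that appear in every section's documents contribute less to the decision.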
Model Selection
- Naive Bayes
- Random Forest
- Gradient Boosting
Model Evaluation:
- K- fold Cross validation
Model Building:
- Random Forest
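The model-building step (Tf-Idf features into a Random Forest, validated on a train/test split) might look like this sketch; the corpus and section labels below are toy stand-ins, not the project's data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

texts = ["stability data long term", "manufacturing process description",
         "stability accelerated study", "process validation batches"] * 10
labels = ["32s71", "32s22", "32s71", "32s25"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# Chaining the vectoriser and classifier keeps the vocabulary
# fitted on training data only.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

`predict_proba` on the same pipeline yields the per-label probabilities reported in the output.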
Evaluation:
- Train/test split for model evaluation.
- The model is trained on part of the data, and the remainder is used as a holdout (test) set for scoring and evaluation.
- Confusion matrix: instead of relying only on accuracy, confusion-matrix analysis gives a holistic view of model performance.
- Precision / recall / F1-score (classification report):
- These metrics show how the model performs on each class.
- The classification_report function in sklearn.metrics builds an easy-to-understand report with precision, recall, F1-score, and weighted averages.
NOTE:
- Precision is TP / (TP + FP), the fraction of predicted positives that are correct, while recall (the true positive rate) is TP / (TP + FN), the fraction of actual positives that are found.
- F1 Score serves as a good indicator of model performance.
- Misclassification rate could also be used as an additional metric.
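All three evaluation tools come straight from sklearn.metrics; a small worked example with made-up predictions over three stability sections:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Toy labels; one 32s72 document is confused with 32s71.
y_true = ["32s71", "32s72", "32s71", "32s73", "32s72"]
y_pred = ["32s71", "32s71", "32s71", "32s73", "32s72"]

labels = ["32s71", "32s72", "32s73"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows=true, cols=pred
print(classification_report(y_true, y_pred, labels=labels))
print(accuracy_score(y_true, y_pred))  # 0.8 (4 of 5 correct)
```

The off-diagonal cell in the matrix pinpoints exactly which pair of sections the model confuses, which accuracy alone hides.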
Deployment:
The trained model was packaged and deployed as a service on a Linux server.
The model service is invoked via a REST API request and returns a JSON response with the following details:
- PDF Document name
- Label (predicted section)
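Building that JSON response might look like the sketch below. The field names and the stub model are my assumptions for illustration; the real service wraps the pickled pipeline:

```python
import json

class StubModel:
    """Stand-in for the pickled classification pipeline (illustrative only)."""
    def predict(self, texts):
        return ["32s71" for _ in texts]

def predict_response(pdf_name, extracted_text, model):
    """Build the JSON payload the service returns for one document."""
    label = model.predict([extracted_text])[0]
    return json.dumps({"document": pdf_name, "label": label})

print(predict_response("stability_report.pdf",
                       "long term stability data", StubModel()))
# → {"document": "stability_report.pdf", "label": "32s71"}
```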
Model Retraining and Feedback Module:
To increase accuracy, I built a retraining module that takes feedback (the true label from the user) and uses this data to retrain the model.
This helps the model distinguish between closely related sections (e.g. 32s71, 32s72, and 32s73, which all hold stability data), leading to a more robust model.
Full training data = Retraining data + Training data
This retraining module is implemented as a Python script that can be run from a terminal and prompts the user for input.
Each piece of feedback is stored as an additional record in a pandas DataFrame and saved to Excel.
Ideally, this script would run at a specified frequency (daily/weekly) and retrain the model.
The retrained model (saved as a .mdl file) is then used to predict future cases.
Additionally, I used pickle to save the trained model (.mdl) and the vectoriser vocabulary (.pkl) for later use.
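Persisting the vocabulary with pickle can be sketched as follows (file paths here are temporary stand-ins; the project saved the model object the same way under a .mdl suffix):

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["stability data", "process validation"])

# Persist the fitted vocabulary so the deployed service vectorises new
# documents exactly as at training time.
path = os.path.join(tempfile.gettempdir(), "vocab.pkl")
with open(path, "wb") as f:
    pickle.dump(vec.vocabulary_, f)

# Later (e.g. in the prediction service): rebuild a vectoriser
# from the stored vocabulary.
with open(path, "rb") as f:
    vocab = pickle.load(f)
restored = CountVectorizer(vocabulary=vocab)
print(restored.transform(["stability data"]).toarray())
# → [[1 0 1 0]]
```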
Results:
Initial results were good; however, as more documents were added per section, accuracy dropped slightly.
- Naive Bayes was chosen initially to classify the documents and achieved >80% accuracy.
- Random Forest improved accuracy to around 90% for the 32S sections. Its feature importances also revealed the most informative words for each target section.
- Gradient boosting increased accuracy to 92–93%.
- K-fold cross-validation was implemented to increase the robustness of the model and reduce overfitting on the training data.
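The K-fold cross-validation mentioned above is a one-liner with `cross_val_score`; the corpus and labels below are toy stand-ins:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

texts = ["stability long term data", "manufacturing process description",
         "accelerated stability study", "process validation report"] * 10
labels = ["32s71", "32s22", "32s71", "32s25"] * 10

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("rf", RandomForestClassifier(n_estimators=50,
                                               random_state=0))])

# 5-fold (stratified) CV: every document is scored exactly once as part of
# a held-out fold, giving a sturdier estimate than a single train/test split.
scores = cross_val_score(pipe, texts, labels, cv=5)
print(scores.mean())
```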
Challenges:
- Blank documents with no data, for which the target class cannot be determined.
- Scanned documents (originally paper based but converted to electronic form).
- Images in pdf documents
- Identifying respective sections and key elements for all files is a time-consuming activity.
Conclusion/Final Thoughts:
Text data is, by its very nature, unstructured and highly complex to analyse.
Moreover, the business problem here was not to extract entities (company names, drug names, chemicals, etc.) but to map documents to the relevant section folders in the application.
This made the problem very challenging, since there are more than 400 sections (target classes) for the US region across all submission modules (M1–M5).
Deep learning / NLP approaches would be worth exploring for a problem of this scale.
Resources:
GitHub repo (with code):
https://github.com/itznish/ECTD-Document-Classification-using-sklearn-and-ML