Thesis Open Access
Sreenivasa Rao Vuda (PhD); Ejegu Tefera (MSc)
The healthcare industry is a data-intensive and sensitive sector that generates massive amounts of data in a variety of formats. Managing these data and extracting meaningful information from them requires serious attention to analytic techniques and modern tools in order to enhance the quality of service and reduce cost. Stroke is one of the leading causes of death worldwide, so early detection and continuous monitoring can reduce the mortality rate. However, the exponential growth of data from sources such as medical records, patient histories, streaming (real-time) systems, and wearable sensor devices has become a major challenge for conventional techniques, making it difficult to perform advanced analytics, including prediction, and to generate the right insights for better decisions. The combination of big data analytics and machine learning is an advanced technology that can have a significant impact on the healthcare sector, especially on the early detection of stroke, and it can be both less expensive and more powerful. To overcome this challenge, this study proposes a healthcare data analytics framework for stroke disease prediction based on Apache Spark. The framework is implemented using Apache Spark, a leading platform that offers fast, large-scale distributed computing for both batch and streaming data through in-memory computation. We implemented four scalable algorithms in Spark ML (Decision Tree, Random Forest, Gradient Boosting Tree, and Logistic Regression) using a stroke healthcare dataset collected from the Medical Quality Improvement Consortium (MQIC) database, in consultation with cardiologists from local hospitals, to analyze and predict stroke disease.
The stroke data analytics was performed on a cluster with one master node and two worker nodes, and the models were evaluated and compared using performance metrics such as the confusion matrix and the area under the curve (AUC). In the experiments, Decision Tree performed best, with an accuracy of 94.3% and an AUC of 99%; diabetes was identified as the major risk factor for stroke, followed by hypertension. This study shows that Apache Spark, with its scalable machine learning techniques, can be used efficiently to model and predict stroke disease and to identify risk factors early. The results can serve as clinical decision support to help physicians make more consistent diagnoses of stroke disease.
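The two evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the thesis's Spark ML code: the function names, the 0.5 decision threshold, and the eight toy labels and scores are all assumptions made up for illustration, and the AUC here is the ROC AUC computed with the pairwise (Mann-Whitney) formulation.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = stroke and 0 = no stroke."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def auc(y_true: List[int], scores: List[float]) -> float:
    """ROC AUC as the probability that a random positive case receives a
    higher score than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the MQIC dataset.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("confusion matrix (tp, fp, tn, fn):", (tp, fp, tn, fn))
print("accuracy:", accuracy(tp, fp, tn, fn))
print("AUC:", auc(y_true, scores))
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why a model can report a higher AUC than accuracy, as the figures above do.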
DESIGNING HEALTHCARE DATA ANALYTICS FRAMEWORK BASED ON BIG DATA APPROACH IN CASE OF.pdf