Pediatric Big Data Analysis


Southern Illinois University School of Medicine’s (SIUSOM) mission is to assist the people of central and southern Illinois in meeting their health care needs through education, patient care, research, and service to the community. The last priority is a relatively new addition to the School of Medicine. Service to the community expands the medical school’s goals beyond the traditional confines of clinics and medical schools and requires our physicians, medical students, and researchers to improve health by understanding the populations we serve.

SIUSOM serves 66 counties in central and southern Illinois. This service region constitutes over 25,000 square miles and over 2 million people. To better serve the community, SIUSOM must better understand our region, and this starts with better understanding the data and information of our own patients. SIUSOM prides itself on improving our systems of care, and each of our sites routinely participates in quality improvement to improve the way we deliver health care services.

However, SIUSOM has never examined these data in aggregate to determine whether patterns exist that correlate specific demographic information with disease and disease outcomes. This type of data analytics could help us link the prevalence of certain diseases to specific communities or clusters of individuals, which could substantially change the way we treat disease. SIUSOM seeks a partnership with the University of Illinois – Springfield (UIS) to better understand our clinical data through the use of big data analytics.

Big data analytics is the process of examining large data sets containing a variety of data types to uncover hidden patterns, unknown correlations, market trends, and similar information. Big data is not only about size; it is about finding insight in complex, heterogeneous, noisy, and voluminous data. In the medical sector, analytical findings can help improve diagnostic and treatment capabilities, target and screen populations at higher risk for certain diseases, and implement more proactive community measures to decrease disease exposure.

The healthcare industry collects a large amount of data on a daily basis, including data related to patient clinical visits (such as Electronic Medical Records and Medical Imaging), patient profiles, billing, pharmaceutical and medical products, and patient behavior. However, compared to many other industries, the health care industry has been slower in adopting data analytics to derive new business models, improve care and efficiency, and reduce cost [1] [2]. McKinsey [3] identifies three major areas in healthcare that can benefit substantially from big data analysis: 1) Clinical Operations: to identify optimal treatments and interventions specific to patient characteristics and symptoms, 2) Research and Development: to analyze clinical trials and patient records to predict harmful side effects of a drug before it reaches the market, or to offer personalized medicine according to a patient’s specific genetic variation, and 3) Public Health: to analyze disease patterns and predict disease outbreaks.

There is an active research movement working on leveraging patient data to answer clinical questions [4] including prediction of patient readmission [5] [6] [7], using data stream mining to make real-time diagnosis and treatment decisions [8], and analyzing disease patterns to track disease outbreaks [9]. The focus of the research proposed here is to use clinical data together with patient profile to understand the clinical needs of the pediatric population in regions served by SIUSOM.

A variety of methods have been used to perform big data analytics on medical data, each for different aims. Chief among them is the self-organizing map (SOM), an important type of artificial neural network (ANN) for data analytics and classification [10]. A SOM is trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map. Training applies competitive learning and uses a neighborhood function to preserve the topological properties of the input space.
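To make the competitive and neighborhood steps concrete, the following is a minimal sketch of SOM training in plain Python. It is an illustration under stated assumptions, not the implementation planned for this project: the function names `train_som` and `best_matching_unit`, the rectangular grid, the Gaussian neighborhood, and the linear decay schedules are all choices made here for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((wi - xi) ** 2 for wi, xi in zip(weights[node], x)),
    )


def train_som(samples, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular self-organizing map on numeric samples.

    Hypothetical helper for illustration; returns {(row, col): weight_vector}.
    """
    rng = random.Random(seed)
    dim = len(samples[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # One weight vector per map node, initialised from randomly chosen samples.
    weights = {
        (r, c): list(rng.choice(samples)) for r in range(rows) for c in range(cols)
    }
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        for x in order:
            # Learning rate and neighborhood radius shrink as training proceeds.
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            # Competitive step: the closest node wins.
            bmu = best_matching_unit(weights, x)
            # Neighborhood step: the winner and nearby grid nodes move toward x.
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, each record can be assigned to its best-matching node with `best_matching_unit`, which yields the low-dimensional map position described above. In practice the input features would need to be numerically encoded and scaled first.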

Data clustering, or cluster analysis, is an important field in the machine intelligence community with numerous applications in data analytics. Generally, clustering groups a set of N samples into C clusters whose members are similar in some sense. This similarity between samples is expressed either as a suitable distance computed from numeric attributes or directly as pair-wise similarity or dissimilarity measurements [11]. Data clustering, especially with the k-means algorithm, is widely used in the health care setting and can run on large data sets [12]. Used together, these techniques will provide us with the computing tools to run our analysis and bring insight into the SIU Pediatric data.
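For reference, the k-means objective and the seeding rule of k-means++ [12] can be written as follows, using N samples and C clusters as above. The symbols S_c (the set of samples assigned to cluster c), \mu_c (its centroid), and D(x) (the distance from x to the nearest center chosen so far) are notation introduced here for illustration.

```latex
% k-means minimizes the within-cluster sum of squared distances
J = \sum_{c=1}^{C} \sum_{x_i \in S_c} \lVert x_i - \mu_c \rVert^2 ,
\qquad
\mu_c = \frac{1}{\lvert S_c \rvert} \sum_{x_i \in S_c} x_i

% k-means++ seeding: the first center is drawn uniformly at random;
% each further center is drawn from the samples with probability
P(x_i) = \frac{D(x_i)^2}{\sum_{j=1}^{N} D(x_j)^2}
```

The seeding rule favors samples that are far from the centers already chosen, which tends to spread the initial centers across the data before the usual assign-and-update iterations begin.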

Research Purpose and Goals

The collaboration between SIUSOM and UIS seeks to use big data analytics to identify significant patterns in the diagnosis and treatment of patients across SIUSOM’s service region of central and southern Illinois. To reach this goal, however, the collaborators understand that we must first start with a smaller population in order to refine our statistical methods and algorithms for uncovering the information and patterns that we seek.

The partnership plans to complete three specific aims:

  1. Design computer-based algorithms to find specific correlations between demographic information and disease unique to the pediatric patients that are served by SIUSOM.
  2. Concentrate on linkages between socioeconomic status, racial/ethnic demographics, zip code, and other demographic factors and specific pediatric diseases. Specific pediatric diseases include: failure to thrive, ADHD, asthma, obesity, developmental delay, cancer, and behavioral problems.
  3. Analyze these linkages in a manner that allows clinician researchers to reexamine our approaches and treatment plans on the patients that we serve.


Data Source

The initial dataset to be examined in this analysis consists of clinical visits to the Department of Pediatrics at SIUSOM over multiple years. The dataset includes patient demographic identifiers, such as age, ethnicity, zip code, employment, and reason for visit, as well as information on diagnosis, billing code, and insurance.

The initial goal of the data analysis is to discover hidden patterns and correlations between patient demographic identifiers and patient clinical visit information.

Project Timeline

The initial data analytics will be performed in five steps:

  1. Data Exploration and Cleaning: In this step, we will obtain IRB approval from SIUSOM and UIS as necessary. Multiple sources of patient data will be acquired from SIU and then integrated, including patient demographics, clinical diagnoses, prescriptions, insurance, and clinical notes. Data will be cleaned and reformatted for analysis by removing invalid records. Basic descriptive statistics will be used to examine the characteristics of the data. In addition, missing values will be handled through complete case analysis or imputation.
  2. Data Reduction and Clinical Feature Extraction: The patient data source is very high dimensional. The demographic data alone has about 200 identifiers, including some free-text variables. In this step, a dimensionality reduction method, such as a Self-Organizing Map (non-linear) or Principal Component Analysis (linear), will be used to discover hidden structures in the patient data and extract a reduced set of features. The result of this step will be a feature map that presents a low-dimensional visualization of the high-dimensional clinical visit and demographic identifiers.
  3. Data Clustering: In this step, a standard clustering technique, such as k-means with k-means++ seeding [12], will be used to perform two layers of clustering on the reduced set of features. In the first layer, the data are clustered based on clinical visit identifiers such as the reason for visit, diagnosis, and prescription. This will group together patients with similar clinical visits. Each group obtained from the first layer is then re-clustered into sub-groups, this time based on patient demographic identifiers such as gender, age, and zip code. The two-layer clustering process is illustrated in Figure 1.
     [Figure 1: The two-layer clustering process.]
  4. Factors Analysis and Rule Mining: Once the two-layer clustering is performed, the significant factors within each outer cluster and all its sub-clusters can be derived using an automatic cluster labeling method. The significant factors of each cluster are the features whose values differ significantly from those in other clusters. One can then draw associations between the significant features of an outer cluster (i.e., a clinical visit cluster) and those of its inner sub-clusters (the demographic sub-clusters). For example, one may draw an association between patients who have recurring ear infections and children under 2 who attend a specific day care. Such associations can be further validated through standard statistical tests.
  5. Integrating Big Data Findings with Clinical Care: Once the final associations have been identified, the team will examine how the patterns found through big data analytics coincide with the clinical practice of pediatricians at SIUSOM. The team will ask questions such as: 1) Are we appropriately diagnosing diseases based on geography and demographics? 2) Are we incorporating appropriate treatment methods to best treat these diseases? We expect this analysis to help SIU Pediatrics better diagnose and treat diseases afflicting its pediatric population.
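The two-layer clustering described in the Data Clustering step can be sketched as follows. This is a minimal illustration in plain Python, assuming the clinical and demographic features have already been reduced and numerically encoded; the function names `kmeans_pp` and `two_layer_clustering`, the parameters `k_outer` and `k_inner`, and the choice to reuse one seed for both layers are assumptions made for this example, not part of the project design.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp(points, k, iters=50, seed=0):
    """k-means with k-means++ seeding; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Pick the next seed with probability proportional to D(x)^2.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # fewer than k distinct points: stop seeding early
            break
        centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def two_layer_clustering(clinical, demographic, k_outer, k_inner, seed=0):
    """Cluster on clinical features, then re-cluster each group on demographics.

    Returns one (outer_label, inner_label) pair per patient record.
    """
    outer = kmeans_pp(clinical, k_outer, seed=seed)
    result = [None] * len(clinical)
    for g in sorted(set(outer)):
        idx = [i for i, lab in enumerate(outer) if lab == g]
        # A group cannot be split into more sub-groups than it has members.
        inner = kmeans_pp([demographic[i] for i in idx], min(k_inner, len(idx)), seed=seed)
        for i, lab in zip(idx, inner):
            result[i] = (g, lab)
    return result
```

For example, calling `two_layer_clustering(clinical, demographic, k_outer=2, k_inner=2)` on two equally long lists of feature vectors returns a pair of labels for each record: the clinical visit cluster and the demographic sub-cluster within it. Inner labels are only meaningful within their own outer cluster. The number of clusters at each layer would have to be chosen from the data, for instance by comparing cluster quality across several values.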


The timetable is given in Table I:

[Table I: Project timetable.]

Expected Results

Contribution to New Knowledge

To our knowledge, big data analytics has never been carried out on any segment of the central and southern Illinois population. This analysis could reveal how disease affects specific communities and clusters of individuals in our region in ways that have not previously been uncovered. Such findings would add new knowledge on the prevalence of disease in our region and on the benefit, or lack thereof, of certain treatment modalities.

Further, big data analytics in the United States has been used primarily in large urban academic centers, with little concentration on small cities or rural populations. The information and patterns that our group potentially uncovers in our pediatric population could help other rural areas across the country reexamine their disease prevalence and treatment modalities to improve the health of communities across the United States.

Benefit to External Communities

SIUSOM’s interest in big data analytics is purely to better understand the communities and people that we serve. The knowledge that we uncover will be strictly dedicated to better categorizing and identifying disease and adapting our treatment modalities to best serve the health of patients. This change in disease diagnosis and management may even require extending direct health care services beyond the clinical and hospital settings into the communities themselves. SIUSOM’s mission to better serve the community requires a new focus on population health strategies. Big data analytics provides the information necessary for that focus to be achieved.

Applications for External Funding

The hope of this pilot project concentrating on SIU Health Care pediatric data is that the information that we obtain will give us the background information and preliminary computer algorithms to seek increased external funding. The patterns and information that we uncover directly for children would only be magnified when we incorporate the data and information from family doctors, general internal medicine doctors, medical subspecialists, and surgeons.

The collaborators believe that the information when collected and analyzed will be sought after by hospitals, public health departments, and physician groups across central and southern Illinois. This attention to our work may bring forward external funding from private hospital systems, private physician groups, and their foundations. Dr. Vohra has already had conversations with the Illinois Hospital Association about using the pilot data information for a more robust partnership with this organization.

Benefits to Professional Development

The results of this study could strongly benefit the professional development of both SIU and UIS faculty. Dr. Vohra is a pediatrician with a background in law and public policy. The data uncovered in this study will help him adapt existing projects and create new ones for SIU’s Children and Families Population Health and Policy Program. Dr. Khorasani is an Assistant Professor of Computer Science with expertise in Computational Intelligence and Big Data Analysis. Dr. Guo is an Assistant Professor of Computer Science with expertise in Medical Image Processing and Machine Learning. The study will benefit Dr. Khorasani and Dr. Guo in their professional and scholarly development and motivate student involvement in research and scholarship. Ms. Fogleman will gain knowledge in Big Data Analysis, which furthers her education in bioinformatics, in which she will obtain a master’s degree in May 2016. Ms. Fogleman’s position in the Population Health Science department will also benefit as she gains more experience with different types of population health studies.

Anticipated Intellectual Products

The SIU and UIS researchers plan to use the information that we obtain and its corresponding analysis to create intellectual products. The research collaborators see a number of opportunities to present our findings at national conferences focused on environmental health and computer analytics. Dr. Vohra recently presented a program on asthma at the Centers for Disease Control and Prevention’s Environmental Hazards and Health Effects Conference, and this work would be a strong addition to that conference. The information could also be presented at forums and conferences dedicated to the health of the central and southern Illinois region, rural health, and public health. In addition, the findings can be presented at a prestigious data analytics conference, such as the IEEE International Conference on Big Data or the IEEE International Conference on Biomedical and Health Informatics.

Furthermore, the research partners plan to publish our work in journals dedicated to environmental health, rural health, public health, and computer science. Possible venues for submission include the Journal of Environmental Health, Journal of Public Health, and Journal of Big Data Research.


The collaboration between SIU and UIS has the ability to make a real and profound impact on our understanding of the health of children in central and southern Illinois. We plan to use our information to better understand disease, its outcomes, and the best treatment to help improve the health of SIU’s service region. Big data analytics will help SIUSOM, with its great partner UIS, to better serve the community of central and southern Illinois.


  1. W. Raghupathi and V. Raghupathi, "Big data analytics in healthcare: promise and potential," Health Information Science and Systems, 2014.
  2. J. Archenaa and E. M. Anita, "A Survey of Big Data Analytics in Healthcare and Government," Procedia Computer Science, pp. 408-413, 2015.
  3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh and A. H. Byers, "Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, 2011.
  4. M. Herland, "A review of data mining using big data in health informatics," Journal of Big Data, pp. 1-35, 2014.
  5. I. Ouanes, C. Schwebel, A. F. C. Bruel and F. Philippart, "A model to predict short-term death or readmission after intensive care unit discharge," Prognosis and Outcomes, p. 422.e1–422.e9, 2012.
  6. A. Campbell, J. Cook, G. Adey and B. Cuthbertson, "Predicting death and readmission after intensive care discharge," British Journal of Anaesthesia, 2008.
  7. A. Fialho, F. Cismondi, S. Vieira, S. Reti, J. Sousa and S. Finkelstein, "Data mining using clinical physiology at discharge to predict ICU readmissions," Expert Systems with Applications, p. 13158–13165, 2012.
  8. Y. Zhang, S. Fong, J. Fiaidhi and S. Mohammed, "Real-time clinical decision support system with data stream mining," Journal of Biomedicine and Biotechnology, 2012.
  9. S. I. Hay, D. B. George, C. L. Moyes and J. S. Brownstein, "Big Data Opportunities for Global Infectious Disease Surveillance," PLoS Medicine, vol. 10, no. 4, 2013.
  10. T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59-69, 1982.
  11. Y. Guo and A. Sengur, "NCM: Neutrosophic c-means clustering algorithm," Pattern Recognition, vol. 48, no. 8, pp. 2710-2724, 2015.
  12. D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, 2007.