Data-driven English & Spanish Text Simplification for Healthcare Using Corpus Statistics, Predictive Analytics, Machine Learning and User Studies.

Introduction: Limited health literacy is a barrier to accessing and understanding the health information that can empower patients, consumers and caregivers to monitor and manage their health, translate information into actionable and healthy behaviors, support decision-making, and guide healthcare consumption. Simplifying text can reduce this barrier and, potentially, reduce known disparities in health, but few tools exist to support writing simplified text with demonstrated impact on comprehension.

Approach: We use machine learning and predictive analytics to discover features indicative of difficult English and Spanish texts. We develop the necessary algorithms for semi-automated translation into simpler text. Our algorithms focus on lexical components, grammar, cohesion and presentation. We use data-driven techniques: evidence-based approaches to discover the features and user testing to evaluate the impact of simplification.
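As a rough illustration of what a lexical difficulty feature might look like, the sketch below flags words that are rare in a frequency table and looks up simpler alternatives. It is not the project's algorithm: the frequency values, the threshold, the substitution list, and the function names are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical toy frequency table (occurrences per million words).
# A real system would derive these counts from a large corpus.
TOY_FREQUENCIES = {
    "the": 50000.0, "doctor": 120.0, "found": 300.0, "high": 400.0,
    "blood": 150.0, "pressure": 90.0, "hypertension": 2.0,
}

# Hypothetical substitution list, for illustration only.
TOY_SIMPLER_TERMS = {"hypertension": "high blood pressure"}


def flag_difficult_words(text, frequencies, threshold=10.0):
    """Return the distinct lowercase words whose frequency is below threshold.

    Words missing from the table are treated as frequency 0 and are flagged.
    """
    words = re.findall(r"[a-záéíóúüñ]+", text.lower())
    flagged = []
    for word in words:
        if frequencies.get(word, 0.0) < threshold and word not in flagged:
            flagged.append(word)
    return flagged


def suggest_replacements(flagged, simpler_terms):
    """Map each flagged word to a simpler alternative when one is listed."""
    return {word: simpler_terms[word] for word in flagged if word in simpler_terms}


sentence = "The doctor found hypertension"
flagged = flag_difficult_words(sentence, TOY_FREQUENCIES)
suggestions = suggest_replacements(flagged, TOY_SIMPLER_TERMS)
print(flagged)
print(suggestions)
```

The suggestions are returned rather than applied automatically, which fits the semi-automated setting described above: a writer decides whether each replacement is appropriate.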

Funding Sources:

PI: G. Leroy, National Institutes of Health / National Library of Medicine (NIH/NLM), Evidence-based Strategy and Tool To Simplify Text for Patients and Consumers, $1.4M, 2015-2019 (R01LM011975).

PI: G. Leroy, National Institutes of Health / National Library of Medicine (NIH/NLM), Large-scale Evaluation of Text Features Affecting Perceived and Actual Text Difficulty, $132,300, 2011-2013.

PI: G. Leroy, with J. Cowie and S. Helmreich (New Mexico State University), National Science Foundation (NSF), U3-Understanding User Understanding, $99,988, 2007-2008.

Autism Surveillance through Text mining and Development of NLP for Mental Health.

Introduction: The University of Arizona is a partner in the ongoing surveillance of autism spectrum disorders and intellectual disability and monitors children aged 8 in Maricopa County. The surveillance is based on review of clinical and health records by qualified clinicians.

Approach: We are developing NLP algorithms to identify and extract signs and symptoms of autism and match them to the correct DSM diagnostic criteria. We leverage manually created lexicons as well as word embeddings combined with a new parser. Our first parser is entirely rule-based (with high precision but lower recall), and we are looking into automating the creation of its rules. We focus on creating human-interpretable models to annotate EHR with DSM diagnostic criteria and on machine learning for autism spectrum disorder (ASD) case assignment.

We are also using a variety of algorithms for ASD case status assignment: many classic algorithms provide good results, and deep learning algorithms are being tested for potential improvements.
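To make the idea of a rule-based, lexicon-driven annotator concrete, here is a minimal sketch. It is not the project's parser: the lexicon phrases, the category labels, the negation cues, and the function name are hypothetical, and the labels are generic placeholders rather than actual DSM criteria. Each annotation is a plain (phrase, label) pair, so a reviewer can trace every label back to the text that triggered it.

```python
import re

# Hypothetical lexicon mapping phrases to placeholder category labels.
# These are illustrative only and are not actual DSM criteria.
TOY_LEXICON = {
    "poor eye contact": "social_communication",
    "does not respond to name": "social_communication",
    "hand flapping": "repetitive_behavior",
}

# Hypothetical negation cues; a match before a phrase suppresses the annotation.
NEGATION_PATTERN = re.compile(r"\b(no|denies|without)\b")


def annotate(record_text, lexicon):
    """Return (phrase, label) pairs for non-negated lexicon phrases.

    The rules are deliberately strict: exact phrase match, one sentence at a
    time, skipped if a negation cue appears earlier in the same sentence.
    Strict rules of this kind tend to favor precision over recall.
    """
    annotations = []
    for sentence in re.split(r"(?<=[.!?])\s+", record_text.strip()):
        lowered = sentence.lower()
        for phrase, label in lexicon.items():
            position = lowered.find(phrase)
            if position == -1:
                continue
            if NEGATION_PATTERN.search(lowered[:position]):
                continue
            annotations.append((phrase, label))
    return annotations


record = "Child shows poor eye contact. Mother reports no hand flapping."
result = annotate(record, TOY_LEXICON)
print(result)
```

Exact matching misses paraphrases such as "avoids looking at others", which is one reason a lexicon might be complemented with word embeddings.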

Funding Sources:

PI: G. Leroy, "EHR Gold Standard Dataset Creation for Autism Spectrum Disorders Surveillance Project", Eller Center for Management Innovations in Healthcare, $5k, 2017.

PIs: M. Kurzius-Spencer, S. Pettygrove, Arizona Developmental Disabilities Surveillance Program (ADDSP), CDC, $2M, 2015-2019.


Creating Language Resources for Occupational Medicine using Text Mining

We are evaluating how natural language processing and text mining of social media text can contribute to the creation of data and language resources for occupational medicine. We use Twitter, public and private Electronic Health Records (EHR), and existing data sources (e.g., UMLS) to evaluate and improve the current state of these resources.


Social Media

We are using data mining algorithms to predict the gender and age of Twitter users and use these predictions to evaluate different tobacco industry targeting techniques.

Funding Sources:

Eller College Small Research Grants, $2,000, 2016.