Data-driven Text Simplification Guidelines

Get the table with references here.

Each entry below pairs an existing guideline with our new, data-driven guideline and a comment comparing it with other work.

Readability Formulas:

No correlation found with actual comprehension or difficulty; only a correlation with perceived difficulty.

Word length

Term Frequency (E/S)

A stand-in for term difficulty: for English, the Google Web Corpus [1] works well [2-4]; for Spanish, LexEsp [5] was helpful [6].
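As a rough illustration of frequency as a stand-in for difficulty, the sketch below flags words whose corpus frequency falls below a cutoff. The counts and the 1,000,000 threshold are made up for the example; a real system would draw them from a resource such as the Google Web Corpus.

```python
# Hypothetical corpus frequencies; a real system would load these
# from a large resource such as the Google Web Corpus.
FREQ = {
    "heart": 120_000_000,
    "attack": 95_000_000,
    "myocardial": 400_000,
    "infarction": 350_000,
}

def difficult_terms(words, threshold=1_000_000):
    """Flag words whose corpus frequency falls below the cutoff.
    Unknown words are treated as rare (frequency 0)."""
    return [w for w in words if FREQ.get(w.lower(), 0) < threshold]

print(difficult_terms(["heart", "attack"]))           # common, nothing flagged
print(difficult_terms(["myocardial", "infarction"]))  # rare, both flagged
```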

Sentence length

Short sentences (E/S)

Good connectors between sentences are important for flow [7].

Grammar Frequency (E/S)

A stand-in for grammar difficulty of a sentence [8], but specific guidance is needed on how to simplify [9]. Spanish uses more varied structures [10].
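A minimal sketch of the idea behind grammar frequency: score a sentence's POS-tag trigrams against counts gathered from simple text. The tag sequences and the averaging scheme below are illustrative assumptions, not the published feature.

```python
from collections import Counter

# Hypothetical POS-tagged corpus of simple sentences; real tag
# sequences would come from parsing a large corpus.
CORPUS_TAG_SEQUENCES = [
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["DT", "NN", "VBZ", "JJ"],
    ["PRP", "VBP", "DT", "NN"],
]

def trigram_counts(sequences):
    """Count POS trigrams across the corpus."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[tuple(seq[i:i + 3])] += 1
    return counts

def grammar_familiarity(tags, counts):
    """Average corpus count of the sentence's POS trigrams; higher
    means a more familiar, and presumably easier, structure."""
    trigrams = [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]
    if not trigrams:
        return 0.0
    return sum(counts[t] for t in trigrams) / len(trigrams)

counts = trigram_counts(CORPUS_TAG_SEQUENCES)
print(grammar_familiarity(["DT", "NN", "VBZ", "DT", "NN"], counts))
```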

Grammar Rules

Grammar rules were extracted using a parallel corpus with expert simplifications [11].

Plain language:


Difficult to judge what is plain. Plain is not necessarily simple.

Simple words

Term Frequency (E/S)

See above.

Use more verbs and function words (E/S)

Simple text contains more verbs and function words and fewer nouns (for other parts of speech, see the papers) [10].
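One way to sketch this as a measurable feature: compute the fraction of verbs, nouns, and function words over POS-tagged tokens. The tag groupings and the hand-tagged example are assumptions for illustration; a real pipeline would use a POS tagger.

```python
from collections import Counter

# Penn Treebank-style tags chosen for illustration.
FUNCTION_TAGS = {"DT", "IN", "CC", "PRP", "PRP$", "TO", "MD"}

def pos_ratios(tagged_tokens):
    """Fraction of tokens that are verbs, nouns, and function words."""
    groups = Counter()
    for _, tag in tagged_tokens:
        if tag.startswith("VB"):
            groups["verb"] += 1
        elif tag.startswith("NN"):
            groups["noun"] += 1
        elif tag in FUNCTION_TAGS:
            groups["function"] += 1
    total = len(tagged_tokens)
    return {g: groups[g] / total for g in ("verb", "noun", "function")}

sentence = [("the", "DT"), ("doctor", "NN"), ("checks", "VBZ"),
            ("your", "PRP$"), ("heart", "NN")]
print(pos_ratios(sentence))
```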

Definition Creation (E/S)

Based on word morphology, new definitions are suggested in English and Spanish [12].

Definition Insertion

The use of parentheses is not straightforward: it is unclear whether the difficult term or the easy explanation should go between the parentheses [13].
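The two placement options can be sketched as simple string substitutions; the sentence and the term/explanation pair below are invented for illustration.

```python
def keep_term(sentence, term, easy):
    """Option A: keep the difficult term, put the easy explanation
    in parentheses after it."""
    return sentence.replace(term, f"{term} ({easy})")

def keep_explanation(sentence, term, easy):
    """Option B: use the easy explanation, put the difficult term
    in parentheses."""
    return sentence.replace(term, f"{easy} ({term})")

s = "The patient suffered a myocardial infarction."
print(keep_term(s, "myocardial infarction", "heart attack"))
print(keep_explanation(s, "myocardial infarction", "heart attack"))
```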

Term specificity and ambiguity

Highly technical terms are more difficult [14]. This is not implemented as a separate feature in our tool: based on sampling, frequency (a feature discovered later) takes care of this problem.

Short sentences

Sentence length cutoff point

Based on Wikipedia comparisons, most simplified texts use shorter sentences [15].
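A sketch of a length-based check, assuming a hypothetical 13-word cutoff and naive splitting on periods (real sentence segmentation is harder, and the actual cutoff would be tuned on corpus comparisons):

```python
CUTOFF = 13  # hypothetical cutoff, chosen only for illustration

def long_sentences(text, cutoff=CUTOFF):
    """Return sentences whose word count exceeds the cutoff.
    Sentence splitting here is a naive split on periods."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if len(s.split()) > cutoff]

text = ("Short sentences are easy. This sentence, by deliberate design and "
        "for no reason other than illustration, keeps going until it is far "
        "too long for comfortable reading.")
flagged = long_sentences(text)
print(flagged)
```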

Other Advice:



Short noun phrases

Only noun phrases containing 4 or more nouns should be split up.

Phrasing may be more important than length. Splitting noun phrases was only effective for very long phrases, when certain function words were used, and when the result sounded more natural to a native speaker [16].
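The noun-count rule can be sketched as follows. The POS tags are written by hand here; a real pipeline would obtain them from a tagger.

```python
# Penn Treebank noun tags.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def should_split(tagged_phrase, min_nouns=4):
    """Flag a noun phrase for splitting only when it contains
    min_nouns or more nouns."""
    nouns = sum(1 for _, tag in tagged_phrase if tag in NOUN_TAGS)
    return nouns >= min_nouns

short_np = [("blood", "NN"), ("pressure", "NN")]
long_np = [("emergency", "NN"), ("room", "NN"),
           ("patient", "NN"), ("discharge", "NN"), ("summary", "NN")]
print(should_split(short_np), should_split(long_np))
```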

No double negative

Simplified negation

Three types of negation are identified by our parser and online editor [17].
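A very rough regex sketch of the three categories; the real NegAIT parser relies on morphological analysis and parsing, not regexes, and the crude prefix check below over-matches (e.g., "inside" would be flagged).

```python
import re

SENTENTIAL = re.compile(r"\b(not|no|never|neither|nor)\b", re.I)
# Crude negative-prefix check; over-matches on words like "inside".
MORPHOLOGICAL = re.compile(r"\b(?:un|in|im|il|ir|dis|non)[a-z]{3,}\b", re.I)

def negation_types(sentence):
    """Return which of the three negation categories appear.
    'Double' here means two or more negations in one sentence."""
    found = set()
    sentential = SENTENTIAL.findall(sentence)
    morphological = MORPHOLOGICAL.findall(sentence)
    if sentential:
        found.add("sentential")
    if morphological:
        found.add("morphological")
    if len(sentential) + len(morphological) >= 2:
        found.add("double")
    return found

print(negation_types("The result is not illogical."))
```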

Logical content.

Limit mixing of different topics

Topics are represented by lexical chains. Several features based on lexical chains matter; however, crossing lexical chains is the most important [18].
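A toy version of the idea, assuming a "chain" is simply the list of sentence indices where a repeated word occurs, and that two chains cross when their sentence spans overlap. Real lexical chains also group synonyms and related terms.

```python
from collections import defaultdict

def lexical_chains(sentences):
    """Map each repeated word to the sentence indices where it occurs;
    words that occur only once form no chain."""
    positions = defaultdict(list)
    for i, sent in enumerate(sentences):
        for word in set(sent.lower().split()):
            positions[word].append(i)
    return {w: idx for w, idx in positions.items() if len(idx) > 1}

def chains_cross(a, b):
    """Two chains cross when each starts before the other ends,
    i.e., their sentence spans overlap."""
    return a[0] <= b[-1] and b[0] <= a[-1]

sents = ["diabetes affects insulin",
         "insulin controls sugar",
         "diabetes raises sugar"]
chains = lexical_chains(sents)
print(chains)
print(chains_cross(chains["diabetes"], chains["insulin"]))
```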

Missing grammatical elements

Make connections in text explicit (E/S) (audio) (paper ready for submission).




Training required.

No concrete suggestions

Online editor

Our tool combines our feature identification and translation algorithms [19, 20].

Suggestions of next words

Automation – Sentence Completion

Algorithms use deep learning to suggest simple words during the writing process [21].
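A deep-learning model will not fit in a short sketch, so the example below swaps in a bigram frequency model that illustrates the same suggestion interface: given the previous word, propose the most likely continuations. The corpus of simple sentences is made up.

```python
from collections import Counter, defaultdict

# Hypothetical corpus of simple sentences.
SIMPLE_CORPUS = [
    "the doctor checks your heart",
    "the doctor checks your blood",
    "the nurse checks your blood pressure",
]

def build_bigrams(corpus):
    """Count which word follows which across the corpus."""
    bigrams = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            bigrams[prev][nxt] += 1
    return bigrams

def suggest(prev_word, bigrams, k=2):
    """Return up to k most frequent continuations of prev_word."""
    return [w for w, _ in bigrams[prev_word].most_common(k)]

bigrams = build_bigrams(SIMPLE_CORPUS)
print(suggest("your", bigrams))
```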

Personal Health Information: Deidentification Clean Up

This program cleans up the output from three de-identification algorithms commonly used for EHRs.

The details on patterns recognized and replaced are available from a poster submitted to AMIA Fall Symposium 2020.

Text Simplification Editor

A first version of our text simplification tool is now available. This version includes lexical simplification, negation detection (you'd be surprised how often double negation appears, e.g., "not illogical"), grammar feedback, and a topic visualization section; a few remaining issues still need fixing. All simplification suggestions are backed by user studies.

How it works:

  • Copy and paste your text and click 'simplify'. You will get suggestions that you can use.
  • The Lexical Chains tab shows how topics are distributed throughout the text. Try to keep the same topics in the same paragraph.

This is a semi-automated simplification editor because, with medical text, we have to ensure the information remains correct, and at the moment only a human can do that.

NegAIT - Free Negation Parser

This negation parser is a Java application that annotates English text for sentential, morphological, and double negation. The input is a plain-text file; the output is in .xml format.

When using this parser, please refer to our paper:

  • P. Mukherjee, G. Leroy, D. Kauchak, S. Rajnarayanan, D. Diaz, N. Yuan, T. Pritchard, and S. Colina, "NegAIT: A New Parser for Medical Text Simplification Using Morphological, Sentential and Double Negation," Journal of Biomedical Informatics, Vol. 69, May 2017, pp. 55–62. [Link to Paper]

After fixing a few issues, performance on the gold standard described in the paper is now:


This parser was developed as part of research supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM011975.  The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Definition Creation: Subsimplify

This tool creates new definitions where WordNet and the UMLS do not provide one. We found that it can add useful definitions for about 30% of the difficult words identified in our text simplification work. These definitions are not 'ready to go' but help content writers simplify the text.
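A toy version of morphology-based definition drafting, using a tiny hand-written affix glossary (the glosses are standard Latin/Greek meanings; the real Subsimplify tool uses much richer morphological resources). As in the tool, the drafts are rough and meant for a writer to polish.

```python
# Tiny hand-written affix glossary for illustration.
PREFIXES = {"hyper": "too much", "hypo": "too little", "anti": "against"}
SUFFIXES = {"emia": "condition of the blood", "itis": "inflammation of"}

def draft_definition(word):
    """Build a rough definition from recognizable prefix/suffix parts;
    return None when nothing is recognized."""
    word = word.lower()
    for prefix, gloss in PREFIXES.items():
        if word.startswith(prefix):
            rest = word[len(prefix):]
            for suffix, sgloss in SUFFIXES.items():
                if rest.endswith(suffix):
                    return f"{gloss} {rest[:-len(suffix)]} ({sgloss})"
            return f"{gloss} {rest}"
    return None

print(draft_definition("hyperglycemia"))
print(draft_definition("table"))
```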

On GitHub:

  • Gold standard used to evaluate the NegAIT negation parser (Excel file).


Tutorial and Presentations

HICSS Tutorial - Evaluation of Artifacts in Design Science, Gondy Leroy, Series on Scientific Inquiry and Research Methods, January 2019.

AMCIS 2017 - Doctoral Consortium Panel Presentation: How to give a good job talk - Your job talk.

RCIS 2019 - IEEE 13th Int. Conf. on Research Challenges in Information Science, May 29-31, Brussels, Belgium. Tutorial: Evaluating Artifacts using Experiments as part of the Design Science Framework. [Slides]




A first prototype version of our text simplification tool is now available. This first version includes only the lexical algorithms and has a few known bugs. We expect the next version to be up by mid-November 2018. (This is a beta version; improvements are coming!)