Optimized negation detection in medical texts
Medical text mining enables the automated analysis of unstructured health data to extract relevant information. A key aspect of this process is the detection of negated findings. By combining rule-based approaches with machine learning methods, negations can be identified reliably.

Negations: a challenge for text mining

Medical text mining refers to the automated analysis and processing of large volumes of unstructured healthcare text in order to extract patterns, relationships, and useful information. Negation detection plays a critical role in this context. During patient care, numerous examinations are performed, and the documentation records both the presence and the absence of symptoms and abnormal findings. A system that ignores negation would therefore record an absent finding as if it were present.

An example of negated findings is shown below:

“Patient shows no signs of meningism, normal pupillary light response, no indication of scleral icterus, oral mucosa and pharynx without irritation or abnormalities, thyroid not enlarged, nausea did not occur.”

Due to the complexity of natural language, additional challenges arise, such as double negations or pseudo-negations. In the following examples, findings are not negated, even though clear indicator words (“unremarkable”, “not”, “excluded”) appear in the same sentence:

“Apart from a mild skin rash, the physical examination was unremarkable. A tumor cannot be reliably excluded. Spleen and liver not palpable due to ascites.”

Negation detection with Apache UIMA Ruta

Rule-based approaches identify these complex text patterns using a large set of rules. For this purpose, we developed a rule language for NLP and released it as open source: Apache UIMA Ruta. It allows text mining tasks to be implemented quickly and efficiently. Using this rule language, many negation patterns can be detected reliably.
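To illustrate how trigger-based rules can handle the examples above, here is a minimal sketch in Python. It is not Apache UIMA Ruta code and not our production rule set: the function name `is_negated` and the three trigger lists are hypothetical and deliberately tiny, chosen only to cover the example sentences in this article.

```python
import re

# Hypothetical, deliberately tiny trigger lists (assumptions for illustration).
PRE_TRIGGERS = ["no signs of", "no indication of", "no", "without"]
POST_TRIGGERS = ["not enlarged", "did not occur"]
# Pseudo-negations contain indicator words but do not negate the finding.
PSEUDO_TRIGGERS = ["cannot be reliably excluded", "not palpable due to"]


def is_negated(sentence: str, finding: str) -> bool:
    """Return True if `finding` looks negated within a single sentence."""
    text = sentence.lower()
    finding = finding.lower()
    pos = text.find(finding)
    if pos < 0:
        return False  # finding is not mentioned at all
    if any(p in text for p in PSEUDO_TRIGGERS):
        return False  # pseudo-negation anywhere in the sentence wins
    # Restrict the search scope to the clause that surrounds the finding.
    before = re.split(r"[,.;]", text[:pos])[-1]
    after = re.split(r"[,.;]", text[pos + len(finding):])[0]
    pre = any(re.search(rf"\b{re.escape(t)}\b", before) for t in PRE_TRIGGERS)
    post = any(re.search(rf"\b{re.escape(t)}\b", after) for t in POST_TRIGGERS)
    return pre or post


sentence = ("Patient shows no signs of meningism, "
            "normal pupillary light response, thyroid not enlarged")
print(is_negated(sentence, "meningism"))
print(is_negated(sentence, "pupillary light response"))
print(is_negated("A tumor cannot be reliably excluded.", "tumor"))
```

Even this small sketch needs separate lists for preceding triggers, following triggers, and pseudo-negations, plus a notion of scope. Covering real clinical language multiplies all of these.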

Machine learning approach for improved results

Experience shows that reliably detecting complex negation patterns requires increasingly complex rules, which makes the rule set harder to maintain. For this reason, we complemented the rule-based approach with a machine learning method. Our team compiled and annotated a large internal training dataset consisting of English and German data. In addition, we incorporated data from the publicly available i2b2 2010 dataset.

For each diagnosis, the machine learning model receives contextual information: the words preceding and following the diagnosis, whether the diagnosis appears in a list, and whether a negation indicator (such as “no” or “not”) is present. The remaining decision process is fully automated. Individual model errors are then addressed through targeted post-processing rules.
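The contextual information described above could be collected roughly as follows. This is a sketch under stated assumptions, not our actual feature pipeline: the names `tokenize` and `extract_features`, the window size of three tokens, the indicator set, and the comma-based list heuristic are all hypothetical.

```python
import re

# Hypothetical indicator set (assumption for illustration).
NEGATION_INDICATORS = {"no", "not", "without"}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def extract_features(tokens, start, end, window=3):
    """Build a feature dict for the diagnosis spanning tokens[start:end]."""
    left = [t.lower() for t in tokens[max(0, start - window):start]]
    right = [t.lower() for t in tokens[end:end + window]]
    features = {}
    # prev_1 is the token directly before the diagnosis, prev_2 the one before that.
    for i, tok in enumerate(reversed(left), 1):
        features[f"prev_{i}"] = tok
    # next_1 is the token directly after the diagnosis.
    for i, tok in enumerate(right, 1):
        features[f"next_{i}"] = tok
    # Crude list heuristic (assumption): a comma inside the context window.
    features["in_list"] = "," in left or "," in right
    features["has_negation_indicator"] = any(
        t in NEGATION_INDICATORS for t in left + right
    )
    return features


tokens = tokenize("Patient shows no signs of meningism, normal pupillary light response")
idx = tokens.index("meningism")
print(extract_features(tokens, idx, idx + 1))
```

Feature dictionaries of this shape can be passed to any classifier that accepts categorical features, for example after one-hot encoding.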

High-accuracy detection of diagnosis negation status

By combining rule-based and machine learning approaches, we were able to increase negation detection performance to an F1 score of 96 percent for English and 95 percent for German. A detailed error analysis further showed that the machine learning approach was also able to identify incorrect annotations in the gold standard.
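For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below shows the calculation on a handful of made-up labels. The function name `f1_score`, the label strings, and the toy data are illustrative assumptions and have no connection to our evaluation data.

```python
def f1_score(gold, predicted, positive="negated"):
    """Compute F1 for the `positive` class from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels only: one negated finding is missed by the prediction.
gold = ["negated", "negated", "affirmed", "negated"]
predicted = ["negated", "affirmed", "affirmed", "negated"]
print(f1_score(gold, predicted))
```

In this toy case precision is 1.0 and recall is 2/3, so the single missed negation already pulls F1 down to 0.8.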
