Analysing the use of frame semantics in extracting NLP-based information from EHR for cancer research

November 30, 2020

A summary and review of the paper “A frame semantic overview of NLP-based information extraction for cancer-related EHR notes” by Surabhi Datta, Elmer V. Bernstam and Kirk Roberts, published in the Journal of Biomedical Informatics

Electronic Health Records (EHR) and other systems now hold a large amount of unstructured, free-text clinical data that is very useful for medical research. However, without a systematic structure, every researcher has to repeat the effort and time of extracting the data before analysing it. To address this, Surabhi Datta and colleagues (the team) reviewed papers on NLP for cancer-related EHR notes, attempted to standardise the information usually extracted from EHR, and proposed a conceptual framework for general-purpose cancer-related NLP tasks.

Overview

The title gives a rough idea of the context of this paper. It contains 4 key elements: 1) frame semantics, 2) NLP-based, 3) cancer-related and 4) EHR notes.

1) Frame semantics: a linguistic theory developed by Charles J. Fillmore. The theory postulates that the meaning of most words should be interpreted based on where the words occur and how they relate to other words. This interpretation process groups the words under a theme, which is the frame. For example, “There is a single lesion in segment 7 measuring 1.5 x 1.9 cm” evokes the frame TUMOR DESCRIPTION, with 3 frame elements present: COUNT (“single”), ANATOMICAL SITE (“segment 7”) and SIZE (“1.5 x 1.9 cm”).

2) NLP-based: only articles containing the keywords ‘natural language processing’ / ‘NLP’ were selected for study.

3) Cancer-related: only articles with ‘cancer’ / ‘tumor’ / ‘oncology’ in the title were selected for study.

4) EHR notes: Electronic Health Record notes are the electronic version of an individual's health-related data (not confined to medical treatment for illness), stored and retrieved by healthcare providers, including doctors and other healthcare professionals, for healthcare-related purposes.
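To make the frame idea from point 1) more concrete, here is a minimal Python sketch of how the TUMOR DESCRIPTION example could be held as a data structure. The class name Frame and the helper element_spans are my own illustrative choices, not something defined in the paper or in any FrameNet tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Frame:
    """A frame evoked by a sentence, with the frame elements found in it.

    Hypothetical container for illustration only; not the paper's schema.
    """
    name: str
    sentence: str
    elements: Dict[str, str] = field(default_factory=dict)

    def element_spans(self) -> Dict[str, Tuple[int, int]]:
        """Return the (start, end) character offsets of each element's text."""
        spans = {}
        for role, text in self.elements.items():
            start = self.sentence.find(text)
            if start == -1:
                raise ValueError(f"{role} text {text!r} not found in sentence")
            spans[role] = (start, start + len(text))
        return spans


# The example sentence and frame elements quoted in point 1)
sentence = "There is a single lesion in segment 7 measuring 1.5 x 1.9 cm"
tumor = Frame(
    name="TUMOR DESCRIPTION",
    sentence=sentence,
    elements={
        "COUNT": "single",
        "ANATOMICAL SITE": "segment 7",
        "SIZE": "1.5 x 1.9 cm",
    },
)

for role, (start, end) in tumor.element_spans().items():
    print(f"{role}: {sentence[start:end]!r} at [{start}, {end})")
```

Keeping the character offsets next to the element text is just one possible design; it makes it easy to point back to the exact place in the note where each element came from.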

Methodology

Besides the aforementioned criteria, the study only considered recent papers, published up to Sep 2018, to capture more recent NLP analysis of cancer information. Additional criteria excluded non-frameable classification such as document-level text classification, since this kind of classification is not considered an information extraction NLP method and is not amenable to frame representation. Papers based on unspecific concept / term extraction were also eliminated. “Unspecific” here refers to phrases extracted without any attempt to connect them to the wider context, for instance, extracting the phrase “breast cancer” without differentiating between an asserted concept (e.g. “has breast cancer”) and a negated one (e.g. “has no breast cancer”). It also refers to unspecific extraction methods, meaning concept recognition or named entity recognition using MetaMap or cTAKES.
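The asserted vs. negated distinction above can be shown with a toy Python sketch. This is only a naive cue-word rule that I wrote for illustration: the cue list, the three-word window and the function name concept_status are all my assumptions, and it is not the method used in the paper or in any of the reviewed systems.

```python
import re

# Illustrative cue list (an assumption, far from complete)
NEGATION_CUES = ("no", "not", "without", "denies", "negative for")


def concept_status(sentence: str, concept: str) -> str:
    """Label a concept as 'asserted', 'negated' or 'not mentioned'.

    A negation cue counts only if it appears within the three words
    immediately before the concept (a deliberately crude window).
    """
    text = sentence.lower()
    idx = text.find(concept.lower())
    if idx == -1:
        return "not mentioned"
    preceding = " ".join(text[:idx].split()[-3:])
    for cue in NEGATION_CUES:
        # Word boundaries stop "no" from matching inside "not" or "known"
        if re.search(rf"\b{re.escape(cue)}\b", preceding):
            return "negated"
    return "asserted"


for note in ("Patient has breast cancer", "Patient has no breast cancer"):
    print(f"{note!r} -> {concept_status(note, 'breast cancer')}")
```

Plain phrase matching would return a hit for both sentences, which is exactly the kind of “unspecific” extraction the study excluded.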

After that, the papers underwent a rigorous review process, and the proposed frames were approved by both a cancer informatics expert (EVB) and a practicing oncologist with informatics experience (FMB) to ensure medical validity.

Proposed Framework

The team consolidated the common information extracted in the selected papers and organised it into frames with frame elements (as shown in the picture above). Frames with a similar purpose share the same colour and are placed next to each other in the chart. The chart starts from the frame CANCER DIAGNOSIS, which is the most commonly extracted frame (with 36 associated articles).

With reference to Berkeley FrameNet, the team used 3 relation types to represent the relations between frames: 1) inherits, 2) an element of and 3) associated with.
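The three relation types can be pictured as typed links between frames. The Python sketch below is only an illustration: the two example links are hypothetical and were made up to show the structure (they are not copied from the paper's chart), and the names Relation, FrameLink and targets_of are my own.

```python
from enum import Enum
from typing import List, NamedTuple


class Relation(Enum):
    """The 3 relation types used between frames."""
    INHERITS = "inherits"
    ELEMENT_OF = "an element of"
    ASSOCIATED_WITH = "associated with"


class FrameLink(NamedTuple):
    source: str
    relation: Relation
    target: str


# Hypothetical links for illustration only; check the paper's chart
# for the relations that were actually proposed.
FRAME_LINKS = [
    FrameLink("BREAST CANCER DIAGNOSIS", Relation.INHERITS, "CANCER DIAGNOSIS"),
    FrameLink("TUMOR DESCRIPTION", Relation.ASSOCIATED_WITH, "CANCER DIAGNOSIS"),
]


def targets_of(source: str, relation: Relation,
               links: List[FrameLink] = FRAME_LINKS) -> List[str]:
    """Return the frames that `source` points to through `relation`."""
    return [link.target for link in links
            if link.source == source and link.relation is relation]


print(targets_of("BREAST CANCER DIAGNOSIS", Relation.INHERITS))
```

A flat list of typed links is enough for a small chart; a graph library would be a reasonable alternative if the number of frames grows.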

Limitations

The framework was constructed from the selected articles, so the chart covers the common cancers that were the research interests of those articles. Some common cancers are still missing, such as lung cancer, lymphoma, kidney cancer and leukemia. Besides, as the authors mention, more investigation is needed to link the frames to existing ontologies like NCI Thesaurus and SNOMED-CT.

Key Takeaways

The study pointed out that a database similar to the Berkeley FrameNet database is essential for the NLP study of clinical notes. Although no database was constructed to implement the framework established in the study, the proposed framework provides an authoritative set of semantic frames for cancer NLP tasks, which is beneficial for future research and for the development of a medical FrameNet database. Even without the database, the frames and the associations between them can be used as a reference for cancer research. The frame elements and the frames can serve as starting points when brainstorming the important data points to pull from EHR for cancer-related research or analysis.

Originally published on cydalytics.blogspot.com

cyda