Text - Noise Removal

Noise is any information in the text that is not needed for the task at hand. Noise removal is a filtering step that discards it so that only the useful text remains.

Remove Text File Headers and Footers

Input files such as PDF or DOCX often carry page headers and footers. Since these do not help natural language processing, we remove them.
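One simple way to do this, sketched below, is to treat the first and last line of each page as a candidate and drop it when the same line recurs on most pages. This is a minimal sketch, not a definitive method: it assumes the text has already been extracted into one list of lines per page, and the function name `strip_repeated_edges` and its parameters `edge` and `min_share` are hypothetical. Headers that change from page to page (for example "Page 3") would need an additional pattern-based rule.

```python
from collections import Counter


def strip_repeated_edges(pages, edge=1, min_share=0.6):
    """Drop top/bottom lines that recur on at least `min_share` of the pages.

    `pages` is a list of pages, each page being a list of text lines.
    `edge` is how many lines at the top and bottom count as candidates.
    """
    counts = Counter()
    for lines in pages:
        # Count each candidate line once per page.
        counts.update(set(lines[:edge] + lines[-edge:]))

    threshold = min_share * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = [
            line
            for i, line in enumerate(lines)
            # Only lines in the edge region are eligible for removal.
            if not ((i < edge or i >= len(lines) - edge) and line in repeated)
        ]
        cleaned.append(kept)
    return cleaned


# Made-up sample pages for illustration.
pages = [
    ["ACME Report", "First page body.", "Confidential"],
    ["ACME Report", "Second page body.", "Confidential"],
    ["ACME Report", "Third page body.", "Confidential"],
]
cleaned = strip_repeated_edges(pages)
print(cleaned)
```

Restricting removal to the edge region keeps a body sentence from being deleted just because it happens to match a header line.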

Remove HTML / XML Markup and Metadata
Markup lets us specify, in a machine-readable way, how a document is structured and how it should be presented, while metadata lets us attach different kinds of information about the digital document (e.g. descriptive or administrative). Thus, markup adds structure and metadata adds information about the document. Neither is part of the text itself, so both are stripped before processing.
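As an illustration, the sketch below strips tags with Python's standard-library `html.parser` and skips the content of `head`, `script` and `style` elements, where metadata and non-text content usually live. The class name `TextExtractor`, the helper `strip_markup` and the sample string are made up for this example. Real-world pages are often malformed, so a dedicated third-party HTML parser may be a more robust choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and ignore tags, metadata, scripts and styles."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside an element we want to ignore

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_markup(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    "<html><head><title>Doc</title>"
    "<meta name='author' content='A'></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
text = strip_markup(sample)
print(text)
```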

Extract Data from Other Formats (e.g. JSON)
Different file types have their own data formats and structures. JSON is one of the common formats we will explore. Extracting the text from a complicated data format is a key part of noise removal.
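For JSON, this usually means walking the nested structure and keeping only the fields that hold text. The sketch below does that with the standard `json` module. The key names `title` and `body`, the function `extract_text` and the sample record are assumptions made for illustration; in practice the keys depend on the data source.

```python
import json


def extract_text(node, text_keys=("title", "body")):
    """Yield string values stored under `text_keys` anywhere in nested JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in text_keys and isinstance(value, str):
                yield value
            else:
                # Descend into nested objects and arrays.
                yield from extract_text(value, text_keys)
    elif isinstance(node, list):
        for item in node:
            yield from extract_text(item, text_keys)
    # Numbers, booleans, nulls and other strings are ignored as noise.


# Made-up sample record for illustration.
raw = """
{
  "id": 7,
  "title": "Noise removal",
  "tags": ["nlp"],
  "comments": [
    {"user": "a", "body": "Useful."},
    {"user": "b", "body": "Thanks!"}
  ]
}
"""
record = json.loads(raw)
texts = list(extract_text(record))
print(texts)
```

Selecting by key rather than keeping every string avoids pulling identifiers, user names and tags into the text.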