Noise Removal | Remove Text File Headers and Footers

Noise is information in the input that is not needed for the task at hand. Noise removal is a filtering step that discards it so that only the useful text remains.

When the input file is a PDF or DOCX document, its pages usually carry headers and footers. Most of the time these contribute nothing to natural language processing, so we remove them. Common headers and footers include the title, page number, date, and author.
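
One possible way to do this, once the text has already been extracted page by page, is to look for lines that repeat at the top or bottom of most pages. The sketch below is only an illustration under that assumption: the function name `remove_headers_footers` and its parameters `edge_lines` and `min_ratio` are hypothetical, and digits are masked so that "Page 1" and "Page 2" count as the same line.

```python
import re
from collections import Counter


def _normalize(line):
    # Mask digit runs so "Page 3" and "Page 4" compare as equal.
    return re.sub(r"\d+", "#", line.strip())


def remove_headers_footers(pages, edge_lines=2, min_ratio=0.6):
    """Drop lines that repeat near the top or bottom of most pages.

    pages      -- list of strings, one per page (assumed already extracted)
    edge_lines -- how many lines at each page edge to inspect (assumption)
    min_ratio  -- fraction of pages a line must appear on to count as noise
    """
    split_pages = [page.splitlines() for page in pages]

    def is_edge(i, n):
        return i < edge_lines or i >= n - edge_lines

    # Count, per page, which normalized edge lines occur.
    counts = Counter()
    for lines in split_pages:
        n = len(lines)
        seen = {
            _normalize(line)
            for i, line in enumerate(lines)
            if is_edge(i, n) and line.strip()
        }
        counts.update(seen)

    # A line must repeat on at least two pages to be treated as noise.
    threshold = max(2, min_ratio * len(pages))
    noise = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        n = len(lines)
        kept = [
            line
            for i, line in enumerate(lines)
            if not (is_edge(i, n) and _normalize(line) in noise)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Only the edge lines of each page are checked, so a sentence in the body that happens to repeat across pages is left alone. The threshold values are guesses and would likely need tuning for real documents.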

#text #preprocessing