Fundamental Techniques - Import (R): htm2txt

Package:
htm2txt

Functionality:
Convert HTML into text.

Description:
Converts an HTML document to plain text by removing all HTML tags. The package uses regular expressions to strip the tags. It also offers the gettxt() and browse() functions, which let you retrieve or display the text of a given web page.
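To give a feel for what "stripping tags with regular expressions" means, here is a minimal base R sketch. It is not the htm2txt source code: strip_tags is a name made up for this illustration, and it only handles <br> and <p> before dropping every other tag, so the real package may well behave differently on other input.

```r
# Simplified, hypothetical illustration of regex-based tag stripping.
# strip_tags is NOT part of htm2txt; it only mimics the general idea.
strip_tags <- function(html) {
  # <br> or <br/> becomes a single line break
  text <- gsub("<br\\s*/?>", "\n", html, ignore.case = TRUE, perl = TRUE)
  # opening or closing <p> becomes a paragraph break
  text <- gsub("</?p(\\s[^>]*)?>", "\n\n", text, ignore.case = TRUE, perl = TRUE)
  # drop every remaining tag
  text <- gsub("<[^>]+>", "", text, perl = TRUE)
  # collapse runs of three or more newlines into one blank line
  text <- gsub("\n{3,}", "\n\n", text, perl = TRUE)
  # remove leading and trailing whitespace
  trimws(text)
}

plain1 <- strip_tags("<html><body>html texts</body></html>")
plain2 <- strip_tags(c("Hello<p>World", "Goodbye<br>Friends"))
print(plain1)
print(plain2)
```

A regex approach like this is quick and has no dependencies, but it does not parse the document, so malformed or unusual markup can slip through.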

Demonstration:
The input data is an HTML page and a few HTML snippets.
At the end of this demonstration, you will know which options to specify in order to import data from a website into R.

Functions to test (default settings):
browse(URL, ...)
gettxt(URL, encoding = "UTF-8", ...)
htm2txt(htm, list = "\n• ", pagebreak = "\n\n----------\n\n")

Input file:
https://cydalytics.blogspot.com/
and some HTML snippets

##################
library(htm2txt) #
##################

# scrape the text from the website
text_gettxt = gettxt("https://cydalytics.blogspot.com/")
str(text_gettxt)
##  chr "Skip to main content\n\nSubscribe\n\nSubscribe to this blog\n\nFollow by Email\n\ncyda\n\nMenu\n\n• Home\n• Cyd"| __truncated__

# display the text from the website
text_browse = browse("https://cydalytics.blogspot.com/")
## Skip to main content
## 
## Subscribe
## 
## Subscribe to this blog
## 
## Follow by Email
## 
## cyda
## 
## Menu
## 
## • Home
## • Cydademia
## • Hackathons
## • Projects
## • About Us
## 
## More…
## 
## Posts
## 
## Featured Post
## 
## October 01, 2018
## 
## Fundamental Techniques - Import (R): jsonlite
## 
## Package:
## jsonlite
## Functionality:
## Convert R objects to/from JSON
## Description:
## These functions are used to convert between JSON data and R objects. The toJSON and fromJSON functions use a class based mapping, which follows conventions outlined in this paper: https://arxiv.org/abs/1403.2805 (also available as vignette).
## Demonstration:
## The input data is a json file.
## At the end of this demonstration, you will what options should be specified in order to import json data in R.
## Function to test (default settings):
## fromJSON(txt, simplifyVector = TRUE, simplifyDataFrame = simplifyVector, simplifyMatrix = simplifyVector, flatten = FALSE, ...)
## Input file:
## https://api.github.com/users/hadley/repos
## ###################library(jsonlite)##################### read jsonjson_data=fromJSON("https://api.github.com/users/hadley/repos",flatten= T)
## head(json_data[,1:5]) ## id node_id name full_name private ## 1 40423928 MDEwOlJlcG9…
## 
## Post a Comment
## 
## Read more
## 
## Latest Posts
## 
## September 23, 2018
## 
## Fundamental Techniques - Import (R): textreadr & readtext
## 
## Post a Comment
## 
## September 17, 2018
## 
## Fundamental Techniques - Import (R): readxl
## 
## Post a Comment
## 
## September 16, 2018
## 
## Data Visualization Tips (Power BI): Convert categorical variables to dummy variables
## 
## Post a Comment
## 
## September 11, 2018
## 
## Fundamental Techniques - Import & Export (R): xlsx
## 
## Post a Comment
## 
## September 08, 2018
## 
## What is Deep Learning?
## 
## Post a Comment
## 
## September 01, 2018
## 
## What is Machine Learning?
## 
## Post a Comment
## 
## Older Posts
## 
## Powered by Blogger
## 
## Created by cyda - Yeung Wong & Carrie Lo
## 
## cyda
## 
## An analytics site disclosing you the scene behind the data
## 
## Menu
## 
## • Home
## • Cydademia
## • Hackathons
## • Projects
## • About Us
## 
## LinkedIn
## 
## • Carrie Lo
## • Yeung Wong
## 
## Github - cydalytics
## 
## • Stock Price Scraping
## • Image_Tag_Processing
## • Weibo Posts Topic Classification
str(text_browse)
##  NULL

browse() only displays the plain text of a URL. As str(text_browse) shows, it returns NULL, so the displayed text cannot be stored in an R object; use gettxt() when you need to keep the text. Still, browse() is handy for checking whether the scraped text is correct.

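The difference between a function that returns text and one that only prints it can be sketched with two stand-in functions. get_text and show_text are hypothetical names, not htm2txt functions, and the sketch assumes that browse() prints in a cat()-like way, which the NULL result above suggests but which is not confirmed here.

```r
# Hypothetical stand-ins, not the htm2txt functions themselves.
get_text  <- function(x) x        # returns the text, like gettxt()
show_text <- function(x) cat(x)   # only prints the text, like browse()

stored  <- get_text("Hello\nWorld")
printed <- show_text("Hello\nWorld")

# cat() returns NULL, so nothing useful ends up in `printed`
print(is.null(printed))
print(stored)
```

So a practical pattern is to store the result of gettxt() first and pass it to cat() whenever you want to look at it.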
# remove html tag
text1 = htm2txt("<html><body>html texts</body></html>")
text1
## [1] "html texts"

text2 = htm2txt(c("Hello<p>World", "Goodbye<br>Friends"))
text2
## [1] "Hello\n\nWorld"   "Goodbye\nFriends"

text3 = htm2txt("<p>Menu:</p><ul><li>Coffee</li><li>Tea</li></ul>", list = "\n- ")
text3
## [1] "Menu:\n\n- Coffee\n- Tea"

text4 = htm2txt("Page 1<hr>Page 2", pagebreak = "\n\n[NEW PAGE]\n\n")
text4
## [1] "Page 1\n\n[NEW PAGE]\n\nPage 2"

Summary:
In the examples above, all HTML markup and tags are removed and the output is stored as character strings. The original structure of the data is also kept. For example, the list structure in text3 is still recognizable: each item sits on its own line, separated by "\n". The imported text is clean and ready to use after some text preprocessing.
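As one possible preprocessing step, the newline-separated text can be split into lines and the list items pulled out. This is only a sketch in base R: page_text is a made-up string shaped like htm2txt() output produced with list = "\n- ", not a result taken from the examples above.

```r
# Hypothetical input, shaped like htm2txt() output with list = "\n- "
page_text <- "Drinks:\n\n- Coffee\n- Tea\n- Juice"

# split on line breaks, trim whitespace and drop empty lines
lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

# keep only the list items and remove the "- " marker
items <- sub("^- ", "", lines[startsWith(lines, "- ")])

print(lines)
print(items)
```

Choosing a distinctive list marker when calling htm2txt() makes this kind of filtering easier, because ordinary sentences are then less likely to start with the same characters.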
