Fundamental Techniques - Import (R): htm2txt

Package:
htm2txt

Functionality:
Convert HTML into text.

Description:
Convert an HTML document to plain text by removing all HTML tags. The package uses regular expressions to strip the tags. It also offers the gettxt() and browse() functions, which let you get or browse the text of a given web page.
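
To make the regular-expression idea concrete, here is a minimal base R sketch. The helper name strip_tags() and its three patterns are assumptions made for illustration; this is not the package's actual implementation, which handles many more cases (lists, page breaks, entities).

```r
# Hypothetical helper: a rough illustration of regex-based tag stripping.
# It is NOT how htm2txt is implemented internally.
strip_tags <- function(html) {
  # turn <br> / <br/> into a single newline
  txt <- gsub("<br\\s*/?>", "\n", html, ignore.case = TRUE, perl = TRUE)
  # turn an opening <p> (with or without attributes) into a blank line
  txt <- gsub("<p(\\s[^>]*)?>", "\n\n", txt, ignore.case = TRUE, perl = TRUE)
  # drop every remaining tag
  txt <- gsub("<[^>]+>", "", txt, perl = TRUE)
  # remove leading and trailing whitespace
  trimws(txt)
}

res1 <- strip_tags("<html><body>html texts</body></html>")
res2 <- strip_tags(c("Hello<p>World", "Goodbye<br>Friends"))
res1
res2
```

A regex approach like this is fine for simple, well-behaved markup; for nested or malformed HTML a real parser is usually the safer choice.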

Demonstration:
The input data is a web page and some examples of HTML code.
At the end of this demonstration, you will know which options to specify in order to import text from a website into R.

Functions to test (default settings):
browse(URL, ...)
gettxt(URL, encoding = "UTF-8", ...)
htm2txt(htm, list = "\n• ", pagebreak = "\n\n----------\n\n")

Input file:
https://cydalytics.blogspot.com/
and some HTML code snippets
##################
library(htm2txt) #
##################

# scrape the text from the website
text_gettxt = gettxt("https://cydalytics.blogspot.com/")
str(text_gettxt)
##  chr "Skip to main content\n\nSubscribe\n\nSubscribe to this blog\n\nFollow by Email\n\ncyda\n\nMenu\n\n• Home\n• Cyd"| __truncated__

# display the text from the website
text_browse = browse("https://cydalytics.blogspot.com/")
## Skip to main content
## 
## Subscribe
## 
## Subscribe to this blog
## 
## Follow by Email
## 
## cyda
## 
## Menu
## 
## • Home
## • Cydademia
## • Hackathons
## • Projects
## • About Us
## 
## More…
## 
## Posts
## 
## Featured Post
## 
## October 01, 2018
## 
## Fundamental Techniques - Import (R): jsonlite
## 
## Package:
## jsonlite
## Functionality:
## Convert R objects to/from JSON
## Description:
## These functions are used to convert between JSON data and R objects. The toJSON and fromJSON functions use a class based mapping, which follows conventions outlined in this paper: https://arxiv.org/abs/1403.2805 (also available as vignette).
## Demonstration:
## The input data is a json file.
## At the end of this demonstration, you will what options should be specified in order to import json data in R.
## Function to test (default settings):
## fromJSON(txt, simplifyVector = TRUE, simplifyDataFrame = simplifyVector, simplifyMatrix = simplifyVector, flatten = FALSE, ...)
## Input file:
## https://api.github.com/users/hadley/repos
## ###################library(jsonlite)##################### read jsonjson_data=fromJSON("https://api.github.com/users/hadley/repos",flatten= T)
## head(json_data[,1:5]) ## id node_id name full_name private ## 1 40423928 MDEwOlJlcG9…
## 
## Post a Comment
## 
## Read more
## 
## Latest Posts
## 
## September 23, 2018
## 
## Fundamental Techniques - Import (R): textreadr & readtext
## 
## Post a Comment
## 
## September 17, 2018
## 
## Fundamental Techniques - Import (R): readxl
## 
## Post a Comment
## 
## September 16, 2018
## 
## Data Visualization Tips (Power BI): Convert categorical variables to dummy variables
## 
## Post a Comment
## 
## September 11, 2018
## 
## Fundamental Techniques - Import & Export (R): xlsx
## 
## Post a Comment
## 
## September 08, 2018
## 
## What is Deep Learning?
## 
## Post a Comment
## 
## September 01, 2018
## 
## What is Machine Learning?
## 
## Post a Comment
## 
## Older Posts
## 
## Powered by Blogger
## 
## Created by cyda - Yeung Wong & Carrie Lo
## 
## cyda
## 
## An analytics site disclosing you the scene behind the data
## 
## Menu
## 
## • Home
## • Cydademia
## • Hackathons
## • Projects
## • About Us
## 
## LinkedIn
## 
## • Carrie Lo
## • Yeung Wong
## 
## Github - cydalytics
## 
## • Stock Price Scraping
## • Image_Tag_Processing
## • Weibo Posts Topic Classification
str(text_browse)
##  NULL

browse() only displays the plain text of a URL. It returns NULL, so the displayed text cannot be stored in an R object. It is still useful for checking whether the scraped text looks correct; use gettxt() when you need to keep the text.
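
If you want to both see the text and keep it, one option is a small wrapper that prints its input and returns it invisibly. The sketch below uses base R only; show_and_keep() is a hypothetical helper name, and the sample string stands in for text that would normally come from gettxt().

```r
# Hypothetical helper: print the text like browse() does, but also return it.
show_and_keep <- function(txt) {
  cat(txt, sep = "\n")  # display the plain text in the console
  invisible(txt)        # hand the same text back to the caller
}

# Stand-in for scraped text; in practice (assuming htm2txt is loaded):
# page_text <- show_and_keep(gettxt("https://cydalytics.blogspot.com/"))
page_text <- show_and_keep("Skip to main content\n\nSubscribe")
str(page_text)
```

Unlike the result of browse(), page_text is a character vector that can be passed on to later processing steps.
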
# remove html tag
text1 = htm2txt("<html><body>html texts</body></html>")
text1
## [1] "html texts"

text2 = htm2txt(c("Hello<p>World", "Goodbye<br>Friends"))
text2
## [1] "Hello\n\nWorld"   "Goodbye\nFriends"

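# note: the first list item opens with </li> instead of <li>,
# so the output below shows no list marker before "Coffee"
text3 = htm2txt("<p>Menu:</p><ul></li>Coffee</li><li>Tea</li></ul>", list = "\n- ")
text3
## [1] "Menu:\n\nCoffee\n- Tea"
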
text4 = htm2txt("Page 1<hr>Page 2", pagebreak = "\n\n[NEW PAGE]\n\n")
text4
## [1] "Page 1\n\n[NEW PAGE]\n\nPage 2"

Summary:
In the above examples, all HTML tags are removed and the outputs are stored as character strings. The original structure of the data is largely kept. For example, the list structure of text3 is still visible: the items are separated by "\n", and the item written with a proper <li> tag carries the "- " marker. The imported text is clean and ready to use after some text preprocessing.
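
As one possible preprocessing step, the newline-separated string can be split into one element per line. This is a minimal base R sketch that reuses the text3 value printed above; the variable name rows is an assumption for illustration.

```r
# The value of text3 as printed in the demonstration above
text3 <- "Menu:\n\nCoffee\n- Tea"

# Split on newlines, trim surrounding whitespace, and drop empty lines
rows <- unlist(strsplit(text3, "\n", fixed = TRUE))
rows <- trimws(rows)
rows <- rows[nzchar(rows)]
rows  # three elements: "Menu:", "Coffee", "- Tea"
```

The same steps apply to the much longer string returned by gettxt(), where dropping empty lines removes the blank rows left behind by paragraph breaks.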
