Fundamental Techniques - Import & Export (R): xlsx


Package: 
xlsx

Functionality:
Read, write, format Excel 2007 and Excel 97/2000/XP/2003 files

Description: 
The xlsx package gives programatic control of Excel files using R. A high level API allows the user to read a sheet of an xlsx document into a data.frame and write a data.frame to a file. Lower level functionality permits the direct manipulation of sheets, rows and cells. For example, the user has control to set colors, fonts, data formats, add borders, hide/unhide sheets, add/remove rows, add/remove sheets, etc.

Demonstration:
The input data involves English text, number, Traditional Chinese text and Simplified Chinese text.
At the end of this demonstration, you will know what options should be specified in order to import and export data with different format of context in R.

Function to test (default settings): 
read.xlsx (file, sheetIndex, sheetName=NULL, rowIndex=NULL, startRow=NULL, endRow=NULL, colIndex=NULL,  as.data.frame=TRUE, header=TRUE, colClasses=NA, keepFormulas=FALSE, encoding="unknown", password=NULL, ...)

read.xlsx2 (file, sheetIndex, sheetName=NULL, startRow=1, colIndex=NULL, endRow=NULL, as.data.frame=TRUE, header=TRUE, colClasses="character", password=NULL, ...)

write.xlsx (x, file, sheetName="Sheet1", col.names=TRUE, row.names=TRUE, append=FALSE, showNA=TRUE, password=NULL)

write.xlsx2 (x, file, sheetName="Sheet1",col.names=TRUE, row.names=TRUE, append=FALSE, password=NULL, ...)

Input file: 
Reference_Sample.xlsx

Code:
#############
library(xlsx)
#############
# read excel xlsx - method 1 

xlsx_df = read.xlsx(file="Reference_Sample.xlsx", sheetName="Sample_Sheet", header=T, stringsAsFactors=F, encoding="UTF-8")


str(xlsx_df)


header=T retrieves the title of xlsx file in R. 
stringsAsFactors=F is used instead of T as we do not want R to treat the character inputs as factors.
You can find that as the encoding is set to "UTF-8", so the Chinese characters are shown in UTF code. If encoding is not set, garbled code is shown.


If you want to display Chinese characters or specific characters in other languages, the locale should be set beforehand. In this example, local = "Chinese" should be specified. 
*special note: local = "cht" can only display traditional Chinese text while local = "Chinese" can display both Traditional and Simplified Chinese words.

Sys.setlocale(category = "LC_ALL", locale = "Chinese")
# read excel xlsx - method 1 

xlsx_df = read.xlsx(file="Reference_Sample.xlsx", sheetName="Sample_Sheet", header=T, stringsAsFactors=F, encoding="UTF-8")


# read excel xlsx - method 2
xlsx_df2 = read.xlsx2(file="Reference_Sample.xlsx", sheetName="Sample_Sheet", header=T, stringsAsFactors=F)


You can find that both methods, i.e. read.xlsx and read.xlsx2 show data in same format. As the encoding is default to be "UTF-8", so if the locale is set properly, i.e. locale = "Chinese", same output should be shown. Yet, when you check the structure of xlsx_df2, you can find something different.

str(xlsx_df2)


The structure of all variables become character. It may cause inconvenience if you want to do calculation later.
When using read.xlsx and read.xlsx2, if column names contain space, all the spaces are changed to "."
Thus, when you do selection on columns by using column names, you need to use the new names, i.e. "Traditional.Chinese" and "Simplified.Chinese" in this case. 

The following section is the demonstration for writing excel output.

# write excel xlsx - method 1 
write.xlsx(xlsx_df, file="Output1.xlsx", sheetName="Sample_Sheet", row.names=F, showNA=F)


row.names = F is indicated to remove the row index used in R.
showNA = F is used so blanks will remain as blanks instead of being replaced by "NA"  in the output file.
However, the column names remain as those shown in R if the column names are not defined again before exporting, so you will still find those "." in the column names.

# write excel xlsx - method 2
write.xlsx2(xlsx_df2, file="Output2.xlsx", sheetName="Sample_Sheet", row.names=F, showNA=F)


The output of using write.xlsx2 shows similar output, but the structure of all columns of xlsx_df2 are character, so the output will retain the same structure, and you can see that the numbers on column B become characters.

Summary:
read.xlsx is recommended as the original data structure is not changed, but you need to define the encoding method. (encoding = "UTF-8" is commonly used.)
For writing excel file, the output files of both write.xlsx and write.xlsx2 are quite similar, so you can use either one.
One limitation of using xlsx is that it only supports the file with .xlsx extension, if you want to read xls file, readxl can be used (will be talked about in the next article).

Comments