Sunday, February 26, 2017

File I/O, Transforms, and Filtering

File input and output in the R language is built around read.table(); convenience functions such as read.csv() are wrappers around it with different defaults. Data files commonly use the comma-separated values (CSV) format, but other separators, such as tabs and pipes, are also in use. The sep argument of read.table() (and of its counterpart, write.table()) specifies the separator character when reading or writing a file. The read.table() function can also read a file directly over HTTP when given a URL instead of a local path.
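As a minimal sketch of the separator option, the snippet below parses the same small table once in pipe-separated form and once in tab-separated form. The inline text and the column names (Name, Sex, Grade) are assumptions made for illustration; they are not the contents of the data file used in this post.

```r
# Hypothetical sample data. The text= argument lets read.table() parse a
# string instead of a file, which keeps the example self-contained.
pipe_text <- "Name|Sex|Grade\nRaul|Male|80\nLiz|Female|95"
tab_text  <- "Name\tSex\tGrade\nRaul\tMale\t80\nLiz\tFemale\t95"

pipe_df <- read.table(text = pipe_text, sep = "|", header = TRUE,
                      stringsAsFactors = FALSE)
tab_df  <- read.table(text = tab_text, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)

# Both separators describe the same table once parsed.
print(pipe_df)
print(identical(pipe_df, tab_df))
```

A remote file can be read the same way by passing its URL as the first argument in place of the text= string.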

If the programmer does not want to hard-code a filename and path in the R script, a call to file.choose() can be nested inside read.table() or read.csv(). The file.choose() function opens the operating system's file selection dialog, and the path of the chosen file is then passed to read.table() or read.csv() for processing.
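Because file.choose() needs a person at the keyboard, a script that may also run unattended can fall back to the dialog only when no path is given. The helper below is a sketch of that idea; the name read_scores and the sample rows are hypothetical and not part of the original script.

```r
# Hypothetical helper: use the supplied path, or open the file dialog
# when no path is given and the session is interactive.
read_scores <- function(path = NULL) {
  if (is.null(path)) {
    if (!interactive()) {
      stop("No path supplied and no interactive session for file.choose()")
    }
    path <- file.choose()
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Non-interactive usage, with a temporary file standing in for the real data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Sex,Grade", "Raul,Male,80", "Liz,Female,95"), tmp)
scores <- read_scores(tmp)
print(scores)
```

In an interactive session, calling read_scores() with no argument behaves like read.csv(file.choose()).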

The partner function to read.table() is write.table(), with one notable difference: write.table() cannot write a file over HTTP (for example, with a PUT request).
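The pairing of the two functions can be sketched as a round trip through a temporary file. The data frame below is hypothetical sample data, and the pipe separator is chosen only to show that the same sep value has to be used on both sides.

```r
# Hypothetical sample data standing in for the real data set.
scores <- data.frame(Name = c("Raul", "Liz"),
                     Sex = c("Male", "Female"),
                     Grade = c(80, 95),
                     stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".txt")

# Write with a pipe separator; row.names = FALSE keeps the row index
# out of the file so the header and data rows have the same fields.
write.table(scores, file = out_file, sep = "|", row.names = FALSE)

# Read it back with the matching separator and a header row.
round_trip <- read.table(out_file, sep = "|", header = TRUE,
                         stringsAsFactors = FALSE)
print(round_trip)
```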

To explore file I/O and to experiment with the plyr package, we will read in a provided file, compute the average test score by sex for the data set, and then write the resulting table to a new file.

Then we will filter the data set to the names containing the letter 'i' (in either case) and write that subset to a new file.
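The case-insensitive filter can be tried on a handful of names before running it on the real file. The names below are made up for illustration. The sketch compares a character class with the ignore.case argument of grepl(), which should select the same rows.

```r
# Hypothetical names for illustration only.
people <- data.frame(Name = c("Ian", "Bob", "Liz", "IRENE", "Raul"),
                     stringsAsFactors = FALSE)

# Character class: matches a lower-case or upper-case "i" anywhere in the name.
by_class <- subset(people, grepl("[iI]", Name))

# Alternative: a single-letter pattern with ignore.case = TRUE.
by_flag <- subset(people, grepl("i", Name, ignore.case = TRUE))

print(by_class$Name)
print(identical(by_class, by_flag))
```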

hw8 <- read.csv(file.choose())

# 1) Import dataset.txt.
# 2) Run a mean using Sex as the category (use the plyr package for this
# operation), then write the resulting output to a file.
# 3) Test dataset.txt as a data frame for names that contain the
# letter i, then create a new data set with those names. Write those names
# to a comma-separated (CSV) file.

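library(plyr)

# transform keeps every row and adds the group mean of Grade as Grade.Average
mean_by_sex <- ddply(hw8, "Sex", transform, Grade.Average = mean(Grade))
# Default write.table() output: space-separated, row names included
write.table(mean_by_sex, file = "Mean_by_sex.txt")

# Keep rows whose Name contains an upper- or lower-case "i"
the_i_s <- subset(hw8, grepl("[iI]", Name))
# sep = "," gives comma-separated output; row names are still written by default
write.table(the_i_s, file = "contains_i.csv", sep = ",")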
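Passing transform to ddply() keeps every input row and attaches the group mean to each one, which is different from a summary with one row per group. The sketch below shows the two shapes side by side on hypothetical sample data. It uses base R (ave() and aggregate()) rather than plyr so that it runs without extra packages, so it illustrates the idea and is not the script's own method.

```r
# Hypothetical sample data; column names mirror those used in the script.
scores <- data.frame(Name = c("Raul", "Liz", "Ian", "Mia"),
                     Sex = c("Male", "Female", "Male", "Female"),
                     Grade = c(80, 95, 70, 85),
                     stringsAsFactors = FALSE)

# Row-preserving form (comparable to ddply with transform):
# every row is kept and gains the mean Grade of its own Sex group.
per_row <- transform(scores, Grade.Average = ave(Grade, Sex, FUN = mean))

# Summary form: one row per Sex group.
per_group <- aggregate(Grade ~ Sex, data = scores, FUN = mean)

print(per_row)
print(per_group)
```

With plyr, the summary form would be requested by passing summarise instead of transform to ddply().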

Applying ddply() with transform and mean() to the Grade column, grouped by the Sex column, produces the table below:


Filtering with R's implementation of grep and a regular expression that matches either an upper- or lower-case letter 'i' produces the table below:


These two tables were written to file in different formats: the average score table was written as a raw table with the write.table() defaults, and the 'i'-filtered table was written in CSV format. The contents of both files were viewed to confirm the format as written to disk.
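One way to inspect what write.table() puts on disk without leaving R is to read the file back as raw lines. The sketch below uses hypothetical sample data and temporary files. It also shows that, with the default row.names = TRUE, the data rows carry one more field than the header row, which matters if the CSV is opened in another tool.

```r
# Hypothetical sample data.
scores <- data.frame(Name = c("Raul", "Liz"), Grade = c(80, 95),
                     stringsAsFactors = FALSE)

default_file <- tempfile(fileext = ".csv")
clean_file   <- tempfile(fileext = ".csv")

# Defaults apart from sep: row names are written, the header has no field for them.
write.table(scores, file = default_file, sep = ",")
# Same data without the row-name column.
write.table(scores, file = clean_file, sep = ",", row.names = FALSE)

default_lines <- readLines(default_file)
clean_lines   <- readLines(clean_file)

print(default_lines)
print(clean_lines)
```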




