Data Science and Visualization Explorations: R Language: Importing Data and Performing Basic Descriptive Statistical Measures

R provides functions for reading in data locally from a file or remotely over a network. The documentation for the read.csv() function is accessible from RStudio by typing help(read.csv) in the Console window or from the Help tab. From it we learn that read.csv is a variant of the read.table function and a sister function to read.csv2, read.delim, and read.delim2, all of which read in a file and create a data frame.

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

The file argument is an abstraction that allows for passing more than just a local csv or text file for reading. When we look up ‘file’ in Help, the documentation describes “Functions to create, open and close connections, i.e. ‘generalized files’, such as possibly compressed files, URLs, pipes, etc.” This means we can pass read.csv a local file, a URL, or a connection such as a pipe or socket, as well as a compressed file of type GZ, BZ, or XZ; a file inside a ZIP archive can be read through an unz() connection. The ability to process compressed files directly is attractive because it saves disk space, especially since text-based csv files often contain a lot of redundancy that compresses well.
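As a minimal sketch of reading a compressed file, the example below writes a tiny gzip-compressed CSV to a temporary file and reads it back. The data and the variable names (tmp, df_out, df_in) are made up for illustration and are not part of the assignment data.

```r
# Build a small gzip-compressed CSV in a temporary location (illustrative data).
tmp <- tempfile(fileext = ".csv.gz")
df_out <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

con <- gzfile(tmp, "w")                    # gzip connection opened for writing
write.csv(df_out, con, row.names = FALSE)
close(con)

# Passing the path lets read.csv detect the compression and decompress on the fly.
df_in <- read.csv(tmp)

# An explicit connection works as well; read.csv opens and closes it itself.
df_in2 <- read.csv(gzfile(tmp))

print(df_in)
```

Either call should return the same three-row data frame that was written out.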

Using the American Community Survey data sample for Oregon referenced in the assignment, we can import that CSV-formatted data directly over HTTP into R; the read.csv function will attempt to interpret the formatting and header row and populate a data frame. To assign the data frame to a variable we can use either the ‘<-’ symbol combination or ‘=’. Which to use is a matter of style, but the R user community generally prefers ‘<-’ for assignment so that it is not confused with the equality check ‘==’ or with ‘=’ used to pass named function arguments.
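The two assignment forms can be sketched as follows. To keep the example self-contained it reads a few made-up CSV rows from a string through the text argument of read.csv rather than from the survey URL; the column household and all values are hypothetical stand-ins, and only number_children is a name taken from the real data.

```r
# Hypothetical stand-in for the survey file: a few CSV rows held in a string.
csv_text <- "household,number_children
1,2
2,0
3,0"

acs_demo  <- read.csv(text = csv_text)   # assignment with <-
acs_demo2 =  read.csv(text = csv_text)   # assignment with = also works here

# Comparison never assigns: == tests elements, identical() tests whole objects.
same <- identical(acs_demo, acs_demo2)
print(same)

# For a remote file, a URL string would be passed as the first argument instead
# of text =, for example read.csv(data_url), where data_url is a placeholder.
```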

Once the file has been read in, RStudio provides a quick way to inspect the data as it was interpreted, from the Environment tab. It shows that 7811 rows (observations) of 14 variables each have been assigned to the data frame named ‘acs’, along with a preview of each column and the data type R inferred for it (int, factor, etc.):
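The same information shown in the Environment tab can be requested from the console. The sketch below uses a small made-up data frame (acs_demo, with hypothetical columns household and language) in place of the real ‘acs’ data. Note that since R 4.0.0 text columns are read as character rather than factor unless stringsAsFactors = TRUE is passed.

```r
# Small illustrative data frame standing in for the real survey data.
acs_demo <- data.frame(
  household       = 1:4,
  number_children = c(0L, 2L, 0L, 1L),
  language        = c("English", "Spanish", "English", "English")
)

dims <- dim(acs_demo)      # number of rows, then number of columns
print(dims)

str(acs_demo)              # compact listing: one line per column with its type

col_types <- sapply(acs_demo, class)   # class of each column as a named vector
print(col_types)
```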

Clicking directly on the variable from the Environment tab will display the entire data frame in table format in a new tab.

Another function in R to quickly inspect and review a data frame is summary(), responding with basic statistical measures on each column/variable:
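A minimal sketch of summary(), again on a made-up data frame rather than the real ‘acs’ data (the household column and all values are hypothetical):

```r
# Illustrative data frame standing in for the survey data.
acs_demo <- data.frame(
  household       = 1:6,
  number_children = c(0, 0, 2, 1, 0, 3)
)

# One block per column: min, quartiles, median, mean, and max for numeric columns.
print(summary(acs_demo))

# summary() also works on a single column.
child_summary <- summary(acs_demo$number_children)
print(child_summary)
```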

Having confirmed that the data has been read in, that column names have been assigned, and that the type of each column seems correct, it is possible to begin exploring and manipulating the data. For example, all of the values for number_children can be accessed by specifying the data frame variable name followed by a $ and the name of the column:
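Column access with $ can be sketched on a small made-up data frame (the values and the household column are hypothetical):

```r
# Illustrative data frame standing in for the survey data.
acs_demo <- data.frame(
  household       = 1:6,
  number_children = c(0, 0, 2, 1, 0, 3)
)

# $ extracts one column as a plain vector.
children <- acs_demo$number_children
print(children)

# Double brackets with the column name as a string give the same vector.
children2 <- acs_demo[["number_children"]]
```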

The summary(acs) output already displayed the mean value for number_children but it is also possible to directly calculate it using the mean() function:
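A short sketch of mean() on a made-up vector standing in for acs$number_children. The na.rm argument is shown because mean() returns NA when the input contains missing values; whether the real column has any is not stated here.

```r
# Made-up values standing in for acs$number_children.
children <- c(0, 0, 2, 1, 0, 3)

avg <- mean(children)
print(avg)   # 1

# With missing values present, na.rm = TRUE drops them before averaging.
children_na <- c(children, NA)
avg_na <- mean(children_na, na.rm = TRUE)
print(avg_na)   # 1
```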

It is also possible to directly calculate the mode, variance, and standard deviation:
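The three calls can be sketched on a made-up vector standing in for acs$number_children. var() and sd() compute the sample variance and sample standard deviation (dividing by n - 1).

```r
# Made-up values standing in for acs$number_children.
children <- c(0, 0, 2, 1, 0, 3)

v <- var(children)    # sample variance
s <- sd(children)     # sample standard deviation, the square root of var()
m <- mode(children)   # NOT the statistical mode; see the result below

print(v)   # 1.6
print(s)
print(m)   # "numeric"
```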

The response from mode() was “numeric”, which seems odd. The statistical mode is the value that occurs most often in a dataset, and among the values in the vector acs$number_children there is probably one that occurs more often than the others. A definitive answer can be found by plotting the vector in a histogram:
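A sketch of the histogram check on a made-up vector standing in for acs$number_children. A frequency table from table() is included as well, since it gives the same counts in text form.

```r
# Made-up values standing in for acs$number_children.
children <- c(0, 0, 2, 1, 0, 3)

# Histogram with one bar per integer value.
hist(children,
     breaks = seq(-0.5, max(children) + 0.5, by = 1),
     main = "Number of children", xlab = "number_children")

# The same frequencies as a table; names are the distinct values.
counts <- table(children)
print(counts)

most_common <- names(counts)[which.max(counts)]
print(most_common)   # "0"
```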

From this it can be seen that ‘0’ is the most frequently occurring value, making 0 the mode of acs$number_children. So what is going on here? Does the mode() function refer to something other than the statistical measure of mode? The console window’s autocomplete shows that the mode() function in the R environment returns the type or storage mode of an object, which explains the response of ‘numeric’ when it is passed the vector acs$number_children. Looking up mode in the Help window provides further details about this function.
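To illustrate what base R's mode() actually reports, here is a short sketch that calls it on a few different kinds of objects, alongside the related typeof() and storage.mode() functions:

```r
# mode() describes how an object is stored, not its most frequent value.
m_int  <- mode(1:3)          # "numeric"
m_dbl  <- mode(c(0.5, 1.5))  # "numeric"
m_chr  <- mode("a")          # "character"
m_lgl  <- mode(TRUE)         # "logical"
m_fun  <- mode(sum)          # "function"

# typeof() and storage.mode() distinguish integer from double storage.
t_int  <- typeof(1:3)               # "integer"
sm_dbl <- storage.mode(c(0.5, 1.5)) # "double"

print(c(m_int, m_dbl, m_chr, m_lgl, m_fun, t_int, sm_dbl))
```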

It appears that base R has no built-in function for the statistical mode, but an implementation is likely to exist in an R package that can be loaded into the current environment. A quick search reveals the package DescTools, whose documentation, available from https://cran.r-project.org/web/packages/DescTools/DescTools.pdf , indicates that Mode() calculates the most frequent value of a variable. This package can be installed from the console with install.packages() and then needs to be explicitly loaded into the current environment with library().

Because R is case-sensitive, this package’s function must be called as Mode(), with a capital M, not to be confused with mode():
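The sketch below shows one way to get the statistical mode in base R alongside the DescTools call. The helper stat_mode is a hypothetical function written for this example, not part of R or DescTools, and the vector is made up. The DescTools call is guarded so the code still runs when the package is not installed.

```r
# Made-up values standing in for acs$number_children.
children <- c(0, 0, 2, 1, 0, 3)

# Hypothetical base-R helper: returns the most frequent value(s) of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

base_result <- stat_mode(children)
print(base_result)   # 0

# DescTools version, if available. To install it first:
#   install.packages("DescTools")
if (requireNamespace("DescTools", quietly = TRUE)) {
  print(DescTools::Mode(children))
}

# The lower-case base function still reports the storage mode.
print(mode(children))   # "numeric"
```

If several values tie for the highest count, stat_mode returns all of them.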


Sunday, January 15, 2017
