R provides functions for reading in data locally from a file
or remotely over a network. The documentation for the read.csv() function
is accessible from RStudio by typing help(read.csv) in the
Console window or from the Help tab. From there we learn that read.csv is a
variant of the read.table function and a sister function to read.csv2,
read.delim, and read.delim2, all of which read in a file and create a data
frame.
read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)
The file argument is an abstraction that allows for passing
more than just a local csv or text file for reading. When we look up ‘file’ in
Help the documentation advises “Functions to create, open and close
connections, i.e. ‘generalized files’, such as possibly compressed files, URLs,
pipes, etc.” This means we can pass read.csv a local file, a URL, a pipe, a
socket, or a compressed file. Files compressed with gzip, bzip2, or xz can be
read directly, while a file inside a ZIP archive is read through an unz()
connection. The ability to directly process compressed files is attractive as
it saves disk storage space, especially since text-based csv files often
contain a lot of redundancy that compresses efficiently.
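As a rough sketch of the compressed-file case, the snippet below writes a tiny stand-in data frame to a gzip-compressed temporary file and reads it straight back with read.csv(). The data frame and its values are made up for illustration and are not the survey data.

```r
# Stand-in data; the values are invented for this illustration.
sample_df <- data.frame(id = 1:3, number_children = c(0, 2, 1))

# Write the data as gzip-compressed CSV to a temporary file.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "w")
write.csv(sample_df, con, row.names = FALSE)
close(con)

# read.csv() is given only the path; the gzip compression is detected
# when the file is opened, so no manual decompression step is needed.
from_gz <- read.csv(gz_path)
print(from_gz)
```

The same call shape applies to a URL: a character string beginning with http:// or https:// can be passed as the file argument in place of a local path.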
Using the American Community Survey data sample for Oregon
referenced in the assignment, we can directly import that CSV formatted data
over HTTP into R, and the read.csv function will attempt to understand the
formatting and header rows and populate a data frame. To assign the data frame
to a variable we can use the ‘<-’ symbol combination or ‘=’. Which to use is
a matter of style preference, but the R user community seems to prefer
making assignments with ‘<-’ so that assignment is not confused with the
equality check ‘==’ or with the ‘=’ used to name function arguments.
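A small illustration of the difference, using throwaway variable names that are not part of the original analysis:

```r
x <- 5          # assignment with the arrow, the style most R code uses
y = 5           # '=' also assigns when used at the top level

same <- x == y  # '==' compares the two values and yields TRUE or FALSE
print(same)

# Inside a call, '=' names an argument rather than creating a variable.
z <- seq(from = 1, to = 3)
print(z)
```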
Once the file has been read in, RStudio provides a quick way to
inspect the data as it was interpreted, from the Environment tab. From this it
can be seen that 7811 rows (observations) of data consisting of 14 variables
per row have been assigned to the data frame named ‘acs’, with a preview of each
column variable and the data type R determined for that variable (int, factor,
etc.).
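The same overview is available from the console with str(). The sketch below uses a small made-up data frame in place of ‘acs’; only the column name number_children comes from the real data, and the region column is invented.

```r
# Stand-in for the 'acs' data frame; 'region' is a made-up column.
acs_demo <- data.frame(
  number_children = c(0L, 0L, 2L, 1L, 0L, 3L),
  region          = factor(c("a", "b", "a", "c", "b", "a"))
)

# Prints the number of observations and variables, then one line per
# column with its type and the first few values.
str(acs_demo)
```

In R 4.0.0 and later, read.csv() leaves text columns as character unless stringsAsFactors = TRUE is passed, so a column shown as a factor here may appear as chr on a newer installation.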
Clicking directly on the variable from the Environment tab
will display the entire data frame in table format in a new tab.
Another function in R for quickly inspecting and reviewing a data
frame is summary(), which returns basic statistical measures for each
column/variable.
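A minimal sketch of summary() on stand-in data (six invented values, not the survey figures):

```r
# Stand-in for the 'acs' data frame with invented values.
acs_demo <- data.frame(number_children = c(0, 0, 2, 1, 0, 3))

# For a numeric column: minimum, quartiles, median, mean, and maximum.
print(summary(acs_demo))

# summary() also works on a single column and returns a named vector.
s <- summary(acs_demo$number_children)
print(s[["Mean"]])
```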
Having confirmed that the data has been read in, that column
variable names have been assigned, and that the types for each column seem
correct, it is possible to begin exploring and manipulating the data. For
example, all of the values for number_children can be accessed by first
specifying the data frame variable name
followed by a $ and the name of the column/variable.
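For illustration, with a made-up stand-in for the ‘acs’ data frame:

```r
# Stand-in for the 'acs' data frame with invented values.
acs_demo <- data.frame(number_children = c(0, 0, 2, 1, 0, 3))

# '$' extracts one column as a plain vector.
children <- acs_demo$number_children
print(children)

# Double brackets with the column name as a string do the same thing.
same_column <- acs_demo[["number_children"]]
```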
The summary(acs) output already displayed the mean value for
number_children, but it is also possible to calculate it directly using the
mean() function.
It would seem equally possible to directly calculate the mode,
variance, and standard deviation.
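A sketch of these calls on six invented values standing in for acs$number_children:

```r
# Invented values standing in for acs$number_children.
children <- c(0, 0, 2, 1, 0, 3)

print(mean(children))  # arithmetic mean
print(var(children))   # sample variance (divides by n - 1)
print(sd(children))    # sample standard deviation, sqrt of the variance
print(mode(children))  # not the statistical mode; see the text that follows
```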
The response for mode() was “numeric”, which seems odd.
The mode should be the value that occurs most often in the dataset, and among
the values in the vector acs$number_children there is probably one
value occurring more often than the others. A definitive answer can be found by
plotting the vector in a histogram.
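A sketch of that check on invented stand-in values, with a frequency table alongside the histogram. The breaks, title, and axis label are choices made for this example.

```r
# Invented values standing in for acs$number_children.
children <- c(0, 0, 2, 1, 0, 3)

# Count how often each distinct value occurs.
counts <- table(children)
print(counts)

# Breaks at the half-integers give every whole number its own bar.
hist(children,
     breaks = seq(-0.5, 3.5, by = 1),
     main   = "Number of children (stand-in data)",
     xlab   = "number_children")
```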
From this it can be seen that the value ‘0’ is the most
frequently occurring value, making 0 the mode of acs$number_children. So what is
going on here? Does the mode() function actually refer to something other than
the statistical measure of mode? From the console window’s autocomplete
it can be seen that the mode() function currently in the R environment
returns the type or storage mode of an object, which explains the response
of ‘numeric’ when passing it the vector acs$number_children. Looking up mode in
the Help window provides further details on this mode() function.
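A few calls make the behaviour of base R’s mode() concrete. The inputs are arbitrary examples:

```r
children <- c(0, 0, 2, 1, 0, 3)   # invented stand-in values

print(mode(children))          # "numeric": describes the type, not the data
print(storage.mode(children))  # "double": how the numbers are stored
print(storage.mode(1:3))       # "integer", although mode(1:3) is "numeric"
print(mode("a"))               # "character"
print(mode(sum))               # "function"
```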
It appears that base R has no built-in function for
the statistical mode, but surely there is an existing implementation in an R
package that can be brought into the current environment. A quick search
reveals the package DescTools, and its documentation, available
from https://cran.r-project.org/web/packages/DescTools/DescTools.pdf ,
indicates that Mode() calculates the most frequent value of a variable. This
package can be installed from the console with install.packages() and then needs
to be explicitly loaded into the current environment with library().
This library’s version of mode must be called as
Mode(), with a capital M, not to be confused with mode().
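A sketch of both routes on invented stand-in values. The helper stat_mode() is a hypothetical base-R function written for this example and is not part of DescTools; the DescTools call only runs if the package is already installed.

```r
children <- c(0, 0, 2, 1, 0, 3)   # invented stand-in values

# Hypothetical helper: returns the most frequent value(s) of a vector.
stat_mode <- function(x) {
  values <- unique(x)                    # distinct values, in first-seen order
  counts <- tabulate(match(x, values))   # occurrences of each distinct value
  values[counts == max(counts)]          # keep every value tied for the top
}

print(stat_mode(children))

# Package route: one-time install with install.packages("DescTools"),
# then call Mode() with a capital M.
if (requireNamespace("DescTools", quietly = TRUE)) {
  print(DescTools::Mode(children))
}
```

Returning every tied value is a design choice of this sketch, so a vector such as c(1, 1, 2, 2) yields both 1 and 2.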