Descriptive statistics is the term used when we are trying to describe the data in a summarised form. But this sounds like a circular definition. Let us clarify what we mean by this with an example.
Suppose we have the following data for 30 students in a classroom: for each student, the roll number, gender, and marks in three subjects (English, Science and Mathematics).
Roll No | Gender | English | Science | Mathematics |
---|---|---|---|---|
1 | Male | 88 | 92 | 84 |
2 | Female | 75 | 70 | 68 |
3 | Male | 65 | 60 | 72 |
4 | Female | 55 | 59 | 58 |
5 | Male | 45 | 49 | 51 |
6 | Female | 80 | 78 | 82 |
7 | Male | 42 | 47 | 40 |
8 | Female | 67 | 69 | 70 |
9 | Male | 85 | 87 | 89 |
10 | Female | 90 | 85 | 92 |
11 | Male | 33 | 38 | 36 |
12 | Female | 58 | 55 | 60 |
13 | Male | 60 | 62 | 63 |
14 | Female | 71 | 75 | 73 |
15 | Male | 48 | 45 | 42 |
16 | Female | 64 | 61 | 60 |
17 | Male | 50 | 48 | 46 |
18 | Female | 66 | 63 | 65 |
19 | Male | 38 | 44 | 41 |
20 | Female | 69 | 68 | 67 |
21 | Male | 77 | 79 | 80 |
22 | Female | 82 | 85 | 84 |
23 | Male | 43 | 47 | 45 |
24 | Female | 53 | 56 | 54 |
25 | Male | 91 | 93 | 95 |
26 | Female | 40 | 44 | 39 |
27 | Male | 87 | 89 | 86 |
28 | Female | 49 | 51 | 50 |
29 | Male | 46 | 42 | 48 |
30 | Female | 55 | 53 | 57 |
What does this table tell us at a glance? Such data are typically called raw data. The data are arranged by roll number, an order that tells us little about the quantities we are interested in. For example, consider the following questions:
What are the highest and lowest marks in each subject?
Which subject has the highest average? Which has the lowest?
How many students are scoring below the passing mark (35) in each subject?
What proportion of students are scoring between 35 and 50 (just passing)?
How many students scored 80 or above in at least one subject?
What is the distribution of total scores (English + Science + Mathematics), also by gender?
What are the average scores in each subject, also by gender?
What is the average mark in each subject (English, Science, Mathematics)?
Some of these questions can be answered by sorting the data in a different way. Right now the data are sorted by a label, the roll number, which has no inherent meaning here: it does not measure anything about the student. In some cases the roll number may be based on the alphabetical order of names, or on the order in which students were admitted to the class, but in our present case it is just an identifier of the student.
Some questions can be answered by sorting and counting the values below or above a particular threshold (marks = 35, marks = 80, etc.). For the remaining questions we need to take means of the data, and for some of them we first need to group the data and then take the means.
Let us load this data in R. We will use a CSV (comma-separated values) file. To load this file you can use the following commands:
library(here)
here() starts at /builds/the-mitr/r-edu
classroom_data <- read.csv(here("data", "class-01.csv"))
#View(classroom_data)
These commands import the data into R and store it in an object called `classroom_data`. Let us understand what each line and command does. The first line, `library(here)`, loads an R package called `here`, which allows relative paths to be given. The package constructs file paths relative to the project root (usually where your .Rproj file or Quarto project root is present). This avoids the problems that arise when working in different directories. (Your computer's folder structure will not be the same as mine.) The line `here() starts at ...` is a message printed when the package is loaded; it reports the project root that was detected. More on this later.
To actually read the CSV file we use the function `read.csv(...)`, which is a base R function: it comes with the default installation of R and you do not need to install any package to use it. Then come the actual folders and files: `"data"` is the name of the folder and `"class-01.csv"` is the name of the CSV file inside this folder. Finally, `classroom_data <- ...` stores the result of reading the CSV file in an object called `classroom_data`.
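As a small, self-contained sketch of the reading step, the block below writes a three-row CSV to a temporary folder and reads it back with `read.csv()`. It uses base R's `file.path()` in place of `here()` so that it runs without any extra package; the folder, the file name `class-sample.csv` and the header spelling "Roll No" are assumptions made only for illustration. It also suggests a likely reason why the column shown as "Roll No" in the table appears as `Roll.No` in R.

```r
# Build a throwaway folder and CSV file (hypothetical location, for illustration only)
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "class-sample.csv")

writeLines(c(
  "Roll No,Gender,English,Science,Mathematics",
  "1,Male,88,92,84",
  "2,Female,75,70,68",
  "3,Male,65,60,72"
), csv_path)

# read.csv() returns a data frame; by default it converts column names
# into syntactically valid R names, so "Roll No" becomes "Roll.No"
sample_data <- read.csv(csv_path)
names(sample_data)
nrow(sample_data)
```

Because the path is built from its parts with `file.path()`, the same code works regardless of which separator the operating system uses.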
Organising data in a meaningful way is a habit that a good researcher should develop. This includes how you name and organise your files and folders. The following are some best practices for giving file names that are meaningful both to you and to R.
Do’s
- Use descriptive, meaningful names, for example `student_scores.csv` or `survey_data_2023.xlsx`.
- Use only lowercase letters, numbers, hyphens (`-`), or underscores (`_`) instead of spaces: `my_data.csv` or `class-01.csv`, not `My_Data.csv`.
- Use consistent naming conventions: choose either `student_marks.csv` or `student-marks.csv` and keep to that style.
- Include the correct file extension (`.csv`, `.xlsx`, `.R`, `.qmd`, `.txt`, etc.). Applications will not understand the type of file otherwise.
- Use `here()` or `file.path()` to refer to file paths. Example: `read.csv(here("data", "class-01.csv"))`
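Don’ts
- Avoid spaces in file names such as `student scores.csv`. Use `student_scores.csv` or `student-scores.csv` instead.
- Avoid special characters such as `! @ # $ % ^ & * ( ) ~ ? > < ,`.
- Avoid starting a file name with a number, as in `2023_report.csv`. Prefer `report_2023.csv`.
- Avoid absolute paths such as `read.csv("C:/Users/yourname/Documents/file.csv")`. Use `here()` instead.
- Avoid names that differ only in letter case: `Data.csv` and `data.csv` are treated differently, so be careful about letter case in file names. It is best to use all lowercase letters to avoid any confusion.

Now that we have our data loaded in R as `classroom_data`, let us look at it. To see the data in R we simply type the object name.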
classroom_data
Roll.No Gender English Science Mathematics
1 1 Male 88 92 84
2 2 Female 75 70 68
3 3 Male 65 60 72
4 4 Female 55 59 58
5 5 Male 45 49 51
6 6 Female 80 78 82
7 7 Male 42 47 40
8 8 Female 67 69 70
9 9 Male 85 87 89
10 10 Female 90 85 92
11 11 Male 33 38 36
12 12 Female 58 55 60
13 13 Male 60 62 63
14 14 Female 71 75 73
15 15 Male 48 45 42
16 16 Female 64 61 60
17 17 Male 50 48 46
18 18 Female 66 63 65
19 19 Male 38 44 41
20 20 Female 69 68 67
21 21 Male 77 79 80
22 22 Female 82 85 84
23 23 Male 43 47 45
24 24 Female 53 56 54
25 25 Male 91 93 95
26 26 Female 40 44 39
27 27 Male 87 89 86
28 28 Female 49 51 50
29 29 Male 46 42 48
30 30 Female 55 53 57
The object containing our data, `classroom_data`, is stored in a particular manner: its class is `data.frame`. You can check this using the command `class(...)`.
class(classroom_data)
[1] "data.frame"
In R, a dataframe is a fundamental data structure used to store data as a table, just like a spreadsheet. Each column in a dataframe holds the values of one variable, and different columns can hold different types of data, such as numbers, text, or categories. Each row represents an observation or record. Note that all columns in a dataframe must have the same number of rows.
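To make this description concrete, here is a minimal sketch that builds a tiny dataframe by hand. The object name `students` and its three rows are made up for illustration (the values are copied from the first rows of the table above).

```r
# A small data frame built by hand: each argument becomes one column
students <- data.frame(
  Roll.No = c(1L, 2L, 3L),
  Gender  = c("Male", "Female", "Male"),
  English = c(88, 75, 65)
)

dim(students)   # number of rows, then number of columns
str(students)   # the data type of each column

# Columns of unequal length are rejected with an error,
# which try() captures here so that the script keeps running
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")
```

The last lines illustrate the rule that all columns must have the same number of rows: R refuses to build a dataframe from a column of length 3 and a column of length 2.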
There are various functions that are useful to access, manipulate and transform a dataframe. We will see some of them now.
Each column of the dataframe can have a different data type. For example, the first column is the roll number; let us check its data type. For this we need to tell the `class(...)` function that we only want to know the data type of the first column. There are two basic ways of doing this. Both can be useful depending on the situation.
Case 1: By column name
In this case we specify the column name of the dataframe. For example,
class(classroom_data$Roll.No)
[1] "integer"
Note the syntax. We used a `$` sign after the dataframe name and then added the column name. The autocomplete feature in RStudio helps a lot with this: once the dataframe name is typed and `$` is added, it shows all the column names, and you can just select the required column.
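As a short sketch of access by name, the block below uses a hand-made three-row dataframe (the name `students` is hypothetical). It shows that `$` and `[["name"]]` return the same column, and that `sapply()` can apply `class()` to every column in one step.

```r
students <- data.frame(
  Roll.No = 1:3,
  Gender  = c("Male", "Female", "Male"),
  English = c(88, 75, 65)
)

# Two equivalent ways of picking a column by its name
by_dollar  <- students$Gender
by_bracket <- students[["Gender"]]
identical(by_dollar, by_bracket)

# class() applied to every column at once; the result is a named vector
col_classes <- sapply(students, class)
col_classes
```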
Case 2: By column number
In this case we specify the column number of the dataframe. There are multiple ways to achieve this:
class(classroom_data[[1]])  # returns the first column as a vector
[1] "integer"
class(classroom_data[, 1])  # same as above, returns the first column
[1] "integer"
Note that in the first case we use two square brackets, `[[...]]`, to specify the column. In the second case, `[, 1]`, we select the first column by leaving the row position (before the comma) empty. This format can also be used to select rows or specific cells. The table below shows how we can access different elements of the dataframe in this way.
Syntax | Meaning | Returns |
---|---|---|
`df[i, j]` | i-th row and j-th column | A single value, the (i, j)-th cell |
`df[i, ]` | i-th row, all columns | A row (as a data frame) |
`df[, j]` | all rows, j-th column | A column (vector or data frame) |
`df[, 2:5]` | all rows, columns 2 to 5 | A sub-data-frame |
`df[[j]]` | j-th column | A vector (drops dimensions) |
`df[, "name"]` | column by name | Same as `df[[j]]` if exact match |
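The rows of this table can be tried out directly. The sketch below uses a five-row dataframe called `df`, re-entered by hand from the first five rows of the classroom table; it stands in for the full data so that the block runs on its own.

```r
df <- data.frame(
  Roll.No     = 1:5,
  Gender      = c("Male", "Female", "Male", "Female", "Male"),
  English     = c(88, 75, 65, 55, 45),
  Science     = c(92, 70, 60, 59, 49),
  Mathematics = c(84, 68, 72, 58, 51)
)

cell    <- df[2, 3]          # row 2, column 3: a single value
row2    <- df[2, ]           # row 2, all columns: a one-row data frame
col3    <- df[, 3]           # column 3: a plain vector for a base data.frame
sub     <- df[, 2:5]         # columns 2 to 5: a smaller data frame
vec     <- df[[3]]           # column 3 as a vector
by_name <- df[, "English"]   # the same column, selected by name

cell
class(row2)
class(sub)
identical(col3, vec)
```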
The commands in the table above can be used in a variety of ways to extract required data, either as columns or as sub-datasets. For example, if we only want a table that has roll number and science scores we can again use the column names or column numbers.
Let us check the output of this command which takes out values of Roll.No (Column 1) and science scores (Column 4).
classroom_data[, c(1,4)]
Roll.No Science
1 1 92
2 2 70
3 3 60
4 4 59
5 5 49
6 6 78
7 7 47
8 8 69
9 9 87
10 10 85
11 11 38
12 12 55
13 13 62
14 14 75
15 15 45
16 16 61
17 17 48
18 18 63
19 19 44
20 20 68
21 21 79
22 22 85
23 23 47
24 24 56
25 25 93
26 26 44
27 27 89
28 28 51
29 29 42
30 30 53
What we have done here is to create a vector using the “combine” or “concatenate” function `c(...)`. This function creates a vector with the entries that we have given, namely 1 and 4. Thus our command `[, c(1,4)]` will give us
- All rows (because nothing is specified before the comma),
- Only columns 1 and 4 (because `c(1, 4)` selects these two columns by position).

We can use similar syntax with `c(...)` to get very specific data out of a dataframe. We can also use column or row names instead of numbers. In that case each name has to be a string inside quotes, as in "name".
Note that we cannot write `[,1,4]`; R does not read this as columns 1 and 4, and it will give an unexpected answer.
But note that the output appears to have three columns: two with roll numbers and one for Science. This looks wrong, since we asked for only two columns! What is happening? R also displays the row names (here simply the row numbers) when printing a dataframe. Notice that the first column has no column name. This feature is useful at times, but if we do not want these row numbers we can suppress them using
print(classroom_data[, c(1, 4)], row.names = FALSE)
Roll.No Science
1 92
2 70
3 60
4 59
5 49
6 78
7 47
8 69
9 87
10 85
11 38
12 55
13 62
14 75
15 45
16 61
17 48
18 63
19 44
20 68
21 79
22 85
23 47
24 56
25 93
26 44
27 89
28 51
29 42
30 53
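Selection by column name, mentioned earlier, can be sketched in the same way. The five-row dataframe `df` below is re-entered by hand from the classroom table and is only a stand-in for the full data.

```r
df <- data.frame(
  Roll.No     = 1:5,
  Gender      = c("Male", "Female", "Male", "Female", "Male"),
  English     = c(88, 75, 65, 55, 45),
  Science     = c(92, 70, 60, 59, 49),
  Mathematics = c(84, 68, 72, 58, 51)
)

# The same two columns, selected by name and by position
by_name <- df[, c("Roll.No", "Science")]
by_pos  <- df[, c(1, 4)]
identical(by_name, by_pos)

# Rows and columns can be restricted together: first three rows only
first_three <- df[1:3, c("Roll.No", "Science")]
print(first_three, row.names = FALSE)
```

Selecting by name is usually the safer choice, because the code keeps working if the order of the columns in the file changes.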
Enough of R commands; let us do some statistics on the data set. At the beginning we asked some questions about the classroom data.
Let us start with the first one: what are the highest and lowest marks in each subject?
For this we have two dedicated functions in R, `min(...)` and `max(...)`, which find the minimum and maximum values in given data. The syntax is `min(x)`, where `x` is the set of data values. In our case, each subject is a column with its own name. The column names can be seen with the command `names(...)`:
names(classroom_data)
[1] "Roll.No" "Gender" "English" "Science" "Mathematics"
Thus, to get maximum values of English scores we can write
max(classroom_data$English)
[1] 91
Similarly, for minimum value we can write
min(classroom_data$English)
[1] 33
We can also save the output of commands like `min(...)` to a variable. For example, let us store the minimum and maximum scores in English in the variables `eng_min` and `eng_max`:
eng_min <- min(classroom_data$English)
eng_max <- max(classroom_data$English)
Note that the above code block does not print any results. This is because we have only computed the minimum and maximum values and stored them in the variables `eng_min` and `eng_max`. To see these values we can just type the variable names.
eng_min
[1] 33
eng_max
[1] 91
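Since the question asks about every subject, the same two functions can be applied to all three mark columns at once. The sketch below re-enters the classroom table by hand so that it runs without the CSV file; `sapply()` is one of several ways to repeat a function over columns.

```r
# The classroom table, typed in from the listing above
classroom_data <- data.frame(
  Roll.No = 1:30,
  Gender  = rep(c("Male", "Female"), times = 15),
  English = c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90, 33, 58, 60, 71, 48,
              64, 50, 66, 38, 69, 77, 82, 43, 53, 91, 40, 87, 49, 46, 55),
  Science = c(92, 70, 60, 59, 49, 78, 47, 69, 87, 85, 38, 55, 62, 75, 45,
              61, 48, 63, 44, 68, 79, 85, 47, 56, 93, 44, 89, 51, 42, 53),
  Mathematics = c(84, 68, 72, 58, 51, 82, 40, 70, 89, 92, 36, 60, 63, 73, 42,
                  60, 46, 65, 41, 67, 80, 84, 45, 54, 95, 39, 86, 50, 48, 57)
)

subjects <- c("English", "Science", "Mathematics")

# Apply min() and max() to each subject column
lowest  <- sapply(classroom_data[, subjects], min)
highest <- sapply(classroom_data[, subjects], max)
rbind(lowest, highest)

# range() returns the minimum and maximum of one column in a single call
range(classroom_data$English)
```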
Now let us see how we can present the results in a nice table using the `kable(...)` function from a package called `knitr`. For this we will use a slightly different approach:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
classroom_data |>
  summarise(
    science_min = min(Science),
    science_max = max(Science),
    science_mean = mean(Science)
  ) |>
  knitr::kable(caption = "Minimum and Maximum Scores by Subject")
science_min | science_max | science_mean |
---|---|---|
38 | 93 | 63.13333 |
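Here we have used one of the most useful functions of the `dplyr` package, which is loaded as part of the tidyverse: `summarise(...)` (the American English spelling `summarize(...)` also works). It is used to create summary statistics from a data frame. `summarise(...)` collapses multiple values into a single summary value. In our case we are applying it to one column of the dataframe, Science. The `|>` symbol is the pipe operator: it passes the result on its left as the first argument to the function on its right.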
Another useful function is `summary(...)`. This function gives a statistical summary of the columns in a dataframe. The type of summary depends on the data type of each column. For example, to get a summary of `classroom_data` we can use
summary(classroom_data)
Roll.No Gender English Science
Min. : 1.00 Length:30 Min. :33.00 Min. :38.00
1st Qu.: 8.25 Class :character 1st Qu.:48.25 1st Qu.:48.25
Median :15.50 Mode :character Median :62.00 Median :60.50
Mean :15.50 Mean :62.40 Mean :63.13
3rd Qu.:22.75 3rd Qu.:76.50 3rd Qu.:77.25
Max. :30.00 Max. :91.00 Max. :93.00
Mathematics
Min. :36.00
1st Qu.:48.50
Median :61.50
Mean :63.23
3rd Qu.:78.25
Max. :95.00
This statistical summary gives the minimum, first quartile, median, mean, third quartile and maximum for numerical columns. For text columns it gives the length of the column (30) and the type of data.
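The threshold questions asked at the beginning (how many students are below 35, what proportion lies between 35 and 50, how many scored 80 or above in at least one subject) can be approached by counting logical values. The following is a sketch under two assumptions: the data are re-entered by hand from the table, and "between 35 and 50" is taken to include both end points.

```r
classroom_data <- data.frame(
  Roll.No = 1:30,
  Gender  = rep(c("Male", "Female"), times = 15),
  English = c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90, 33, 58, 60, 71, 48,
              64, 50, 66, 38, 69, 77, 82, 43, 53, 91, 40, 87, 49, 46, 55),
  Science = c(92, 70, 60, 59, 49, 78, 47, 69, 87, 85, 38, 55, 62, 75, 45,
              61, 48, 63, 44, 68, 79, 85, 47, 56, 93, 44, 89, 51, 42, 53),
  Mathematics = c(84, 68, 72, 58, 51, 82, 40, 70, 89, 92, 36, 60, 63, 73, 42,
                  60, 46, 65, 41, 67, 80, 84, 45, 54, 95, 39, 86, 50, 48, 57)
)
marks <- classroom_data[, c("English", "Science", "Mathematics")]

# A comparison gives TRUE/FALSE values; summing them counts the TRUEs
below_pass <- colSums(marks < 35)
below_pass

# The mean of TRUE/FALSE values is a proportion (35 and 50 both included)
just_passing <- colMeans(marks >= 35 & marks <= 50)
just_passing

# Students with 80 or more in at least one subject
high_any <- rowSums(marks >= 80) > 0
sum(high_any)
classroom_data$Roll.No[high_any]
```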
hist(classroom_data$English)
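The same tools address the question about the distribution of total scores. In the sketch below the column name `Total` is introduced only for illustration, the data are re-entered by hand from the table, and `hist()` is called with `plot = FALSE` so that the bin counts can be inspected without drawing a figure.

```r
classroom_data <- data.frame(
  Roll.No = 1:30,
  Gender  = rep(c("Male", "Female"), times = 15),
  English = c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90, 33, 58, 60, 71, 48,
              64, 50, 66, 38, 69, 77, 82, 43, 53, 91, 40, 87, 49, 46, 55),
  Science = c(92, 70, 60, 59, 49, 78, 47, 69, 87, 85, 38, 55, 62, 75, 45,
              61, 48, 63, 44, 68, 79, 85, 47, 56, 93, 44, 89, 51, 42, 53),
  Mathematics = c(84, 68, 72, 58, 51, 82, 40, 70, 89, 92, 36, 60, 63, 73, 42,
                  60, 46, 65, 41, 67, 80, 84, 45, 54, 95, 39, 86, 50, 48, 57)
)

# Total score per student (hypothetical new column)
classroom_data$Total <- classroom_data$English +
  classroom_data$Science + classroom_data$Mathematics

summary(classroom_data$Total)

# The same summary, computed separately for each gender
tapply(classroom_data$Total, classroom_data$Gender, summary)

# Histogram bins and counts without drawing the plot
h <- hist(classroom_data$Total, plot = FALSE)
h$counts

# Sorting by total instead of roll number: highest totals first
ranked <- classroom_data[order(classroom_data$Total, decreasing = TRUE), ]
head(ranked, 3)
```

Sorting by `Total` is an example of the re-ordering discussed at the start: the roll number stays attached to each row, but the order now carries information.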
Central Tendency and Variability
Measures of Center
Mode: mode, crude mode, and refined mode.
Median: median, rough median, and exact median.
Mean: mean, grouped mean, weighted mean, pooled mean, and mean of dichotomous variable.
Other order measures: midextreme (midrange), midhinge, trimean, and biweight.
Other means: trimmed mean, winsorized mean, and midmean; geometric mean, harmonic mean, generalized mean, and quadratic mean.
Measures of Spread
Numeric: mean deviation (average deviation), population variance, population standard deviation, sample variance, sample standard deviation, pooled variance, variance of dichotomous variable, coefficient of variation (coefficient of relative variation), and Gini’s mean difference.
Ordinal: range, interquartile range (midspread), quartile deviation (semi-interquartile range, quartile range), coefficient of quartile variation, median absolute deviation, coefficient of dispersion, and Leik’s D.
Nominal: variation ratio, index of diversity, index of qualitative variation, entropy, and standardized entropy.
Sampling distributions: variance of sampling distribution of means, standard error of the mean (standard deviation of sampling distribution of means), standard error of a proportion, and sampling error.
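A few of the measures listed above are available directly in base R. The block below is a brief sketch on the English marks (typed in from the classroom table); it covers only a handful of the listed measures, and the mode is found with `table()` because base R has no dedicated function for the statistical mode.

```r
english <- c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90, 33, 58, 60, 71, 48,
             64, 50, 66, 38, 69, 77, 82, 43, 53, 91, 40, 87, 49, 46, 55)
n <- length(english)

# Measures of center
mean(english)
median(english)
mean(english, trim = 0.1)                  # trimmed mean: drops 10% at each end
counts <- table(english)
modes  <- names(counts)[counts == max(counts)]  # most frequent value(s), as text
modes

# Measures of spread
sample_var <- var(english)                 # sample variance (divides by n - 1)
pop_var    <- sample_var * (n - 1) / n     # population variance (divides by n)
sd(english)                                # sample standard deviation
IQR(english)                               # interquartile range
mad(english)                               # median absolute deviation (scaled)
diff(range(english))                       # range: maximum minus minimum

# Standard error of the mean
sd(english) / sqrt(n)
```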
Statistical Graphics for Univariate and Bivariate Data
Statistical Graphics for Visualizing Multivariate Data
Relating Statistics and Experimental Design: An Introduction
Methods of Randomization in Experimental Design
Research Designs
Introduction to Survey Sampling
Achievement Testing: Recent Advances
Using Published Data: Errors and Remedies
Secondary Analysis of Survey Data
Bayesian Statistical Inference
Cluster Analysis
Models for Innovation Diffusion
Meta-Analysis: Quantitative Methods for Research Synthesis
Multiple Comparisons
Analysis of nominal data
Analysis of ordinal data