4  stats-basic

5 Descriptive Statistics

Descriptive statistics is the term used when we are trying to describe the data in a meaningful way. This usually means presenting the data in a summarised or a visual-graphical form. Let us clarify what we mean by this with an example.

Suppose you are a teacher. You conducted tests in English, Science and Mathematics and have the following data of 30 students. Their name, roll number, age, gender, marks in three subjects (English, Science and Mathematics).

Roll No Gender English Science Mathematics
1 Male 88 92 84
2 Female 75 70 68
3 Male 65 60 72
4 Female 55 59 58
5 Male 45 49 51
6 Female 80 78 82
7 Male 42 47 40
8 Female 67 69 70
9 Male 85 87 89
10 Female 90 85 92
11 Male 33 38 36
12 Female 58 55 60
13 Male 60 62 63
14 Female 71 75 73
15 Male 48 45 42
16 Female 64 61 60
17 Male 50 48 46
18 Female 66 63 65
19 Male 38 44 41
20 Female 69 68 67
21 Male 77 79 80
22 Female 82 85 84
23 Male 43 47 45
24 Female 53 56 54
25 Male 91 93 95
26 Female 40 44 39
27 Male 87 89 86
28 Female 49 51 50
29 Male 46 42 48
30 Female 55 53 57

What does this table tell us at a glance? Not much, apart from telling the individual scores of the particular roll numbers. Typically such data are called as raw data. This data is arranged in a certain manner (by roll numbers) which does not tell us about the information we are concerned with. For example, consider the following questions:

  1. What are the highest and lowest marks in each subject?

  2. Which subject has the highest average? Which has the lowest?

  3. How many students are scoring below the passing mark (35) in each subject?

  4. What proportion of students are scoring between 35 and 50 (just passing)?

  5. How many students scored 80 or above in at least one subject?

  6. What is the distribution of total scores (English + Science + Mathematics), also by gender?

  7. What are the average scores in each subject, also by gender?

  8. What is the average mark in each subject (English, Science, Mathematics)?

  9. What is the ranking of students according to these test results?

We will see how we can organise, transform and compute this raw data to give us answers to above questions.

Some of these questions can be answered by sorting data in a different way. Right now it is being sorted by a label, the roll number, which doesn’t have any inherent meaning, it is neither ordinal or nominal. Well in some cases the roll number maybe based on alphabetical names, or order in which students were admitted to the class, but for our present case, it is just an identifier of the student.

Some questions can be answered by sorting and taking counts below or above a particular threshold (marks = 35, marks = 80 etc.). In remaining questions we need to take means of the data. Still further we need to group the data and then take the means.

⁉️ Identify which questions can be answered how? {#sec-⁉️-identify-which-questions-can-be-answered-how}

Let us load this data in R. We will use the csv (comma-separated value) file. To load this file you can use this command

This command will import the data into R and store it in an object called classroom_data. Let us understand what each line and command does. The first line library(here) uses a R package called here which allows relative paths to be given. The package constructs file paths relative to the project root (usually where your .Rproj file or Quarto project root is present. This way problems of working in different directories is sorted. (Your computer folder structure will not be same as mine.) More on this later.

To actually read the csv file we use the function read.csv(...) which is a base R function, that means it comes with default installation of R and you do not need any library to be installed to use it. Then the actual folders and files: "data" is the name of the folder and "class-01.csv" is the name of the CSV file inside this folder. Finally class.data <- ... stores the result of reading the CSV file into an object called classroom_data.

Now that we have our data loaded in R as classroom_data let us see this data. To see the data in R we simply type the object name.

Now the object containing our data classroom_data is stored in a particular manner. It has the class as data.frame. You can check this using the command class(...).

library(readr)
classroom_data <- read_csv("data/class-data.csv")
Rows: 30 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Gender
dbl (4): Roll No, English, Science, Mathematics

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# class(classroom_data)

In R, a dataframe is a fundamental data structure used to store data as a table, just like a spreadsheet. Each column in a dataframe can hold values of different type of variables such as numbers, text, or categories and each row represents an observation or record. Note that in a dataframe all columns must contain the same number of rows.

There are various functions that are useful to access, manipulate and transform the datatable. We will see some of them now.

Now each column of the dataframe can have different data types. For example the first column is the Roll number, let us check its data type. For this we will need to tell the class(...) function that we only want to know the data type of the first column. There are two basic ways of doing this. Both can be useful depending on the situation.

Case 1: By column name

In this case we specify the column name of the dataframe. For example,

class(classroom_data$Roll.No)
Warning: Unknown or uninitialised column: `Roll.No`.
[1] "NULL"

Note the syntax. We used a $ sign after the dataframe name and then added the column name. The autocomplete feature in R studio helps a lot in this. Once the dataframe name is typed and $ added, it will show all the column names as shown below.

Autocomplete feature in R studio.

You can just select the required column.

Case 2: By column number

In this case we specify the column number of the dataframe. There are multiple ways to achieve this

class(classroom_data[[1]])
[1] "numeric"
#returns the first column as a vector


class(classroom_data[, 1])
[1] "tbl_df"     "tbl"        "data.frame"
#same as above, returns the first column

Note that we use [[...]] two square brackets to specify the column. In the second case, [, 1] we are selecting the first column. This format can also be used to select rows or specific cells. The table below shows how we can access different elements of the dataframe by this format.

Syntax Meaning Returns
df[i, j] i-th row and j-th column A single value, the i-j-th cell.
df[i, ] i-th row, all columns A row (as a data frame)
df[, j] all rows, j-th column A column (vector or data frame)
df[, 2:5] all rows, columns 2 to 5 A sub-data-frame
df[[j]] j-th column A vector (drops dimensions)
df[, "name"] column by name Same as df[[j]] if exact match

The commands in the table above can be used in a variety of ways to extract required data, either as columns or as sub-datasets. For example, if we only want a table that has roll number and science scores we can again use the column names or column numbers.

Let us check the output of this command which takes out values of Roll.No (Column 1) and science scores (Column 4).

 classroom_data[, c(1,4)]
# A tibble: 30 × 2
   `Roll No` Science
       <dbl>   <dbl>
 1         1      92
 2         2      70
 3         3      60
 4         4      59
 5         5      49
 6         6      78
 7         7      47
 8         8      69
 9         9      87
10        10      85
# ℹ 20 more rows

What we have done here is to create a vector using the “combine” or “concatenate” function c(...) . This function will create a vector with entries that we have given, namely 1 and 4. Thus our command [, c(1,4)] will give us

  • All rows (because nothing is specified before the comma),

  • Only columns 1 and 4 (because c(1, 4) selects these two columns by position).

We can use similar syntax `c(...)` to get very specific data out of a dataframe.We can also use column/row names instead of numbers. In this the name string has to be inside quotes "name".

Note we cannot write [,1,4] it will give an unexpected answer.

Task

Use column names instead of column numbers to get the same subset of two columns.

But note that there are a total of three columns two columns of roll number and one for Science. This is clearly wrong. We had asked only for 2 columns! What is happening? The thing is that R adds row numbers also while displaying the data in this case. Notice that there is no column name in the first column. This feature is useful at times. But if we do not want this row numbers we can suppress this using

print(classroom_data[, c(1, 4)], row.names = FALSE)
# A tibble: 30 × 2
   `Roll No` Science
       <dbl>   <dbl>
 1         1      92
 2         2      70
 3         3      60
 4         4      59
 5         5      49
 6         6      78
 7         7      47
 8         8      69
 9         9      87
10        10      85
# ℹ 20 more rows

Enough of R commands, let us do some statistics on the data set. At the beginning we asked some questions about the classroom data.

Let us start with the first one

What are the highest and lowest marks in each subject?

For this we have two dedicated functions in R min(...) and max(...) to find out minimum and maximum values in a given data. The syntax is min(x) where x is set of the data values. In our case for each subject, we have column names. These can be seen with the command names(...):

names(classroom_data)
[1] "Roll No"     "Gender"      "English"     "Science"     "Mathematics"

Thus, to get maximum values of English scores we can write

max(classroom_data$English)
[1] 91

Similarly, for minimum value we can write

min(classroom_data$English)
[1] 33
Task

Find out minimum and maximum values of Science and Mathematics scores. For science scores use the column name method, and for mathematics use column number method.

We can also save the output of the commands like min(...) to a specific variable. For example, let us store the minimum and maximum scores in English with variables eng_min and eng_max

eng_min <- min(classroom_data$English)
eng_max <- max(classroom_data$English)

Note that the above code block does not produce any results. This is because we have only computed and stored the minimum and maximum values to the variables eng_min and eng_max. To see these values we can just type the variable names.

eng_min
[1] 33
eng_max
[1] 91
Task

Store min and max values of Science and Mathematics scores in suitably named variables.

Now let us see how we can present the results in a nice table using package called as knitr(...). For this we will use slightly different approach

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
summary_science <- classroom_data |>
  summarise(
    Min = min(Science),
    Max = max(Science),
    Mean = mean(Science)
  )

knitr::kable(summary_science, caption = "Science Scores Summary")
Science Scores Summary
Min Max Mean
38 93 63.13333

Here we have used one of the most powerful functions in R. The summarise(...) function, also works with summarize(...) in American English. It is used create summary statistics from a data frame. summarise(...) collapses multiple values into a single summary value. In our case we are applying it to a column in a dataframe.

Also we have added another function mean(...) to the set, this gives us the mean value of the set (in this case the science scores). The mean(...) function gives 5 decimal points in the computation. We need to round this to 2 decimals. For this we can use the round(...) function. The round(x, digits = n) function takes two arguments, first the numeric object x to be rounded and second, the number of decimal places needed n. The default number of decimal places to round to is 0. For example, `round(3.14159, digits = 2)` gives us 3.14.

To create a summary table for all subjects, the code can be like as follows:

summary_scores <- classroom_data |>
  summarise(
    Science_Min = min(Science),
    Science_Max = max(Science),
    Science_Mean = round(mean(Science),2),
    English_Min = min(English),
    English_Max = max(English),
    English_Mean = round(mean(English),2),
    Maths_Min = min(Mathematics),
    Maths_Max = max(Mathematics),
    Maths_Mean = round(mean(Mathematics),2)
  ) |>
  pivot_longer(everything(),
               names_to = c("Subject", ".value"),
               names_sep = "_")

knitr::kable(summary_scores, caption = "Scores Summary for All Subjects")
Scores Summary for All Subjects
Subject Min Max Mean
Science 38 93 63.13
English 33 91 62.40
Maths 36 95 63.23

We can make this block of computations more efficient by using loops. Suppose you have 30 columns in your table, you are certainly not going to write each individual column name. The same table can be computed as under

summary_all <- classroom_data |>
  summarise(across(
    .cols = c(Science, English, Mathematics),
    .fns = list(
      Min = min,
      Max = max,
      Mean = ~round(mean(.x), 2)  # rounding here
    ),
    .names = "{.col}_{.fn}"
  )) |>
  pivot_longer(everything(),
               names_to = c("Subject", ".value"),
               names_sep = "_")

knitr::kable(summary_all, caption = "Scores Summary for All Subjects")
Scores Summary for All Subjects
Subject Min Max Mean
Science 38 93 63.13
English 33 91 62.40
Mathematics 36 95 63.23

Let us understand this across() applies min, max, and mean to the selected columns (Science, English, Mathematics)

.names = "{.col}_{.fn} ensures consistent naming for pivot_longer to separate by _

pivot_longer() then reshapes it to a tidy format for your table.

The pivot_longer() function from the tidyr package is used to transform data from wide format to long format. It collapses multiple columns into key-value pairs, which is especially useful for tidy data principles, where each variable should have its own column and each observation its own row. The syntax is as follows pivot_longer(data, cols, names_to, values_to). Here

  • data: the data frame.
  • cols: columns to be collapsed.
  • names_to: name of the new column that will hold the former column names.
  • values_to: name of the column that will hold the values.

Another useful function is the summary(...) function. This function gives a statistical summary of different columns in a dataframe. The type of summary depends on the datatype of the column. For example, to get summary of classroom_data we can use

summary(classroom_data)
    Roll No         Gender             English         Science     
 Min.   : 1.00   Length:30          Min.   :33.00   Min.   :38.00  
 1st Qu.: 8.25   Class :character   1st Qu.:48.25   1st Qu.:48.25  
 Median :15.50   Mode  :character   Median :62.00   Median :60.50  
 Mean   :15.50                      Mean   :62.40   Mean   :63.13  
 3rd Qu.:22.75                      3rd Qu.:76.50   3rd Qu.:77.25  
 Max.   :30.00                      Max.   :91.00   Max.   :93.00  
  Mathematics   
 Min.   :36.00  
 1st Qu.:48.50  
 Median :61.50  
 Mean   :63.23  
 3rd Qu.:78.25  
 Max.   :95.00  

This statistical summary has the minimum, median, mean, maximum and first and third quantiles for numerical data. For text data it gives length of the column (30) and type of data. Check this summary against the data that we have calculated earlier.

We can also use the computations or objects right in the text. For example, we can call the mean the mean score of science that we have computed earlier, in this sentence. To use the inline R values we simple put them in quotes like this (this quote is the left of “1” of your keboard).

hist(classroom_data$English)

6 One Variable Analysis

6.1 Frequency distributions

6.2 Central Tendencies

Central Tendency and Variability

Measures of Center

Mode: mode, crude mode, and refined mode.

Median: median, rough median, and exact median.

Mean: mean, grouped mean, weighted mean, pooled mean, and mean of dichotomous variable.

Other order measures: midextreme (midrange), midhinge, trimean, and biweight.

Other means: trimmed mean, winsorized mean, and midmean; geometric mean, harmonic mean, generalized mean, and quadratic mean.

Measures of Spread

Numeric: mean deviation (average deviation), population variance, population standard deviation, sample variance, sample standard deviation, pooled variance, variance of dichotomous variable, coefficient of variation (coefficient of relative variation), and Gini’s mean difference.

Ordinal: range, interquartile range (midspread), quartile deviation (semi-interquartile range, quartile range), coefficient of quartile variation, median absolute deviation, coefficient of dispersion, and Leik’s D. Nominal: variation ratio, index of diversity, index of qualitative variation, entropy, and standardized entropy.

Sampling distributions: variance of sampling distribution of means, standard error of the mean (standard deviation of sampling distribution of means), standard error of a proportion, and sampling error.

6.3 Summaries: Tabular and Graphical

Statistical Graphics for Univariate and Bivariate Data

Statistical Graphics for Visualizing Multivariate Data

# Relating Statistics and Experimental Design : An Introduction

Methods of Randomization in Experimental Design 

6.4 Spread

6.5 Tables

6.6 Graphs

6.7 Normal and abnormal?

7 Two or more Variables

7.1 Tables

8 Inferential Statistics

8.1 Reliability and validity

Research Designs

Introduction to Survey Sampling

Achievement Testing: Recent Advances

Using Published Data: Errors and Remedies

Secondary Analysis of Survey Data

Bayesian Statistical Inference

Cluster Analysis

Models for Innovation Diffusion 

Meta-Analysis: Quantitative Methods for Research Synthesis

Multiple Comparisons 

Analysis of nominal data

Analysis of ordinal data