4  stats-basic

5 Descriptive Statistics

Descriptive statistics is the term used when we are trying to describe the data in a summarised form. But this sounds like a circular definition. Let us clarify what we mean by this with an example.

Suppose we have the following data for 30 students in a classroom: their roll number, gender, and marks in three subjects (English, Science and Mathematics).

Roll No Gender English Science Mathematics
1 Male 88 92 84
2 Female 75 70 68
3 Male 65 60 72
4 Female 55 59 58
5 Male 45 49 51
6 Female 80 78 82
7 Male 42 47 40
8 Female 67 69 70
9 Male 85 87 89
10 Female 90 85 92
11 Male 33 38 36
12 Female 58 55 60
13 Male 60 62 63
14 Female 71 75 73
15 Male 48 45 42
16 Female 64 61 60
17 Male 50 48 46
18 Female 66 63 65
19 Male 38 44 41
20 Female 69 68 67
21 Male 77 79 80
22 Female 82 85 84
23 Male 43 47 45
24 Female 53 56 54
25 Male 91 93 95
26 Female 40 44 39
27 Male 87 89 86
28 Female 49 51 50
29 Male 46 42 48
30 Female 55 53 57

What does this table tell us at a glance? Such data are typically called raw data. The table is arranged by roll number, an ordering that tells us little about the numbers we are actually interested in. For example, consider the following questions:

  1. What are the highest and lowest marks in each subject?

  2. Which subject has the highest average? Which has the lowest?

  3. How many students are scoring below the passing mark (35) in each subject?

  4. What proportion of students are scoring between 35 and 50 (just passing)?

  5. How many students scored 80 or above in at least one subject?

  6. What is the distribution of total scores (English + Science + Mathematics), also by gender?

  7. What are the average scores in each subject, also by gender?

  8. What is the average mark in each subject (English, Science, Mathematics)?

Some of these questions can be answered by sorting the data in a different way. Right now the data is sorted by a label, the roll number, which carries no inherent meaning here; we treat it as neither ordinal nor nominal. In some cases the roll number may be based on the alphabetical order of names, or on the order in which students were admitted to the class, but in our present case it is just an identifier of the student.

Some questions can be answered by sorting the data and counting the values below or above a particular threshold (marks = 35, marks = 80, etc.). For other questions we need to take means of the data. For still others we first need to group the data (for example, by gender) and then take the means.

⁉️ Identify which questions can be answered how?

Let us load this data in R. The data is stored in a CSV (comma-separated values) file. To load this file you can use the following commands:

library(here)
here() starts at /builds/the-mitr/r-edu
classroom_data <- read.csv(here("data", "class-01.csv"))
# View(classroom_data)

This command imports the data into R and stores it in an object called classroom_data. Let us understand what each line does. The first line, library(here), loads an R package called here, which lets us give file paths relative to the project. The package constructs file paths relative to the project root (usually the folder that contains your .Rproj file or your Quarto project). This way the problems of working in different directories are avoided. (The folder structure on your computer will not be the same as mine.) The output line beginning with here() starts at is a message from the package reporting the project root it found. More on this later.

To actually read the CSV file we use the function read.csv(...), which is a base R function: it comes with the default installation of R, and you do not need to install any package to use it. Inside here(...), "data" is the name of the folder and "class-01.csv" is the name of the CSV file inside this folder. Finally, classroom_data <- ... stores the result of reading the CSV file in an object called classroom_data.
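To see what read.csv(...) does without depending on any project folder, here is a minimal sketch that writes the first three rows of the table to a temporary file and reads them back. The temporary file and the object name demo_data are assumptions made only for this illustration; they are not part of the course data. Note how the header Roll No becomes Roll.No, because read.csv(...) by default converts column names into valid R names.

```r
# Write a tiny CSV to a temporary file (a hypothetical stand-in for class-01.csv)
csv_lines <- c(
  "Roll No,Gender,English,Science,Mathematics",
  "1,Male,88,92,84",
  "2,Female,75,70,68",
  "3,Male,65,60,72"
)
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# Read it back with base R; the header row becomes the column names
demo_data <- read.csv(tmp_file)

names(demo_data)  # "Roll No" has been turned into "Roll.No"
nrow(demo_data)   # number of data rows that were read
```

This also explains why the column appears as Roll.No whenever the data is printed in R.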

As a good researcher, you should develop the habit of organising data in a meaningful way. This includes how you name and organise your files and folders. The following are some best practices for giving filenames that are meaningful both to you and to R.

Do’s

  1. Use descriptive, meaningful names.
    • Example: student_scores.csv, survey_data_2023.xlsx

  2. Use only lowercase letters, numbers, hyphens (-), or underscores (_) instead of spaces.

    • Recommended: my_data.csv, class-01.csv
    • Avoid: My_Data.csv
  3. Use consistent naming conventions.

    • For example:
      • Snake case: student_marks.csv
      • Kebab case: student-marks.csv
  4. Include the correct file extension. Applications will not understand the type of file otherwise.

    • Use .csv, .xlsx, .R, .qmd, .txt, etc.
  5. Use here() or file.path() to refer to file paths

    • Example:

      read.csv(here("data", "class-01.csv"))

Don’ts

  1. Do not use spaces in filenames. I cannot stress this enough. Avoid spaces in filenames to save yourself a lot of trouble later.
    • Do not use: student scores.csv
    • Preferred: student_scores.csv or student-scores.csv
  2. Avoid special characters and punctuation. Some of these characters are reserved and can cause problems.
    • Do not use: ! @ # $ % ^ & * ( ) ~ ? > < ,
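  3. Avoid starting filenames with numbers. R can read such files, but a name that starts with a digit is not a valid R object name, which causes friction when filenames are reused as object names.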
    • Do not use: 2023_report.csv
    • Preferred: report_2023.csv
  4. Do not hardcode absolute paths.
    • Do not use: read.csv("C:/Users/yourname/Documents/file.csv")
    • Use relative paths or here() instead
  5. Do not rely on case-insensitivity.
    • On case-sensitive file systems (for example, most Linux systems), Data.csv and data.csv are different files, so be careful about letter case in file names. It is best to use all lowercase letters to avoid any confusion.
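Two of the points above can be checked directly in base R. The sketch below uses file.path(...) to build a relative path from its parts, and make.names(...) to show how R repairs a name that starts with a digit. The file names are taken from the examples above and are only illustrative; no file is actually opened.

```r
# Build a relative path from its parts instead of typing separators by hand
csv_path <- file.path("data", "class-01.csv")
csv_path

# A name starting with a digit is not a valid R object name;
# make.names() shows how R would repair such a name
fixed_name <- make.names("2023_report")
fixed_name

# A name that already starts with a letter is left unchanged
kept_name <- make.names("report_2023")
kept_name
```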

Now that we have our data loaded in R as classroom_data let us see this data. To see the data in R we simply type the object name.

classroom_data
   Roll.No Gender English Science Mathematics
1        1   Male      88      92          84
2        2 Female      75      70          68
3        3   Male      65      60          72
4        4 Female      55      59          58
5        5   Male      45      49          51
6        6 Female      80      78          82
7        7   Male      42      47          40
8        8 Female      67      69          70
9        9   Male      85      87          89
10      10 Female      90      85          92
11      11   Male      33      38          36
12      12 Female      58      55          60
13      13   Male      60      62          63
14      14 Female      71      75          73
15      15   Male      48      45          42
16      16 Female      64      61          60
17      17   Male      50      48          46
18      18 Female      66      63          65
19      19   Male      38      44          41
20      20 Female      69      68          67
21      21   Male      77      79          80
22      22 Female      82      85          84
23      23   Male      43      47          45
24      24 Female      53      56          54
25      25   Male      91      93          95
26      26 Female      40      44          39
27      27   Male      87      89          86
28      28 Female      49      51          50
29      29   Male      46      42          48
30      30 Female      55      53          57

The object containing our data, classroom_data, is stored in a particular manner: its class is data.frame. You can check this using the command class(...).

class(classroom_data)
[1] "data.frame"

In R, a dataframe is a fundamental data structure used to store data as a table, just like a spreadsheet. Different columns of a dataframe can hold different types of values, such as numbers, text, or categories, and each row represents an observation or record. Note that all columns in a dataframe must have the same number of rows.
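As a small illustration of these properties, the sketch below builds a three-row dataframe by hand from the first rows of the table. The object names students and bad are assumptions chosen for this example. It then shows that each column keeps its own type and that columns of unequal length are rejected.

```r
# A small dataframe: each column has its own type, all columns have 3 rows
students <- data.frame(
  Roll.No = c(1L, 2L, 3L),                 # integer
  Gender  = c("Male", "Female", "Male"),   # character
  English = c(88, 75, 65),                 # numeric
  stringsAsFactors = FALSE                 # keep text as character in older R versions too
)

sapply(students, class)  # the type of each column
dim(students)            # number of rows, number of columns

# Columns of unequal length are rejected; we capture the error message
bad <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
bad
```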

There are various functions that are useful to access, manipulate and transform a dataframe. We will see some of them now.

Now each column of the dataframe can have different data types. For example the first column is the Roll number, let us check its data type. For this we will need to tell the class(...) function that we only want to know the data type of the first column. There are two basic ways of doing this. Both can be useful depending on the situation.

Case 1: By column name

In this case we specify the column name of the dataframe. For example,

class(classroom_data$Roll.No)
[1] "integer"

Note the syntax. We used a $ sign after the dataframe name and then added the column name. The autocomplete feature in RStudio helps a lot here: once the dataframe name is typed and $ is added, it shows all the column names, as shown below.

Figure: Autocomplete feature in RStudio.

You can just select the required column.

Case 2: By column number

In this case we specify the column number of the dataframe. There are multiple ways to achieve this:

# returns the first column as a vector
class(classroom_data[[1]])
[1] "integer"

# same as above, returns the first column
class(classroom_data[, 1])
[1] "integer"

Note that in the first case we use double square brackets, [[...]], to specify the column. In the second case, [, 1], we leave the row position empty and select the first column. This second format can also be used to select rows or specific cells. The table below shows how we can access different elements of a dataframe in this way.

Syntax          Meaning                     Returns
df[i, j]        i-th row and j-th column    A single value: the cell in row i, column j
df[i, ]         i-th row, all columns       A row (as a data frame)
df[, j]         all rows, j-th column       A column (as a vector, by default)
df[, 2:5]       all rows, columns 2 to 5    A sub-data-frame
df[[j]]         j-th column                 A column (as a vector)
df[, "name"]    column by name              The named column as a vector, same as df[["name"]]
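Each row of this table can be tried out on a small example. The sketch below types in the first five rows of the classroom table by hand; the object name df is just the placeholder used in the table, not the course data object.

```r
# First five rows of the classroom table, typed in by hand
df <- data.frame(
  Roll.No     = 1:5,
  Gender      = c("Male", "Female", "Male", "Female", "Male"),
  English     = c(88, 75, 65, 55, 45),
  Science     = c(92, 70, 60, 59, 49),
  Mathematics = c(84, 68, 72, 58, 51)
)

df[2, 3]         # single cell: row 2, column 3 (English marks of roll number 2)
df[2, ]          # the whole second row, still a data frame
df[, 3]          # the third column as a vector
df[, 3:5]        # columns 3 to 5 as a smaller data frame
df[[4]]          # the fourth column as a vector
df[, "Science"]  # the same column, selected by name
```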

The commands in the table above can be used in a variety of ways to extract required data, either as columns or as sub-datasets. For example, if we only want a table that has roll number and science scores we can again use the column names or column numbers.

Let us check the output of this command which takes out values of Roll.No (Column 1) and science scores (Column 4).

 classroom_data[, c(1,4)]
   Roll.No Science
1        1      92
2        2      70
3        3      60
4        4      59
5        5      49
6        6      78
7        7      47
8        8      69
9        9      87
10      10      85
11      11      38
12      12      55
13      13      62
14      14      75
15      15      45
16      16      61
17      17      48
18      18      63
19      19      44
20      20      68
21      21      79
22      22      85
23      23      47
24      24      56
25      25      93
26      26      44
27      27      89
28      28      51
29      29      42
30      30      53

What we have done here is create a vector using the “combine” (or “concatenate”) function c(...). This function creates a vector from the entries we give it, here 1 and 4. Thus the index [, c(1,4)] gives us

  • All rows (because nothing is specified before the comma),

  • Only columns 1 and 4 (because c(1, 4) selects these two columns by position).

We can use c(...) in the same way to get very specific data out of a dataframe. We can also use column or row names instead of numbers. In that case each name has to be a string inside quotes, "name".

Note that we cannot write [,1,4]: R does not read this as “columns 1 and 4”, and it will give an unexpected answer.
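To make the role of c(...) concrete, here is a minimal sketch on a hand-typed copy of the first five rows of the table. The object names df, rows, cols and subset_df are assumptions for this example. It picks specific rows by number and specific columns by name at the same time.

```r
# First five rows of the classroom table, typed in by hand
df <- data.frame(
  Roll.No     = 1:5,
  Gender      = c("Male", "Female", "Male", "Female", "Male"),
  English     = c(88, 75, 65, 55, 45),
  Science     = c(92, 70, 60, 59, 49),
  Mathematics = c(84, 68, 72, 58, 51)
)

rows <- c(1, 3, 5)                  # a numeric vector of row numbers
cols <- c("Gender", "Mathematics")  # a character vector of column names

# Rows go before the comma, columns after it
subset_df <- df[rows, cols]
subset_df
```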

Task

Use column names instead of column numbers to get the same subset of two columns.

Look again at the output of classroom_data[, c(1, 4)] above. At first glance it seems to have three columns: two with roll numbers and one with Science scores. But we asked for only two columns! What is happening? When displaying a dataframe, R also prints the row names, which by default are the row numbers. Notice that this first column has no column name. This feature is useful at times, but if we do not want the row numbers we can suppress them using

print(classroom_data[, c(1, 4)], row.names = FALSE)
 Roll.No Science
       1      92
       2      70
       3      60
       4      59
       5      49
       6      78
       7      47
       8      69
       9      87
      10      85
      11      38
      12      55
      13      62
      14      75
      15      45
      16      61
      17      48
      18      63
      19      44
      20      68
      21      79
      22      85
      23      47
      24      56
      25      93
      26      44
      27      89
      28      51
      29      42
      30      53

Enough of R commands, let us do some statistics on the data set. At the beginning we asked some questions about the classroom data.

Let us start with the first one:

What are the highest and lowest marks in each subject?

For this R has two dedicated functions, min(...) and max(...), which find the minimum and maximum values in given data. The syntax is min(x), where x is the set of data values. In our case the values for each subject are in a column, and we can refer to it by its column name. The column names can be seen with the command names(...):

names(classroom_data)
[1] "Roll.No"     "Gender"      "English"     "Science"     "Mathematics"

Thus, to get maximum values of English scores we can write

max(classroom_data$English)
[1] 91

Similarly, for minimum value we can write

min(classroom_data$English)
[1] 33

Task

Find out minimum and maximum values of Science and Mathematics scores. For science scores use the column name method, and for mathematics use column number method.

We can also save the output of the commands like min(...) to a specific variable. For example, let us store the minimum and maximum scores in English with variables eng_min and eng_max

eng_min <- min(classroom_data$English)
eng_max <- max(classroom_data$English)

Note that the above code block does not produce any results. This is because we have only computed and stored the minimum and maximum values to the variables eng_min and eng_max. To see these values we can just type the variable names.

eng_min
[1] 33
eng_max
[1] 91

Task

Store min and max values of Science and Mathematics scores in suitably named variables.
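As a complement to min(...) and max(...), base R also offers range(...), which returns both values at once, and which.min(...) and which.max(...), which return the position of the lowest and highest value. The sketch below is self-contained: it retypes the English column from the table into a vector called english, a name assumed for this example. Because the rows are in roll-number order, the position is also the roll number.

```r
# English marks of the 30 students, in roll-number order (copied from the table)
english <- c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90,
             33, 58, 60, 71, 48, 64, 50, 66, 38, 69,
             77, 82, 43, 53, 91, 40, 87, 49, 46, 55)

eng_range <- range(english)   # minimum and maximum together
eng_range

lowest_at  <- which.min(english)  # position (here: roll number) of the lowest mark
highest_at <- which.max(english)  # position (here: roll number) of the highest mark
lowest_at
highest_at
```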

Now let us see how we can present the results in a nice table using the kable(...) function from a package called knitr. For this we will use a slightly different approach.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
classroom_data |>
  summarise(
    science_min = min(Science),
    science_max = max(Science),
    science_mean = mean(Science)
  ) |>
  knitr::kable(caption = "Minimum, Maximum and Mean of Science Scores")
Minimum, Maximum and Mean of Science Scores
science_min science_max science_mean
38 93 63.13333

Here we have used one of the most powerful functions in R, summarise(...) (the American English spelling summarize(...) also works), which comes from the dplyr package loaded as part of the tidyverse. It is used to create summary statistics from a dataframe: summarise(...) collapses multiple values into a single summary value. In our case we are applying it to one column, Science, of the dataframe.
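The same kind of summary can be produced for all three subjects at once. The sketch below is a base R alternative, not the tidyverse approach used above: it uses sapply(...) to apply one small function to every column. It is self-contained, so the marks are retyped from the table; the object names marks and subject_summary are assumptions for this example.

```r
# Marks of the 30 students, in roll-number order (copied from the table)
marks <- data.frame(
  English     = c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90,
                  33, 58, 60, 71, 48, 64, 50, 66, 38, 69,
                  77, 82, 43, 53, 91, 40, 87, 49, 46, 55),
  Science     = c(92, 70, 60, 59, 49, 78, 47, 69, 87, 85,
                  38, 55, 62, 75, 45, 61, 48, 63, 44, 68,
                  79, 85, 47, 56, 93, 44, 89, 51, 42, 53),
  Mathematics = c(84, 68, 72, 58, 51, 82, 40, 70, 89, 92,
                  36, 60, 63, 73, 42, 60, 46, 65, 41, 67,
                  80, 84, 45, 54, 95, 39, 86, 50, 48, 57)
)

# Apply the same three summaries to every column; the result is a matrix
# with one row per statistic and one column per subject
subject_summary <- sapply(
  marks,
  function(x) c(min = min(x), max = max(x), mean = mean(x))
)
round(subject_summary, 2)
```

Reading across the mean row answers the question of which subject has the highest and which the lowest average.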

Another useful function is summary(...). This function gives a statistical summary of the different columns in a dataframe. The type of summary depends on the data type of the column. For example, to get a summary of classroom_data we can use

summary(classroom_data)
    Roll.No         Gender             English         Science     
 Min.   : 1.00   Length:30          Min.   :33.00   Min.   :38.00  
 1st Qu.: 8.25   Class :character   1st Qu.:48.25   1st Qu.:48.25  
 Median :15.50   Mode  :character   Median :62.00   Median :60.50  
 Mean   :15.50                      Mean   :62.40   Mean   :63.13  
 3rd Qu.:22.75                      3rd Qu.:76.50   3rd Qu.:77.25  
 Max.   :30.00                      Max.   :91.00   Max.   :93.00  
  Mathematics   
 Min.   :36.00  
 1st Qu.:48.50  
 Median :61.50  
 Mean   :63.23  
 3rd Qu.:78.25  
 Max.   :95.00  

This statistical summary gives the minimum, first quartile, median, mean, third quartile and maximum for numerical data. For text data it gives the length of the column (30) and the type of data.
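Some of the questions asked at the beginning need counts, proportions or grouped means rather than a plain summary. The following is a minimal base R sketch of one way to get them, under the assumption that the thresholds are inclusive (35 to 50 means both ends are counted). The data is retyped from the table so that the example is self-contained, and the object names scores, subjects, below_pass, just_passing and mean_by_gender are chosen only for this illustration.

```r
# The classroom table, typed in by hand (odd roll numbers are Male, even are Female)
scores <- data.frame(
  Gender      = rep(c("Male", "Female"), times = 15),
  English     = c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90,
                  33, 58, 60, 71, 48, 64, 50, 66, 38, 69,
                  77, 82, 43, 53, 91, 40, 87, 49, 46, 55),
  Science     = c(92, 70, 60, 59, 49, 78, 47, 69, 87, 85,
                  38, 55, 62, 75, 45, 61, 48, 63, 44, 68,
                  79, 85, 47, 56, 93, 44, 89, 51, 42, 53),
  Mathematics = c(84, 68, 72, 58, 51, 82, 40, 70, 89, 92,
                  36, 60, 63, 73, 42, 60, 46, 65, 41, 67,
                  80, 84, 45, 54, 95, 39, 86, 50, 48, 57)
)
subjects <- c("English", "Science", "Mathematics")

# How many students are below the passing mark (35) in each subject?
# A comparison gives TRUE/FALSE values; summing them counts the TRUEs
below_pass <- colSums(scores[, subjects] < 35)
below_pass

# What proportion scored between 35 and 50 in each subject?
# The mean of TRUE/FALSE values is the proportion of TRUEs
just_passing <- colMeans(scores[, subjects] >= 35 & scores[, subjects] <= 50)
just_passing

# Average score in each subject, by gender
mean_by_gender <- aggregate(scores[, subjects],
                            by = list(Gender = scores$Gender),
                            FUN = mean)
mean_by_gender
```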

hist(classroom_data$English)
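The hist(...) call draws a histogram of the English marks. The counts behind such a plot can also be tabulated as text. The sketch below does this with cut(...) and table(...); the bin edges (30, 40, ..., 100) are an assumption chosen for illustration and need not match the bins that hist(...) picks by default. The marks are retyped from the table so that the example runs on its own.

```r
# English marks of the 30 students (copied from the table)
english <- c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90,
             33, 58, 60, 71, 48, 64, 50, 66, 38, 69,
             77, 82, 43, 53, 91, 40, 87, 49, 46, 55)

# Assign every mark to an interval such as (30,40], (40,50], ...
bins <- cut(english, breaks = seq(30, 100, by = 10))

# Count how many marks fall in each interval
freq <- table(bins)
freq
```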

6 One Variable Analysis

6.1 Frequency distributions

6.2 Central Tendencies

Central Tendency and Variability

Measures of Center

Mode: mode, crude mode, and refined mode.

Median: median, rough median, and exact median.

Mean: mean, grouped mean, weighted mean, pooled mean, and mean of dichotomous variable.

Other order measures: midextreme (midrange), midhinge, trimean, and biweight.

Other means: trimmed mean, winsorized mean, and midmean; geometric mean, harmonic mean, generalized mean, and quadratic mean.

Measures of Spread

Numeric: mean deviation (average deviation), population variance, population standard deviation, sample variance, sample standard deviation, pooled variance, variance of dichotomous variable, coefficient of variation (coefficient of relative variation), and Gini’s mean difference.

Ordinal: range, interquartile range (midspread), quartile deviation (semi-interquartile range, quartile range), coefficient of quartile variation, median absolute deviation, coefficient of dispersion, and Leik’s D.

Nominal: variation ratio, index of diversity, index of qualitative variation, entropy, and standardized entropy.

Sampling distributions: variance of sampling distribution of means, standard error of the mean (standard deviation of sampling distribution of means), standard error of a proportion, and sampling error.

6.3 Summaries: Tabular and Graphical

Statistical Graphics for Univariate and Bivariate Data

Statistical Graphics for Visualizing Multivariate Data

Relating Statistics and Experimental Design: An Introduction

Methods of Randomization in Experimental Design 

6.4 Spread

6.5 Tables

6.6 Graphs

6.7 Normal and abnormal?

7 Two or more Variables

7.1 Tables

8 Inferential Statistics

8.1 Reliability and validity

Research Designs

Introduction to Survey Sampling

Achievement Testing: Recent Advances

Using Published Data: Errors and Remedies

Secondary Analysis of Survey Data

Bayesian Statistical Inference

Cluster Analysis

Models for Innovation Diffusion 

Meta-Analysis: Quantitative Methods for Research Synthesis

Multiple Comparisons 

Analysis of nominal data

Analysis of ordinal data