4 stats-basic

5 Descriptive Statistics

Descriptive statistics is the term used when we are trying to describe the data in a summarised form. But this sounds like a circular definition. Let us clarify what we mean by this with an example.

Suppose we have the following data for 30 students in a classroom: roll number, gender, and marks in three subjects (English, Science and Mathematics).

Roll No	Gender	English	Science	Mathematics
1	Male	88	92	84
2	Female	75	70	68
3	Male	65	60	72
4	Female	55	59	58
5	Male	45	49	51
6	Female	80	78	82
7	Male	42	47	40
8	Female	67	69	70
9	Male	85	87	89
10	Female	90	85	92
11	Male	33	38	36
12	Female	58	55	60
13	Male	60	62	63
14	Female	71	75	73
15	Male	48	45	42
16	Female	64	61	60
17	Male	50	48	46
18	Female	66	63	65
19	Male	38	44	41
20	Female	69	68	67
21	Male	77	79	80
22	Female	82	85	84
23	Male	43	47	45
24	Female	53	56	54
25	Male	91	93	95
26	Female	40	44	39
27	Male	87	89	86
28	Female	49	51	50
29	Male	46	42	48
30	Female	55	53	57

What does this table tell us at a glance? Data in this form are typically called raw data. The table is arranged by roll number, an order that tells us little about the quantities we are concerned with. For example, consider the following questions:

What are the highest and lowest marks in each subject?
Which subject has the highest average? Which has the lowest?
How many students are scoring below the passing mark (35) in each subject?
What proportion of students are scoring between 35 and 50 (just passing)?
How many students scored 80 or above in at least one subject?
What is the distribution of total scores (English + Science + Mathematics), also by gender?
What are the average scores in each subject, also by gender?

What is the average mark in each subject (English, Science, Mathematics)?

Some of these questions can be answered by sorting the data in a different way. Right now the table is sorted by a label, the roll number, which has no inherent meaning here: it is not an ordinal variable, only an identifying label. In some cases the roll number may be based on the alphabetical order of names, or on the order in which students were admitted to the class, but in our present case it is just an identifier of the student.

Other questions can be answered by sorting and then counting the values below or above a particular threshold (marks = 35, marks = 80, etc.). For the remaining questions we need to take means of the data, and for some of them we must first group the data (for example by gender) and then take the means.

⁉️ Identify which questions can be answered how?

Let us load this data in R. We will use a CSV (comma-separated values) file. To load the file you can use these commands:

library(here)

here() starts at /builds/the-mitr/r-edu

classroom_data <- read.csv(here("data", "class-01.csv"))
#View(classroom_data)

This command imports the data into R and stores it in an object called classroom_data. Let us understand what each line does. The first line, library(here), loads an R package called here, which lets us give relative paths. The package constructs file paths relative to the project root (usually the folder that contains your .Rproj file or the root of your Quarto project). This avoids the problems that come from working in different directories. (The folder structure on your computer will not be the same as mine.) More on this later.

To actually read the CSV file we use the function read.csv(...). It is a base R function: it comes with the default installation of R, so you do not need to install any package to use it. Inside here(...) we give the folder and the file: "data" is the name of the folder and "class-01.csv" is the name of the CSV file inside this folder. Finally, classroom_data <- ... stores the result of reading the CSV file in an object called classroom_data.
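The reading step can be tried without the course data. The following is a minimal sketch that assumes nothing beyond base R: it writes a tiny, hypothetical CSV file (class-sample.csv, not part of the original material) into a temporary folder and reads it back with read.csv(...). The path is built with file.path(...), which joins folder and file names in the same spirit as here(...).

```r
# Minimal sketch: write a small CSV to a temporary folder, then read it back.
# The folder and file names are hypothetical stand-ins for data/class-01.csv.
tmp_dir <- file.path(tempdir(), "data")
dir.create(tmp_dir, showWarnings = FALSE)

sample_rows <- data.frame(
  Roll.No = 1:3,
  Gender  = c("Male", "Female", "Male"),
  English = c(88, 75, 65)
)
csv_path <- file.path(tmp_dir, "class-sample.csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

# read.csv() returns a data frame; the first line of the file gives the column names
sample_data <- read.csv(csv_path)
str(sample_data)
```

Because the file lives under tempdir(), it disappears when the R session ends, so the sketch leaves nothing behind in your project.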

As a good researcher, you should make a habit of organising data in a meaningful way. This includes how you name and organise your files and folders. The following are some best practices for giving filenames that are meaningful both to you and to R.

Do’s

Use descriptive, meaningful names.
- Examples: student_scores.csv, survey_data_2023.xlsx
Use only lowercase letters, numbers, hyphens (-), and underscores (_); do not use spaces.
- Recommended: my_data.csv, class-01.csv
- Avoid: My_Data.csv
Use consistent naming conventions.
- For example:
  - Snake case: student_marks.csv
  - Kebab case: student-marks.csv
Include the correct file extension. Applications will not understand the type of file otherwise.
- Use .csv, .xlsx, .R, .qmd, .txt, etc.
Use here() or file.path() to refer to file paths
- Example:
```
read.csv(here("data", "class-01.csv"))
```

Don’ts

Do not use spaces in filenames. I cannot stress this enough. Avoiding spaces in filenames will save you a lot of trouble later.
- Do not use: student scores.csv
- Preferred: student_scores.csv or student-scores.csv
Avoid special characters and punctuation. Some of these characters are reserved and can cause problems.
- Do not use: ! @ # $ % ^ & * ( ) ~ ? > < ,
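Avoid starting filenames with numbers. R can read such files, but R object names cannot begin with a digit, so such a filename cannot double as an object name.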
- Do not use: 2023_report.csv
- Preferred: report_2023.csv
Do not hardcode absolute paths.
- Do not use: read.csv("C:/Users/yourname/Documents/file.csv")
- Use relative paths or here() instead
Do not rely on case-insensitivity.
- On case-sensitive file systems (Linux, for example), Data.csv and data.csv are different files, so be careful about letter case in file names. It is best to use all lowercase letters to avoid any confusion.
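The naming rules above can be checked mechanically. The sketch below defines a hypothetical helper, is_tidy_filename (not part of any package), that encodes them as one regular expression: a lowercase first letter, then lowercase letters, digits, underscores or hyphens, then an extension. It is only an illustration of the conventions, not a complete validator.

```r
# Hypothetical helper: TRUE when a filename follows the conventions listed above
# (lowercase first letter; then lowercase letters, digits, _ or -; then an extension).
is_tidy_filename <- function(filename) {
  grepl("^[a-z][a-z0-9_-]*\\.[A-Za-z0-9]+$", filename, perl = TRUE)
}

candidates <- c("student_scores.csv", "class-01.csv", "student scores.csv",
                "My_Data.csv", "2023_report.csv")
data.frame(filename = candidates, ok = is_tidy_filename(candidates))

# file.path() joins folder and file names into one relative path
file.path("data", "class-01.csv")
```

The first two names pass; the others fail because of a space, uppercase letters, and a leading digit respectively.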

Now that we have our data loaded in R as classroom_data let us see this data. To see the data in R we simply type the object name.

classroom_data

   Roll.No Gender English Science Mathematics
1        1   Male      88      92          84
2        2 Female      75      70          68
3        3   Male      65      60          72
4        4 Female      55      59          58
5        5   Male      45      49          51
6        6 Female      80      78          82
7        7   Male      42      47          40
8        8 Female      67      69          70
9        9   Male      85      87          89
10      10 Female      90      85          92
11      11   Male      33      38          36
12      12 Female      58      55          60
13      13   Male      60      62          63
14      14 Female      71      75          73
15      15   Male      48      45          42
16      16 Female      64      61          60
17      17   Male      50      48          46
18      18 Female      66      63          65
19      19   Male      38      44          41
20      20 Female      69      68          67
21      21   Male      77      79          80
22      22 Female      82      85          84
23      23   Male      43      47          45
24      24 Female      53      56          54
25      25   Male      91      93          95
26      26 Female      40      44          39
27      27   Male      87      89          86
28      28 Female      49      51          50
29      29   Male      46      42          48
30      30 Female      55      53          57

The object containing our data, classroom_data, is stored in a particular form: its class is data.frame. You can check this using the command class(...).

class(classroom_data)

[1] "data.frame"

In R, a dataframe is a fundamental data structure used to store data as a table, just like a spreadsheet. Each column holds values of a single type, such as numbers, text, or categories, but different columns can hold different types. Each row represents an observation or record. Note that all columns of a dataframe must have the same number of rows.
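To make these properties concrete, here is a minimal sketch that builds a small dataframe by hand from the first three rows of the table (the object name small_df is only illustrative). It shows one type per column, and that R refuses columns of unequal length.

```r
# A small data frame built by hand; values are the first three rows of the class table
small_df <- data.frame(
  Roll.No = c(1L, 2L, 3L),
  Gender  = c("Male", "Female", "Male"),
  English = c(88, 75, 65),
  stringsAsFactors = FALSE
)

sapply(small_df, class)  # one type per column
dim(small_df)            # number of rows, then number of columns

# Columns of different lengths are rejected; try() captures the error instead of stopping
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")
```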

There are various functions that are useful to access, manipulate and transform a dataframe. We will see some of them now.

Now each column of the dataframe can have different data types. For example the first column is the Roll number, let us check its data type. For this we will need to tell the class(...) function that we only want to know the data type of the first column. There are two basic ways of doing this. Both can be useful depending on the situation.

Case 1: By column name

In this case we specify the column name of the dataframe. For example,

class(classroom_data$Roll.No)

[1] "integer"

Note the syntax. We used a $ sign after the dataframe name and then added the column name. The autocomplete feature in RStudio helps a lot here: once the dataframe name is typed and $ is added, it shows a list of all the column names.

You can just select the required column.

Case 2: By column number

In this case we specify the column number of the dataframe. There are multiple ways to achieve this:

class(classroom_data[[1]])

[1] "integer"

#returns the first column as a vector


class(classroom_data[, 1])

[1] "integer"

#same as above, returns the first column

Note that in the first case we use double square brackets, [[...]], to specify the column. In the second case, [, 1], we select the first column by leaving the row position empty. This format can also be used to select rows or specific cells. The table below shows how we can access different elements of the dataframe in this format.

Syntax	Meaning	Returns
`df[i, j]`	i-th row and j-th column	A single value: the cell in row i, column j
`df[i, ]`	i-th row, all columns	A row (as a data frame)
`df[, j]`	all rows, j-th column	A column (vector or data frame)
`df[, 2:5]`	all rows, columns 2 to 5	A sub-data-frame
`df[[j]]`	j-th column	A vector (drops dimensions)
`df[, "name"]`	column by name	Same as `df[[j]]` if exact match
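Each row of the table can be tried on a small dataframe. The sketch below uses the first four rows of the class data under the illustrative name df. The last line adds drop = FALSE, a standard argument of base R indexing, to show how a single column can be kept as a dataframe instead of a vector.

```r
df <- data.frame(
  Roll.No = 1:4,
  Gender  = c("Male", "Female", "Male", "Female"),
  English = c(88, 75, 65, 55),
  Science = c(92, 70, 60, 59),
  stringsAsFactors = FALSE
)

df[2, 3]               # single cell: row 2, column 3
df[2, ]                # row 2, as a one-row data frame
df[, 3]                # column 3, as a vector
df[, 3:4]              # columns 3 to 4, as a data frame
df[[3]]                # column 3, as a vector
df[["English"]]        # the same column, selected by name
df[, 3, drop = FALSE]  # column 3, kept as a one-column data frame
```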

The commands in the table above can be used in a variety of ways to extract required data, either as columns or as sub-datasets. For example, if we only want a table that has roll number and science scores we can again use the column names or column numbers.

Let us check the output of this command which takes out values of Roll.No (Column 1) and science scores (Column 4).

classroom_data[, c(1, 4)]

   Roll.No Science
1        1      92
2        2      70
3        3      60
4        4      59
5        5      49
6        6      78
7        7      47
8        8      69
9        9      87
10      10      85
11      11      38
12      12      55
13      13      62
14      14      75
15      15      45
16      16      61
17      17      48
18      18      63
19      19      44
20      20      68
21      21      79
22      22      85
23      23      47
24      24      56
25      25      93
26      26      44
27      27      89
28      28      51
29      29      42
30      30      53

What we have done here is to create a vector using the “combine” or “concatenate” function c(...). This function creates a vector with the entries that we give it, namely 1 and 4. Thus our command [, c(1, 4)] gives us:

All rows (because nothing is specified before the comma),
Only columns 1 and 4 (because c(1, 4) selects these two columns by position).
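The two steps can be separated to see what c(...) contributes. In this minimal sketch the vector is first stored under the illustrative name cols and then used for indexing on a small three-row dataframe built from the class data.

```r
cols <- c(1, 4)   # a numeric vector with two entries
length(cols)

df <- data.frame(
  Roll.No = 1:3,
  Gender  = c("Male", "Female", "Male"),
  English = c(88, 75, 65),
  Science = c(92, 70, 60)
)

subset_df <- df[, cols]  # all rows, columns 1 and 4
names(subset_df)
```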

We can use similar syntax with `c(...)` to get very specific data out of a dataframe. We can also use column or row names instead of numbers. In this case each name has to be a string inside quotes, as in "name".

Note that we cannot write [, 1, 4]. R does not read this as a request for two columns, and the result will not be what we expect.

Task

Use column names instead of column numbers to get the same subset of two columns.

But note that the output appears to have three columns: two that look like roll numbers and one for Science. We asked for only two columns, so what is happening? R also prints the row names (here, the row numbers) when it displays a dataframe. Notice that the first column has no column name. This feature is useful at times, but if we do not want the row numbers we can suppress them using

print(classroom_data[, c(1, 4)], row.names = FALSE)

 Roll.No Science
       1      92
       2      70
       3      60
       4      59
       5      49
       6      78
       7      47
       8      69
       9      87
      10      85
      11      38
      12      55
      13      62
      14      75
      15      45
      16      61
      17      48
      18      63
      19      44
      20      68
      21      79
      22      85
      23      47
      24      56
      25      93
      26      44
      27      89
      28      51
      29      42
      30      53

Enough of R commands, let us do some statistics on the data set. At the beginning we asked some questions about the classroom data.

Let us start with the first one

What are the highest and lowest marks in each subject?

For this, R has two dedicated functions, min(...) and max(...), which find the minimum and maximum values in a given set of data. The syntax is min(x), where x is the set of data values. In our case the scores for each subject are in a named column. The column names can be seen with the command names(...):

names(classroom_data)

[1] "Roll.No"     "Gender"      "English"     "Science"     "Mathematics"

Thus, to get the maximum value of the English scores we can write

max(classroom_data$English)

[1] 91

Similarly, for minimum value we can write

min(classroom_data$English)

[1] 33

Task

Find out minimum and maximum values of Science and Mathematics scores. For science scores use the column name method, and for mathematics use column number method.

We can also save the output of commands like min(...) in a variable. For example, let us store the minimum and maximum scores in English in the variables eng_min and eng_max:

eng_min <- min(classroom_data$English)
eng_max <- max(classroom_data$English)

Note that the above code block does not print any output. This is because we have only computed the minimum and maximum values and stored them in the variables eng_min and eng_max. To see these values we can just type the variable names.

eng_min

[1] 33

eng_max

[1] 91
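Stored values can be reused in later calculations. As a minimal sketch, the code below recomputes eng_min and eng_max from the English column (typed in here as a plain vector so that the block runs on its own) and uses them to get the spread between the lowest and highest score. The name eng_range is only illustrative; range(...) is the base R function that returns both extremes in one call.

```r
# English scores from the class table, in roll-number order
english <- c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90,
             33, 58, 60, 71, 48, 64, 50, 66, 38, 69,
             77, 82, 43, 53, 91, 40, 87, 49, 46, 55)

eng_min <- min(english)
eng_max <- max(english)

eng_range <- eng_max - eng_min  # distance between lowest and highest score
eng_range

range(english)  # minimum and maximum together, as a vector of length two
```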

Task

Store min and max values of Science and Mathematics scores in suitably named variables.

Now let us see how we can present the results in a nice table using the kable(...) function from a package called knitr. For this we will use a slightly different approach, based on the tidyverse:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

classroom_data |>
  summarise(
    science_min = min(Science),
    science_max = max(Science),
    science_mean = mean(Science)
  ) |> knitr::kable(caption = "Minimum and Maximum Scores by Subject")

Minimum and Maximum Scores by Subject
science_min	science_max	science_mean
38	93	63.13333

Here we have used summarise(...) (also available under the American spelling summarize(...)), one of the most useful functions in the dplyr package, which is loaded as part of the tidyverse. It is used to create summary statistics from a dataframe: summarise(...) collapses multiple values into a single summary value. In our case we apply it to the Science column of the dataframe to get its minimum, maximum and mean.
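The same idea of collapsing many values into one extends to groups, which is what the by-gender questions at the start of the chapter need. The sketch below is a base R alternative, not the tidyverse approach used above: it applies aggregate(...) to the first six rows of the class data (stored under the illustrative name first_six) so that it runs without any extra package. In dplyr, the usual route is to call group_by(...) before summarise(...).

```r
# First six rows of the class table: gender and Science score only
first_six <- data.frame(
  Gender  = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Science = c(92, 70, 60, 59, 49, 78)
)

# One mean per gender: the formula reads "Science, grouped by Gender"
science_by_gender <- aggregate(Science ~ Gender, data = first_six, FUN = mean)
science_by_gender
```

With these six rows the result has one row per gender: the three Female scores average 69 and the three Male scores average 67.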

Another useful function is the summary(...) function. This function gives a statistical summary of different columns in a dataframe. The type of summary depends on the datatype of the column. For example, to get summary of classroom_data we can use

summary(classroom_data)

    Roll.No         Gender             English         Science     
 Min.   : 1.00   Length:30          Min.   :33.00   Min.   :38.00  
 1st Qu.: 8.25   Class :character   1st Qu.:48.25   1st Qu.:48.25  
 Median :15.50   Mode  :character   Median :62.00   Median :60.50  
 Mean   :15.50                      Mean   :62.40   Mean   :63.13  
 3rd Qu.:22.75                      3rd Qu.:76.50   3rd Qu.:77.25  
 Max.   :30.00                      Max.   :91.00   Max.   :93.00  
  Mathematics   
 Min.   :36.00  
 1st Qu.:48.50  
 Median :61.50  
 Mean   :63.23  
 3rd Qu.:78.25  
 Max.   :95.00

This statistical summary gives the minimum, first quartile, median, mean, third quartile and maximum for numerical data. For text data it gives the length of the column (30) and the type of data.
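Each number in the summary can also be obtained on its own. The following minimal sketch uses the base R functions median(...), mean(...) and quantile(...) on the English scores, typed in as a plain vector so the block is self-contained. The results agree with the English column of the summary above.

```r
# English scores from the class table, in roll-number order
english <- c(88, 75, 65, 55, 45, 80, 42, 67, 85, 90,
             33, 58, 60, 71, 48, 64, 50, 66, 38, 69,
             77, 82, 43, 53, 91, 40, 87, 49, 46, 55)

median(english)                           # middle value: 62
mean(english)                             # arithmetic average: 62.4
quantile(english, probs = c(0.25, 0.75))  # first and third quartiles: 48.25 and 76.5
```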

hist(classroom_data$English)

6 One Variable Analysis

6.1 Frequency distributions

6.2 Central Tendencies

Central Tendency and Variability

Measures of Center

Mode: mode, crude mode, and refined mode.

Median: median, rough median, and exact median.

Mean: mean, grouped mean, weighted mean, pooled mean, and mean of dichotomous variable.

Other order measures: midextreme (midrange), midhinge, trimean, and biweight.

Other means: trimmed mean, winsorized mean, and midmean; geometric mean, harmonic mean, generalized mean, and quadratic mean.

Measures of Spread

Numeric: mean deviation (average deviation), population variance, population standard deviation, sample variance, sample standard deviation, pooled variance, variance of dichotomous variable, coefficient of variation (coefficient of relative variation), and Gini’s mean difference.

Ordinal: range, interquartile range (midspread), quartile deviation (semi-interquartile range, quartile range), coefficient of quartile variation, median absolute deviation, coefficient of dispersion, and Leik’s D.

Nominal: variation ratio, index of diversity, index of qualitative variation, entropy, and standardized entropy.

Sampling distributions: variance of sampling distribution of means, standard error of the mean (standard deviation of sampling distribution of means), standard error of a proportion, and sampling error.

6.3 Summaries: Tabular and Graphical

Statistical Graphics for Univariate and Bivariate Data

Statistical Graphics for Visualizing Multivariate Data

Relating Statistics and Experimental Design: An Introduction

Methods of Randomization in Experimental Design

6.4 Spread

6.5 Tables

6.6 Graphs

6.7 Normal and abnormal?

7 Two or more Variables

7.1 Tables

8 Inferential Statistics

8.1 Reliability and validity

Research Designs

Introduction to Survey Sampling

Achievement Testing: Recent Advances

Using Published Data: Errors and Remedies

Secondary Analysis of Survey Data

Bayesian Statistical Inference

Cluster Analysis

Models for Innovation Diffusion

Meta-Analysis: Quantitative Methods for Research Synthesis

Multiple Comparisons

Analysis of nominal data

Analysis of ordinal data