2 Data and its variations
Any research is based on data. Data can come in various forms, and depending on how you approach it, data can be classified according to various schemes.
Analysing each type of data requires a different framework and methodology. Also, the methods for collecting each type of data may be different.
The word data is the plural of the Latin word datum, which literally means ‘something given’, the neuter past participle of dare, ‘to give’. Though plural, in modern use it is treated as a mass noun which takes a singular verb. For example, sentences such as “data was collected via an online questionnaire” are now widely accepted.
But let us ask ourselves these fundamental questions: What is data? And how do we get data?
2.1 What is data?
Humans have been collecting and storing data in various forms since antiquity. Data stored in physical formats such as inscriptions, papyrus rolls and cuneiform tablets, or even in oral traditions, allowed knowledge to be passed over generations. In our present computer-driven world, data is usually stored in a digital format.
At a very basic level, data is information about the objects that we want to understand. Depending on the field of study, the object of study may be a single cell, a single student, a group of students, a classroom, a school, a district, a state, a country, a document, an interview, a group discussion or, as in cosmology, the entire universe!
We select some features of this object of study and try to measure them. These “features” can be anything that is “measurable”. For example, it may be “handedness” in a classroom of students, or their scores on some exam, or their heights. Thus each object of interest can give us several measurements. These measurements are termed variables.
There is a fundamental limitation on our ability to get information. For example, we may want to understand the thinking processes in the minds of children, but we only have access to what they say and do. This is the data that we can obtain, either by experimentation or observation. This data does not tell us directly what we want to know; this is where the idea of data analysis comes into the picture. We infer things about the object of study by analysing this data.
We measure what we can…
The issue is that we can only measure what we can, and we build models, either mathematical or conceptual, based on our analysis of this data. These “models” then help us create a framework for understanding the object of analysis and its variables in relation to each other and to other contextual variables. This is, of course, a crude and oversimplified description of the actual process.
What type of measurements are needed to collect your data? If it is secondary data, think about how it was collected.
But the underlying philosophy that guides such an approach comes from the philosophy of science. We assume that by studying and observing the world around us we can collect data which can be analysed to create models to understand the world. If you ask what a model is, we can perhaps answer:
A postulated structure which approximately could have led to data.
Such models help us think about the world in abstract concepts, and help us reveal patterns, establish relationships and predict outcomes. But we should remember that all models are tentative.
This particular idea is very well captured by this quote by George Box:
All models are wrong. Some are useful.
What Box means here is that we are creating approximate and simplified versions of reality using the data that we have. Thus we can never create a “correct” version of reality. The history of science is full of examples which exemplify this. For instance, before Einstein’s theory of relativity, which postulates that the velocity of light is a constant independent of the source of light and of the observer, everyone believed that the velocity of light depended on the velocity of the source and of the observer.
Let us look at Figure 2.1 to understand this point a bit more concretely. Figure 2.1 shows a miniature version of the solar system which has electric motors to show the revolution of the planets around the sun. Now this model is a representation of the solar system, and it is incorrect about so many things (can you tell which?).

But at the same time it provides a concrete way for students to understand various aspects of movements of planets. Thus just because a model is wrong doesn’t mean that it is not useful. This is true for all models that we create to describe the world around us.
Exercise
Think about other models which you use which are approximate or may be wrong.
2.2 Types of Data
Data can be broadly classified into two primary types: observational and experimental. Each type serves distinct purposes, offers unique insights, and has specific applications in educational contexts.
Observational data is collected by observing subjects in their natural environment without manipulating variables. This approach is particularly useful for understanding behaviours, interactions, and processes as they occur naturally. Observational data is non-intrusive, capturing real-world behaviours and interactions, and can be qualitative or quantitative, including descriptive notes or numerical measurements.
For example, researchers may conduct classroom observations to analyse teacher-student interactions, classroom dynamics, and student engagement. Video recordings of classrooms have been analysed to tailor teaching strategies that improve student performance. A study might record how teachers provide feedback during lessons and assess its impact on student motivation.
Researchers might record how children interact with peers during free play to understand early socialization patterns. Linguistic development studies can also utilize observational data by examining parent-child interactions at home to study language skills development. For instance, counting the number of words spoken by parents and analysing sentence complexity can reveal links to a child’s vocabulary growth.
The main point is that with observational data the researcher does not control how things happen; they simply observe and record what is happening.
On the other hand, experimental data is collected by manipulating one or more variables under controlled conditions to establish cause-and-effect relationships. This approach is essential for testing hypotheses and determining the efficacy of interventions. Experimental data is characterised by controlled conditions where variables are manipulated while others are held constant, allowing researchers to determine whether changes in one variable directly affect another. Randomization is often employed to ensure unbiased assignment of treatments or interventions.
In educational research, for example, alternating between retrieval practice (quizzes) and restudy (reading answers) has been shown to improve student exam performance. Testing whether interactive learning modules increase retention compared to traditional lectures is another example of experimental data usage. Intervention studies may manipulate teaching methods—such as comparing group discussions versus individual assignments—to measure their impact on student engagement. A study might test whether incorporating gamification into lesson plans improves motivation among middle school students. Action research conducted by teachers systematically tests solutions to classroom problems, such as whether peer tutoring enhances math scores by introducing new teaching aids (e.g., visual tools) and measuring their effect on comprehension.
Another dimension for the classification of data is quantitative versus qualitative. Within these, data can be further categorised based on its measurement level and characteristics.
Numerical, or quantitative data, consists of values that can be measured and expressed in numbers. It is used to perform mathematical computations and statistical analyses.
Discrete data represents countable values that take distinct, separate numbers, such as the number of students enrolled in a class, the number of questions answered correctly in a test, or the number of research papers published by a faculty member.
Continuous data, on the other hand, can take any value within a given range and is typically measured rather than counted. Examples include students’ heights and weights recorded in a physical education class, time taken by students to complete an online assessment, and the average marks obtained by students in an exam.
Categorical, or qualitative data, represents groups, labels, or classifications that do not have a numerical value but may be counted for frequency analysis.
Nominal data consists of categories that do not have a meaningful order or ranking, such as the different subjects chosen by students (e.g., Mathematics, Science, History), the types of schools (e.g., Government, Private, International), or students’ preferred learning styles (e.g., Visual, Auditory, Kinesthetic).
Ordinal data represents categories that have a meaningful order but do not have equal intervals between them. Examples include student performance ratings (such as Excellent, Good, Average, Poor), Likert scale responses in surveys (such as Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree), and socio-economic status categories (such as Low, Middle, High).
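As a quick illustration (the values below are made up), these measurement levels map naturally onto R's basic data structures: numeric vectors for continuous data, integer vectors for discrete counts, and factors, unordered or ordered, for nominal and ordinal data:

```r
# hypothetical classroom data illustrating the four measurement levels
n_correct <- c(8L, 5L, 9L)           # discrete: counts of correct answers
height_cm <- c(141.2, 138.5, 150.1)  # continuous: measured heights
subject <- factor(c("Mathematics", "Science", "History"))  # nominal: no order
rating <- factor(c("Good", "Poor", "Excellent"),
                 levels = c("Poor", "Average", "Good", "Excellent"),
                 ordered = TRUE)     # ordinal: ordered categories
rating[1] > rating[2]  # comparisons are meaningful for ordered factors: TRUE
```

Storing ordinal data as an ordered factor rather than as plain strings tells R that comparisons between the categories are meaningful, while a nominal factor deliberately carries no such ordering.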
Another important data type in educational research is time-series data, which is collected at regular intervals to observe trends over time. Examples include the annual dropout rate in secondary schools over the past decade, monthly attendance rates in a school over a year, and the number of students enrolling in higher education institutions each year.
Additionally, spatial data includes geographical or locational information, such as mapping the distribution of literacy rates across different states, identifying regions with the highest school dropout rates, and analysing school accessibility in rural and urban areas using Geographic Information Systems (GIS).
Understanding the different types of data is crucial for selecting appropriate analysis methods and ensuring accurate interpretation of research findings. Choosing the correct type of data ensures that we as researchers can derive meaningful and reliable insights from the data.
Now that we have refreshed some of the basic ideas about data and its types, let us look at some data using R.
Exercise
Identify the variables (dependent/independent) and their types in the above examples.
2.3 Inbuilt datasets and Importing Data
To make statistical computations meaningful we will need data to work with. R has several excellent libraries which provide datasets of various measurements. Along with these datasets, we will also see how to import data from external sources such as spreadsheets, CSV files, etc.
Let us install some packages which provide several datasets. These are the LearnBayes and MASS packages. To install them, use the install.packages() function as shown below.
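The install calls look like this; note that package names in R are case sensitive, so it must be LearnBayes, not learnbayes. This needs to be run only once per R installation:

```r
# install the dataset packages (only needed once)
install.packages("LearnBayes")
install.packages("MASS")
```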
Exercise
Load these two libraries using the library() function.
Now within the LearnBayes and MASS libraries we have several datasets. To see the datasets in a library we will use the data() command. But calling data() with no arguments will list all the datasets installed on your R system.
To know more about MASS visit this link. To learn more about LearnBayes visit this link.
# load libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(LearnBayes)
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
To see the datasets in a specific library, we tell data() which package we want. For example, the code block below will print the datasets in the LearnBayes package. I will use the terms package and library interchangeably.
# store the names of the datasets in the LearnBayes package
dataset_list <- data(package = "LearnBayes")$results[,3]
print(dataset_list)
[1] "achievement" "baseball.1964" "bermuda.grass" "birdextinct"
[5] "birthweight" "breastcancer" "calculus.grades" "cancermortality"
[9] "chemotherapy" "darwin" "donner" "election"
[13] "election.2008" "footballscores" "hearttransplants" "iowagpa"
[17] "jeter2004" "marathontimes" "puffin" "schmidt"
[21] "sluggerdata" "soccergoals" "stanfordheart" "strikeout"
[25] "studentdata"
To see the MASS datasets:
# view datasets in a particular library
data(package = "MASS")
This command will list all the datasets in the MASS library. We will use some of these datasets to explore. Let us start with the dataset studentdata. To load this dataset we use the data() command.
# load the dataset
data("studentdata")
data("survey")
data("election")
str(election)
'data.frame': 67 obs. of 5 variables:
$ county : Factor w/ 67 levels "Alachua","Baker",..: 1 2 3 4 5 6 7 8 9 10 ...
$ perot : int 8072 667 5922 819 25249 38964 630 7783 7244 3281 ...
$ gore : int 47300 2392 18850 3075 97318 386518 2155 29645 25525 14632 ...
$ bush : int 34062 5610 38637 5414 115185 177279 2873 35426 29766 41736 ...
$ buchanan: int 262 73 248 65 570 789 90 182 270 186 ...
summary(election)
county perot gore bush
Alachua : 1 Min. : 316 Min. : 788 Min. : 1316
Baker : 1 1st Qu.: 1072 1st Qu.: 3058 1st Qu.: 4748
Bay : 1 Median : 3739 Median : 14167 Median : 20206
Bradford: 1 Mean : 7221 Mean : 43400 Mean : 43423
Brevard : 1 3rd Qu.: 8700 3rd Qu.: 45982 3rd Qu.: 56544
Broward : 1 Max. :38964 Max. :386518 Max. :289492
(Other) :61
buchanan
Min. : 9.0
1st Qu.: 46.5
Median : 120.0
Mean : 259.1
3rd Qu.: 285.5
Max. :3407.0
2.4 “Seeing” the dataset
To “see” how this dataset is structured there are several commands that we can use.
str()
Let us start with the str() command, which shows us the structure of a dataframe.
# to see the structure of the dataset
str(studentdata)
'data.frame': 657 obs. of 11 variables:
$ Student: int 1 2 3 4 5 6 7 8 9 10 ...
$ Height : num 67 64 61 61 70 63 61 64 66 65 ...
$ Gender : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 1 1 1 2 ...
$ Shoes : num 10 20 12 3 4 NA 12 25 30 10 ...
$ Number : int 5 7 2 6 5 3 3 4 3 7 ...
$ Dvds : num 10 5 6 40 6 5 53 20 40 22 ...
$ ToSleep: num -2.5 1.5 -1.5 2 0 1 1.5 0.5 -0.5 2.5 ...
$ WakeUp : num 5.5 8 7.5 8.5 9 8.5 7.5 7.5 7 8.5 ...
$ Haircut: num 60 0 48 10 15 25 35 25 30 12 ...
$ Job : num 30 20 0 0 17.5 0 20 0 25 0 ...
$ Drink : Factor w/ 3 levels "milk","pop","water": 3 2 1 3 2 3 3 2 3 1 ...
str(survey)
'data.frame': 237 obs. of 12 variables:
$ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
$ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
$ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
$ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
$ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
$ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
$ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
$ Height: num 173 178 NA 160 165 ...
$ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
$ Age : num 18.2 17.6 16.9 20.3 23.7 ...
Note how the str() command gives the variable names for each column and their class.
Which classes of datatypes can you identify in each dataset? Use the dim() function to find out the dimensions of the data. What do you expect the dimensions of the above datasets to be (they are tables, after all)?
head() and tail()
The head() and tail() commands in R are used to quickly view the beginning and end of a dataset, respectively. These functions are particularly helpful when working with large dataframes or vectors, allowing users to inspect a small portion of the data without displaying the entire dataset.
head(): Displays the first 6 rows (by default) of a dataframe or vector. This is useful when you want a quick look at the structure or the first few entries of the data.
head(studentdata)
Student Height Gender Shoes Number Dvds ToSleep WakeUp Haircut Job Drink
1 1 67 female 10 5 10 -2.5 5.5 60 30.0 water
2 2 64 female 20 7 5 1.5 8.0 0 20.0 pop
3 3 61 female 12 2 6 -1.5 7.5 48 0.0 milk
4 4 61 female 3 6 40 2.0 8.5 10 0.0 water
5 5 70 male 4 5 6 0.0 9.0 15 17.5 pop
6 6 63 female NA 3 5 1.0 8.5 25 0.0 water
tail(): Displays the last 6 rows (by default) of a dataframe or vector. It is useful for checking the most recent or final entries of your data.
tail(survey)
Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
232 Male 18.0 16.0 Right R on L NA Right Some Never 180.34 Imperial
233 Female 18.0 18.0 Right L on R 85 Right Some Never 165.10 Imperial
234 Female 18.5 18.0 Right L on R 88 Right Some Never 160.00 Metric
235 Female 17.5 16.5 Right R on L NA Right Some Never 170.00 Metric
236 Male 21.0 21.5 Right R on L 90 Right Some Never 183.00 Metric
237 Female 17.6 17.3 Right R on L 85 Right Freq Never 168.50 Metric
Age
232 20.750
233 17.667
234 16.917
235 18.583
236 17.167
237 17.750
glimpse
The glimpse() function in R, provided by the dplyr package, offers a quick and concise overview of the structure of a dataframe or tibble. Unlike head() and tail(), which show only a subset of rows, glimpse() displays both the data type of each column and a preview of the data itself. This makes it especially useful for quickly understanding the composition and types of variables in a dataset.
glimpse(): Provides a transposed view of the dataframe, showing each column’s name, type, and a preview of its values. It helps to inspect the data structure in a compact form, particularly when dealing with wide datasets.
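For example, running glimpse() on the studentdata dataframe we loaded earlier gives a one-line-per-column summary:

```r
# compact, transposed overview of the studentdata dataframe
glimpse(studentdata)
```

Compare this with the str() output above: glimpse() reports the same column names and types, but in a layout that stays readable even for dataframes with many columns.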