2  Data and its variations

Any research is based on data, and data can come in various forms. Depending on how you approach it, data can be classified according to various schemes.

Data comes in various formats. Cartoon by Manfred Steger, CC BY-SA.

Analysing each type of data requires a different framework and methodology. Also, the methods for collecting each type of data may be different.

The word data is the plural of the Latin word datum, which literally means ‘something given’, the neuter past participle of dare, ‘to give’. Though plural in origin, in modern use it is treated as a mass noun that takes a singular verb. For example, sentences such as “data was collected via an online questionnaire” are now widely accepted.

But let us ask ourselves two fundamental questions: What is data? And how do we get data?

2.1 What is data?

Humans have been collecting and storing data in various forms since antiquity. Data stored in physical formats such as inscriptions, papyrus rolls, and cuneiform tablets, or even in oral traditions, allowed knowledge to be passed down over generations. In our present computer-driven world, data is usually stored in digital format.

At a very basic level, data is information about the objects that we want to understand. Depending on the field of study, the object of study may be a single cell, a single student, a group of students, a classroom, a school, a district, a state, a country, a document, an interview, a group discussion, or, as in cosmology, the entire universe!

We select some features of this object of study and try to measure them. These “features” can be anything that is “measurable”. For example, it may be “handedness” in a classroom of students, or their scores on some exam, or their heights. Thus each object of interest can give us several measurements. These measurements are termed variables.

A mind map describing data and its relation to objects of interest.

Data and variables from objects of interest.

This points to a fundamental limitation on our ability to get information. For example, we may want to understand the thinking processes in the minds of children, but we only have access to what they say and do. This is the data we can obtain, either by experimentation or by observation. This data does not tell us directly what we want to know; this is where the idea of data analysis comes into the picture. We infer things about the object of study by analysing this data.

We measure what we can…

The issue is that we can only measure what we can, and we build models, either mathematical or conceptual, based on our analysis of this data. These “models” then help us create a framework for understanding the object of analysis and its variables in relation to each other and to other contextual variables. This is, of course, a crude and oversimplified description of the actual process.

A cartoon showing measuring by tape

We measure what we can. Cartoon by Manfred Steger.

What type of measurements are needed to collect your data? If it is secondary data, think about how it was collected.

But the underlying philosophy that guides such an approach comes from the philosophy of science. We assume that by studying and observing the world around us we can collect data which can be analysed to create models to understand the world. If you ask what a model is, we can perhaps answer:

A postulated structure which could approximately have led to the data.

Such models help us think about the world in abstract concepts, and help us reveal patterns, establish relationships, and predict outcomes. But we should remember that all models are tentative.

This particular idea is very well captured by a quote from George Box:

All models are wrong. Some are useful.

What Box means here is that we are creating approximate and simplified versions of reality using the data that we have. Thus we can never create a “correct” version of reality. The history of science is full of examples of this. For instance, before Einstein’s theory of relativity, which postulated that the velocity of light is a constant independent of the source of light and the observer, everyone believed that the velocity of light depended on the velocity of its source and of the observer.

Let us look at Figure 2.1 to understand this point a bit more concretely. Figure 2.1 shows a miniature version of the solar system, with electric motors to show the revolution of the planets around the Sun. Now this model is a representation of the solar system, and it is incorrect about many things (can you tell which?).

A miniature solar system model.
Figure 2.1: A Solar System Model at a Government School near Jaipur, photo by Rafikh Shaikh, 2023.

But at the same time it provides a concrete way for students to understand various aspects of the movements of the planets. Thus, just because a model is wrong doesn’t mean that it is not useful. This is true for all models that we create to describe the world around us.


Exercise

Think about other models which you use which are approximate or may be wrong.


2.2 Types of Data


Data can be broadly classified into two primary types: observational and experimental. Each type serves distinct purposes, offers unique insights, and has specific applications in educational contexts.

Observational data is collected by observing subjects in their natural environment without manipulating variables. This approach is particularly useful for understanding behaviours, interactions, and processes as they occur naturally. Observational data is non-intrusive, capturing real-world behaviours and interactions, and can be qualitative or quantitative, including descriptive notes or numerical measurements.

For example, researchers may conduct classroom observations to analyse teacher-student interactions, classroom dynamics, and student engagement. Video recordings of classrooms have been analysed to tailor teaching strategies that improve student performance. A study might record how teachers provide feedback during lessons and assess its impact on student motivation.

Researchers might record how children interact with peers during free play to understand early socialization patterns. Linguistic development studies can also utilize observational data by examining parent-child interactions at home to study language skills development. For instance, counting the number of words spoken by parents and analysing sentence complexity can reveal links to a child’s vocabulary growth.

The key point is that with observational data the researcher does not control how things happen; they simply observe and record what is happening.

On the other hand, experimental data is collected by manipulating one or more variables under controlled conditions to establish cause-and-effect relationships. This approach is essential for testing hypotheses and determining the efficacy of interventions. Experimental data is characterised by controlled conditions where variables are manipulated while others are held constant, allowing researchers to determine whether changes in one variable directly affect another. Randomization is often employed to ensure unbiased assignment of treatments or interventions.

In educational research, for example, alternating between retrieval practice (quizzes) and restudy (reading answers) has been shown to improve student exam performance. Testing whether interactive learning modules increase retention compared to traditional lectures is another example of experimental data usage. Intervention studies may manipulate teaching methods—such as comparing group discussions versus individual assignments—to measure their impact on student engagement. A study might test whether incorporating gamification into lesson plans improves motivation among middle school students. Action research conducted by teachers systematically tests solutions to classroom problems, such as whether peer tutoring enhances math scores by introducing new teaching aids (e.g., visual tools) and measuring their effect on comprehension.

Another dimension for the classification of data is quantitative versus qualitative. Within these, data can be further categorised based on its measurement level and characteristics.

Numerical, or quantitative data, consists of values that can be measured and expressed in numbers. It is used to perform mathematical computations and statistical analyses.

Discrete data represents countable values that take distinct, separate numbers, such as the number of students enrolled in a class, the number of questions answered correctly in a test, or the number of research papers published by a faculty member.

Continuous data, on the other hand, can take any value within a given range and is typically measured rather than counted. Examples include students’ heights and weights recorded in a physical education class, time taken by students to complete an online assessment, and the average marks obtained by students in an exam.
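In R this distinction shows up in how values are stored: counts are naturally integers, while measurements are doubles. A minimal sketch with made-up values:

```r
# Discrete data: countable values stored as integers (note the L suffix)
questions_correct <- c(18L, 25L, 22L, 30L)

# Continuous data: measured values stored as doubles
completion_time_min <- c(42.5, 37.8, 51.2, 40.0)

class(questions_correct)    # "integer"
class(completion_time_min)  # "numeric"
```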

Categorical, or qualitative data, represents groups, labels, or classifications that do not have a numerical value but may be counted for frequency analysis.

Nominal data consists of categories that do not have a meaningful order or ranking, such as the different subjects chosen by students (e.g., Mathematics, Science, History), the types of schools (e.g., Government, Private, International), or students’ preferred learning styles (e.g., Visual, Auditory, Kinesthetic).

Ordinal data represents categories that have a meaningful order but do not have equal intervals between them. Examples include student performance ratings (such as Excellent, Good, Average, Poor), Likert scale responses in surveys (such as Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree), and socio-economic status categories (such as Low, Middle, High).
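In R, both kinds of categorical data are represented by factors; for ordinal data we additionally set ordered = TRUE so that comparisons between levels become meaningful. A sketch using the examples above:

```r
# Nominal data: no inherent order among the levels
subjects <- factor(c("Mathematics", "Science", "History", "Science"))

# Ordinal data: levels have a meaningful order, declared explicitly
ratings <- factor(c("Good", "Poor", "Excellent"),
                  levels = c("Poor", "Average", "Good", "Excellent"),
                  ordered = TRUE)

is.ordered(subjects)     # FALSE
ratings[1] > ratings[2]  # TRUE: "Good" ranks above "Poor"
```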

Mindmap showing different types of data.

Different types of data.

Another important data type in educational research is time-series data, which is collected at regular intervals to observe trends over time. Examples include the annual dropout rate in secondary schools over the past decade, monthly attendance rates in a school over a year, and the number of students enrolling in higher education institutions each year.
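Base R’s ts class attaches the time structure (start point and frequency) to such data. A sketch with hypothetical monthly attendance rates:

```r
# Hypothetical monthly attendance rates (per cent) over one year
attendance <- c(92, 94, 91, 89, 95, 93, 90, 88, 94, 96, 93, 91)

# Monthly series starting January 2023: frequency = 12 observations per year
attendance_ts <- ts(attendance, start = c(2023, 1), frequency = 12)
frequency(attendance_ts)    # 12
```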

Additionally, spatial data includes geographical or locational information, such as mapping the distribution of literacy rates across different states, identifying regions with the highest school dropout rates, and analysing school accessibility in rural and urban areas using Geographic Information Systems (GIS).

Understanding the different types of data is crucial for selecting appropriate analysis methods and ensuring accurate interpretation of research findings. Correctly identifying the type of data ensures that we as researchers can derive meaningful and reliable insights from it.

Now that we have refreshed some of the basic ideas about data and its types let us look at some data using R.


Exercise

Identify the variables (dependent/independent) and their types in the above examples.


2.3 Inbuilt datasets and Importing Data

To make statistical computations meaningful we will need data to work with. R has several excellent packages which provide datasets of various measurements. Along with these inbuilt datasets, we will also see how to import data from external sources such as spreadsheets and CSV files.
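As a preview of importing, base R’s read.csv() reads a comma-separated file into a data frame. The sketch below writes a small file to a temporary location first so that it is self-contained; in practice you would pass the path to your own file:

```r
# Create a small CSV file in a temporary location (for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("student,score", "Asha,78", "Bina,85", "Chetan,62"), csv_path)

# Import it as a data frame
scores <- read.csv(csv_path)
str(scores)
```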

Let us install some packages which provide several datasets. These are the LearnBayes and MASS packages. To install them use the install.packages() function as shown below.

install.packages("LearnBayes")

install.packages("MASS")

Exercise

Load these two libraries using the library() function.


Now within the LearnBayes and MASS packages we have several datasets. To see the datasets in a package we will use the data() command. But calling data() with no arguments will list all the datasets installed on your R system.

To know more about MASS visit this link.

To learn more about LearnBayes visit this link.

# load  libraries 
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(LearnBayes)

library(MASS)

Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

To see the datasets in a specific package, we tell data() which package we want. For example, the code block below will print the datasets in the LearnBayes package. I will use the terms package and library interchangeably.

data(package = "LearnBayes")


dataset_list <- data(package = "LearnBayes")$results[,3]
print(dataset_list)
 [1] "achievement"      "baseball.1964"    "bermuda.grass"    "birdextinct"     
 [5] "birthweight"      "breastcancer"     "calculus.grades"  "cancermortality" 
 [9] "chemotherapy"     "darwin"           "donner"           "election"        
[13] "election.2008"    "footballscores"   "hearttransplants" "iowagpa"         
[17] "jeter2004"        "marathontimes"    "puffin"           "schmidt"         
[21] "sluggerdata"      "soccergoals"      "stanfordheart"    "strikeout"       
[25] "studentdata"     

To see the MASS datasets

# view datasets in a particular package
data(package = "MASS")

This command will list all the datasets in the MASS package. We will explore some of these datasets, starting with the dataset studentdata from LearnBayes. To load a dataset we use the data() command:

# load the dataset
data("studentdata")

data("survey")

data("election")
str(election)
'data.frame':   67 obs. of  5 variables:
 $ county  : Factor w/ 67 levels "Alachua","Baker",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ perot   : int  8072 667 5922 819 25249 38964 630 7783 7244 3281 ...
 $ gore    : int  47300 2392 18850 3075 97318 386518 2155 29645 25525 14632 ...
 $ bush    : int  34062 5610 38637 5414 115185 177279 2873 35426 29766 41736 ...
 $ buchanan: int  262 73 248 65 570 789 90 182 270 186 ...
summary(election)
      county       perot            gore             bush       
 Alachua : 1   Min.   :  316   Min.   :   788   Min.   :  1316  
 Baker   : 1   1st Qu.: 1072   1st Qu.:  3058   1st Qu.:  4748  
 Bay     : 1   Median : 3739   Median : 14167   Median : 20206  
 Bradford: 1   Mean   : 7221   Mean   : 43400   Mean   : 43423  
 Brevard : 1   3rd Qu.: 8700   3rd Qu.: 45982   3rd Qu.: 56544  
 Broward : 1   Max.   :38964   Max.   :386518   Max.   :289492  
 (Other) :61                                                    
    buchanan     
 Min.   :   9.0  
 1st Qu.:  46.5  
 Median : 120.0  
 Mean   : 259.1  
 3rd Qu.: 285.5  
 Max.   :3407.0  
                 

2.4 “Seeing” the dataset

To “see” how this dataset is structured there are several commands that we can use.

str()

Let us start with the str() command which shows us the structure of a dataframe.

# to see the structure of the dataset
str(studentdata)
'data.frame':   657 obs. of  11 variables:
 $ Student: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Height : num  67 64 61 61 70 63 61 64 66 65 ...
 $ Gender : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 1 1 1 2 ...
 $ Shoes  : num  10 20 12 3 4 NA 12 25 30 10 ...
 $ Number : int  5 7 2 6 5 3 3 4 3 7 ...
 $ Dvds   : num  10 5 6 40 6 5 53 20 40 22 ...
 $ ToSleep: num  -2.5 1.5 -1.5 2 0 1 1.5 0.5 -0.5 2.5 ...
 $ WakeUp : num  5.5 8 7.5 8.5 9 8.5 7.5 7.5 7 8.5 ...
 $ Haircut: num  60 0 48 10 15 25 35 25 30 12 ...
 $ Job    : num  30 20 0 0 17.5 0 20 0 25 0 ...
 $ Drink  : Factor w/ 3 levels "milk","pop","water": 3 2 1 3 2 3 3 2 3 1 ...
str(survey)
'data.frame':   237 obs. of  12 variables:
 $ Sex   : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
 $ Wr.Hnd: num  18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
 $ NW.Hnd: num  18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
 $ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
 $ Fold  : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
 $ Pulse : int  92 104 87 NA 35 64 83 74 72 90 ...
 $ Clap  : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
 $ Exer  : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
 $ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
 $ Height: num  173 178 NA 160 165 ...
 $ M.I   : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
 $ Age   : num  18.2 17.6 16.9 20.3 23.7 ...

Note how the str() command gives the variable names for each column and their class.

Which classes of data types can you identify in each set? Use the dim() function to find out the dimensions of the data. What do you expect the dimensions of the above datasets to be (they are tables, after all)?
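For reference, dim() returns the number of rows and columns of a data frame. Here it is applied to the built-in mtcars data frame; studentdata and survey can be inspected the same way:

```r
# dim() gives (rows, columns) of a data frame
dim(mtcars)     # 32 11
nrow(mtcars)    # 32
ncol(mtcars)    # 11
```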

head() and tail()

The head() and tail() commands in R are used to quickly view the beginning and end of a dataset, respectively. These functions are particularly helpful when working with large dataframes or vectors, allowing users to inspect a small portion of the data without displaying the entire dataset.

head(): Displays the first 6 rows (by default) of a dataframe or vector. This is useful when you want to get a quick look at the structure or the first few entries of the data.

head(studentdata)
  Student Height Gender Shoes Number Dvds ToSleep WakeUp Haircut  Job Drink
1       1     67 female    10      5   10    -2.5    5.5      60 30.0 water
2       2     64 female    20      7    5     1.5    8.0       0 20.0   pop
3       3     61 female    12      2    6    -1.5    7.5      48  0.0  milk
4       4     61 female     3      6   40     2.0    8.5      10  0.0 water
5       5     70   male     4      5    6     0.0    9.0      15 17.5   pop
6       6     63 female    NA      3    5     1.0    8.5      25  0.0 water

tail(): Displays the last 6 rows (by default) of a dataframe or vector. It’s useful for checking the most recent or final entries of your data.

tail(survey)
       Sex Wr.Hnd NW.Hnd W.Hnd   Fold Pulse  Clap Exer Smoke Height      M.I
232   Male   18.0   16.0 Right R on L    NA Right Some Never 180.34 Imperial
233 Female   18.0   18.0 Right L on R    85 Right Some Never 165.10 Imperial
234 Female   18.5   18.0 Right L on R    88 Right Some Never 160.00   Metric
235 Female   17.5   16.5 Right R on L    NA Right Some Never 170.00   Metric
236   Male   21.0   21.5 Right R on L    90 Right Some Never 183.00   Metric
237 Female   17.6   17.3 Right R on L    85 Right Freq Never 168.50   Metric
       Age
232 20.750
233 17.667
234 16.917
235 18.583
236 17.167
237 17.750

glimpse

The glimpse() function in R, provided by the dplyr package, offers a quick and concise overview of the structure of a dataframe or tibble. Unlike head() and tail(), which show only a subset of rows, glimpse() displays both the data types of each column and a preview of the data itself. This makes it especially useful for quickly understanding the composition and types of variables in a dataset.

glimpse(): Provides a transposed view of the dataframe, showing each column’s name, type, and a preview of its values. It helps to inspect the data structure in a compact form, particularly when dealing with wide datasets.
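A minimal sketch, assuming dplyr is available (it is loaded above via tidyverse). Here glimpse() is applied to the built-in mtcars data frame; studentdata and survey can be inspected the same way:

```r
library(dplyr)

# Transposed overview: one line per column, showing its name,
# type, and the first few values
glimpse(mtcars)
```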

