Structured Data: Data that is organized in a highly predictable, well-defined structure and follows a fixed schema. Examples include relational databases, where data is organized into tables with predefined columns.
Unstructured Data: Data that lacks a predefined data model or does not fit well into relational databases. Examples include text documents, images, audio, and video files.
Semi-Structured Data: Data that does not fit neatly into a structured model but has some level of structure. Examples include XML and JSON data formats.
SQL (Structured Query Language): A domain-specific language for managing and manipulating relational databases. It is used to query, update, and manage the data stored in them.
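As a minimal sketch of what querying and updating look like in practice, the following uses Python's built-in sqlite3 module with an in-memory database. The table name students, its columns, and the rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the "students" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO students (name, score) VALUES (?, ?)",
    [("Ana", 82), ("Ben", 65), ("Cho", 90)],
)

# Updating: change one student's score.
conn.execute("UPDATE students SET score = ? WHERE name = ?", (70, "Ben"))

# Querying: students scoring at least 80, highest score first.
rows = conn.execute(
    "SELECT name, score FROM students WHERE score >= 80 ORDER BY score DESC"
).fetchall()
print(rows)
conn.close()
```

The fixed columns declared in CREATE TABLE are the predefined schema that makes this structured data.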
JSON (JavaScript Object Notation): A lightweight data interchange format that is easy for humans to read and write, and easy for machines to parse and generate. Commonly used for web APIs and configuration files.
Example of JSON Data: Consider data from a weather API represented in JSON format. Each record contains key-value pairs (such as "temperature": 25 and "humidity": 60), but the set of keys can vary from one record to the next, which is what makes the format semi-structured.
{
  "date": "2023-01-15",
  "temperature": 25,
  "humidity": 60,
  "wind": { "speed": 10, "direction": "NE" }
}
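A record like this can be read with Python's standard json module. The sketch below parses the same weather record and reads a nested value. The key "rainfall" is a hypothetical field that this record does not contain; it is included only to show how a missing key can be handled when records vary.

```python
import json

# The weather record from the example above, as a JSON string.
raw = (
    '{"date": "2023-01-15", "temperature": 25, "humidity": 60, '
    '"wind": {"speed": 10, "direction": "NE"}}'
)

record = json.loads(raw)  # parse the JSON text into a Python dict

temperature = record["temperature"]
wind_speed = record["wind"]["speed"]  # nested objects become nested dicts

# Records may not all share the same keys, so .get() with a default
# avoids a KeyError. "rainfall" is a hypothetical key, absent here.
rainfall = record.get("rainfall", None)

print(temperature, wind_speed, rainfall)
```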
Big Data: Refers to datasets that are too large and complex for traditional data processing applications to handle efficiently. Its characteristics include Volume, Velocity, Variety, Veracity, and Value (known as the 5 V’s). Requires specialized tools and technologies for storage, processing, and analysis.
NoSQL Databases: Databases that provide a mechanism for storage and retrieval of data that is modeled in ways other than the tabular relations used in relational databases. Suitable for handling large amounts of unstructured or semi-structured data. Examples include MongoDB, Cassandra, and Redis.
2. DESCRIPTIVE STATISTICS
Let’s use a small dataset to explain some key concepts of descriptive statistics. Suppose we have a dataset representing the scores of 10 students in a math test.
Arranging the data in ascending order: 65, 70, 75, 78, 80, 82, 85, 88, 90, 92.
Median: The median is the middle value of the sorted data. Since there are 10 data points, an even number, the median is the average of the 5th and 6th values.
Median = (80 + 82) / 2 = 81
Mode: The mode is the most frequently occurring value. In this dataset, there is no mode as each value occurs only once.
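The median and mode can be checked with Python's standard statistics module. This is a small sketch using the ten scores listed earlier; the variable names are illustrative.

```python
import statistics

scores = [65, 70, 75, 78, 80, 82, 85, 88, 90, 92]

# With an even number of values, median() averages the two middle ones.
median = statistics.median(scores)

# multimode() returns every value tied for the highest frequency.
# When each value occurs exactly once, that is the whole dataset,
# which we interpret here as "no mode".
modes = statistics.multimode(scores)
has_mode = len(modes) < len(set(scores))

print(median, has_mode)
```

Note that statistics.mode() (Python 3.8 and later) returns the first value encountered when all values are equally frequent, so multimode() is the safer way to detect the "no mode" case.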
Measures of Dispersion
Range: The difference between the largest and smallest values.
Range = Maximum Value − Minimum Value = 92 − 65 = 27
Variance
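The variance is the average squared deviation from the mean. To calculate it, find the squared difference between each score and the mean, sum them, and divide by the number of data points. This gives the population variance; the sample variance divides by one less than the number of data points instead. For brevity, let’s skip the detailed calculation here.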
Standard Deviation
The standard deviation is the square root of the variance. It indicates how far data points typically lie from the mean, expressed in the same units as the data.
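For readers who want to see the variance calculation, here is a minimal sketch that follows the steps described: squared differences from the mean, summed, then divided. It uses the ten test scores from this section and computes both the population form (divide by n) and the sample form (divide by n − 1); the variable names are illustrative.

```python
import math

scores = [65, 70, 75, 78, 80, 82, 85, 88, 90, 92]
n = len(scores)

mean = sum(scores) / n  # 805 / 10 = 80.5

# Squared difference between each score and the mean.
squared_diffs = [(x - mean) ** 2 for x in scores]

population_variance = sum(squared_diffs) / n        # 688.5 / 10
sample_variance = sum(squared_diffs) / (n - 1)      # 688.5 / 9

# Standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)     # about 8.3

print(mean, population_variance, sample_variance, population_std)
```

The standard library offers the same results through statistics.pvariance, statistics.variance, and statistics.pstdev.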
Interquartile Range (IQR): The IQR is the range covered by the middle 50% of the data.
IQR = Q3 − Q1, where
Q1 is the first quartile (25th percentile) and
Q3 is the third quartile (75th percentile).
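The quartiles can be computed with statistics.quantiles from the Python standard library. Several conventions exist for locating quartiles in a small dataset, and they give slightly different answers; the sketch below assumes the "inclusive" method, which treats the data as the whole population and interpolates linearly between neighbouring values.

```python
import statistics

scores = [65, 70, 75, 78, 80, 82, 85, 88, 90, 92]

# n=4 splits the data into quartiles and returns the three cut points.
# method="inclusive" is an assumption; the default "exclusive" method
# or a median-of-halves approach would give different Q1 and Q3 values.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1

print(q1, q3, iqr)  # 75.75 87.25 11.5
```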
3. VISUALISATION
In data analytics, data can be displayed graphically to make it easier to understand.
Bar Charts: Display categorical data with rectangular bars. Each bar’s length corresponds to the frequency or count of the category.
Histograms: Similar to bar charts but for continuous data. They divide the data into bins and represent the frequency of observations in each bin.
Scatter Plots: Represent the relationship between two continuous variables. Each point on the plot corresponds to a single observation.
Box Plots (Box-and-Whisker Plots): Display the distribution of a dataset by showing the median, the quartiles, and potential outliers.
Line Charts: Display data points over a continuous interval. Useful for showing trends or patterns over time.
Pie Charts: Represent parts of a whole. Useful for displaying the proportion of each category in a dataset.
Heatmaps: A visual representation of data in a matrix format. Colors represent the values of the data points.
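To make the binning step behind a histogram concrete, here is a small text-only sketch using the Python standard library. It reuses the ten test scores from the descriptive statistics section; the bin width of 10 is an assumption chosen for illustration, and a plotting library would normally draw the bars instead of printing them.

```python
from collections import Counter

scores = [65, 70, 75, 78, 80, 82, 85, 88, 90, 92]
bin_width = 10  # assumed bin width, for illustration only

# Map each score to the lower edge of its bin, then count per bin.
counts = Counter((s // bin_width) * bin_width for s in scores)

# Print one row per bin; the number of '#' marks is the frequency.
for start in sorted(counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```

Changing bin_width changes the shape of the histogram, which is why the choice of bins matters when interpreting one.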
The video below illustrates the use of these charts in data analytics.