Exercise 1: Introduction to Python
In this first exercise of Data Handling in Python, we will cover basic Python operations, installing and importing necessary libraries, loading data into a pandas DataFrame, and performing initial data exploration of the Scottish Health Survey (SHeS) dataset.
Getting started
It is important to keep all files organised for efficient data management and project workflow. To get started, create a new folder named "data-handling-in-python" as your working directory for this tutorial.
Next, create a new notebook using Jupyter Notebook, naming it "exercise-1". This will be where you perform the exercises for this tutorial.
Arithmetic Operations in Python
You can use Python to do basic operations that you would do on a calculator. For example, open up "exercise-1.ipynb" in Jupyter Notebook and type the following lines of code:
5 + 5 # addition
10
4 * 3 # multiplication
12
10 / 2 # division
5.0
2 ** 5 # exponent
32
13 % 4 # modulus
1
10 // 3 # floor (integer) division
3
Note: Because Jupyter Notebook is an interactive environment, you can simply type an expression and see its result (only the last expression in a cell is displayed automatically). In a script or other non-interactive environment, you need to call print() to see the output, for example:
print(2 + 7)
9
Try the following examples:
7 * (4 - 2)
14
abs(-6)
6
abs(7)
7
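Floor division (//) and modulus (%), shown earlier, are two halves of the same calculation: the quotient and the remainder. As a brief sketch, with numbers chosen purely for illustration, the built-in divmod() function returns both at once. The example also shows that // rounds down rather than towards zero when the result is negative:

```python
# Quotient and remainder of 13 divided by 4
quotient = 13 // 4      # floor division
remainder = 13 % 4      # modulus

# divmod() returns the same pair in a single call
assert divmod(13, 4) == (quotient, remainder)

# The two results always reconstruct the original number
assert quotient * 4 + remainder == 13

# Floor division rounds down, so negative results move away from zero
negative_quotient = -7 // 2
negative_remainder = -7 % 2

print(quotient, remainder)                    # 3 1
print(negative_quotient, negative_remainder)  # -4 1
```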
In addition to these built-in operators and functions, Python provides a math library that offers a wide range of mathematical functions. First, you need to import the math library:
import math
Once imported, you can use functions from the math library to perform advanced mathematical operations. Here are some examples:
math.sqrt(16) # square root
4.0
math.factorial(5) # factorial
120
math.pi / 2 # the constant pi, divided by 2
1.5707963267948966
math.log(1) # natural logarithm
0.0
math.log10(100) # logarithm with base 10
2.0
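A few more functions from the math library follow the same pattern. The selection below is only an illustrative sample, not a complete list:

```python
import math

rounded_down = math.floor(3.7)   # largest integer less than or equal to 3.7
rounded_up = math.ceil(3.2)      # smallest integer greater than or equal to 3.2
log_base_2 = math.log2(8)        # logarithm with base 2
hypotenuse = math.hypot(3, 4)    # square root of 3**2 + 4**2

print(rounded_down, rounded_up, log_base_2, hypotenuse)  # 3 4 3.0 5.0
```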
Assignment Operators in Python
In Python, you can use variables to store information. We use the assignment operator = to assign values to variables. Try out the following:
x = 30 / 5
Here, we are assigning the result of the operation 30 / 5 to the variable x.
We can see the actual value by printing the variable x:
print(x)
6.0
You can then use the variable to do subsequent computations, e.g.:
x * 11
66.0
If you assign a different value to the same variable name, the original value is replaced and lost, so choose your variable names carefully!
Note: Python is case sensitive, so the variable x is not the same as the variable X. You will get an error if you use the wrong case:
print(X) # This will raise a NameError since 'X' is not defined
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 print(X)

NameError: name 'X' is not defined
You can assign variables a value of any type, not just numbers. For example, you can store a string of characters by enclosing it in quotation marks:
course = "Data Handling in Python"
print(course)
Data Handling in Python
However, you can't combine incompatible types, such as a number and a string, in an arithmetic operation:
x + course # This will raise a TypeError
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 x + course

TypeError: unsupported operand type(s) for +: 'float' and 'str'
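One way around this error is to convert values explicitly before combining them. The sketch below uses the built-in str(), float() and int() functions; the variable names label, units_text, total_units and whole_number are made up for this example:

```python
x = 30 / 5
course = "Data Handling in Python"

# Convert the number to a string before joining it to other text
label = course + " - exercise " + str(x)
print(label)  # Data Handling in Python - exercise 6.0

# Convert text that contains a number before doing arithmetic
units_text = "17.5"
total_units = float(units_text) + 2
print(total_units)  # 19.5

whole_number = int("30") // 5
print(whole_number)  # 6
```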
You can ask Python for the type of a variable's value by using the type() function:
type(x)
float
type(course)
str
type(math.sqrt)
builtin_function_or_method
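When you want to test for a type rather than just display it, the built-in isinstance() function answers with True or False. A small sketch, in which the result variable names are illustrative:

```python
x = 30 / 5
course = "Data Handling in Python"

x_is_float = isinstance(x, float)
course_is_str = isinstance(course, str)
x_is_number = isinstance(x, (int, float))   # matches either type in the tuple
course_is_float = isinstance(course, float)

print(x_is_float, course_is_str, x_is_number, course_is_float)  # True True True False
```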
Comparison Operators in Python
Comparison operators are used to compare values and return boolean results (True or False). Try the following examples in Python:
5 > 2 # greater than
True
6 < 4 # less than
False
11 >= 15 # greater than or equal to
False
10 <= 10 # less than or equal to
True
2**3 == 8 # equal to
True
6/2 != 4 # not equal to
True
You can assign comparison results to a variable:
op = (2**3 == 8)
type(op)
bool
As you can see, the results of these operations are all True/False (Boolean) values. In Python, these objects are of class bool.
If you perform arithmetic operations on Booleans, True is treated as 1 and False as 0.
True + 10
11
False + 2
2
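This behaviour is handy for counting how many values satisfy a condition. A minimal sketch, using a short made-up list of weekly alcohol units (not taken from the survey data) and an illustrative threshold of 14:

```python
# Hypothetical weekly alcohol units for five people
weekly_units = [0.0, 17.5, 1.91, 0.0, 22.04]

# Each comparison gives True or False
above_threshold = [units > 14 for units in weekly_units]
print(above_threshold)  # [False, True, False, False, True]

# Summing Booleans counts the True values, because True is 1 and False is 0
count_above = sum(above_threshold)
print(count_above)  # 2
```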
Logical Operators in Python
You can use logical operators to combine conditional statements.
For example, x and y returns True if both x and y are True. If either x or y is False, the operation returns False. The and operator is called the logical AND operator.
In contrast, x or y returns True if either x or y is True. The operation therefore returns False only if both x and y are False. The or operator is called the logical OR operator.
x = True
y = False
x and y # logical AND operator
False
x or y # logical OR operator
True
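Logical operators are most useful when combined with comparisons. The sketch below also uses not, which inverts a Boolean. The variable names and thresholds are hypothetical and chosen only to illustrate the syntax:

```python
# Made-up values for one person
daily_portions = 3.19
weekly_units = 17.5

# True only if both comparisons are True
meets_both = (daily_portions >= 5) and (weekly_units <= 14)
print(meets_both)  # False

# True if at least one comparison is True
meets_either = (daily_portions >= 3) or (weekly_units <= 14)
print(meets_either)  # True

# not flips True to False and False to True
meets_neither = not meets_either
print(meets_neither)  # False
```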
Save the "exercise-1" notebook and return to the Jupyter dashboard. Create a new notebook, naming it "scottish_health_survey". The rest of this tutorial will be completed within this notebook.
Install and Import Libraries
To work effectively with data in Python, you will need some additional libraries. The most commonly used ones are pandas for data manipulation, numpy for numerical operations, and matplotlib for data visualization. Follow these steps to install and import these libraries.
Step 1: Install the Libraries
You can use pip (the Python package installer) to install these libraries. Open your terminal or command prompt and run the following commands:
pip install pandas
pip install numpy
pip install matplotlib
These commands will download and install the libraries from the Python Package Index (PyPI), where most Python packages can be found.
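If you are unsure whether an installation succeeded, one option is to ask Python whether it can locate each package. This is a sketch using importlib.util.find_spec() from the standard library; the names packages and installed are illustrative:

```python
import importlib.util

packages = ["pandas", "numpy", "matplotlib"]

# find_spec() returns None when a package cannot be found
installed = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, found in installed.items():
    status = "installed" if found else "missing (try: pip install " + name + ")"
    print(name, "->", status)
```

Run this in a notebook cell; any package reported as missing can be installed with the pip command shown above.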
Step 2: Import the Libraries
Once the libraries are installed, you can import them into the "scottish_health_survey" notebook using the following commands:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We use the as keyword to assign aliases to libraries, such as pd for pandas and np for numpy. This practice shortens the code when referencing these libraries and enhances readability.
Import the dataset
For this tutorial, we will be using a dataset collected by the Scottish Health Survey, containing data on various health indicators, behaviors, and outcomes for the Scottish population.
Download the CSV file "scottish_health_data.csv" from the MANTRA website and move it to your working directory.
Next, import the dataset using the read_csv() function from the pandas library, naming it health_survey.
health_survey = pd.read_csv("scottish_health_data.csv")
The health_survey DataFrame should now be loaded into your Python environment.
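To see what read_csv() does without needing the real file, the sketch below reads a tiny CSV from an in-memory string. The three rows and the shortened column name weekly_units are invented for illustration and are not part of the survey data:

```python
import io
import pandas as pd

# A made-up CSV, standing in for a file on disk
csv_text = """id,year,sex,weekly_units
1,2021,Female,0.0
2,2012,Male,17.5
3,2019,Female,1.91
"""

# read_csv() accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(isinstance(sample, pd.DataFrame))  # True
print(list(sample.columns))              # ['id', 'year', 'sex', 'weekly_units']
print(len(sample))                       # 3
```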
pandas provides several methods and attributes for an initial exploration of the data, which help you get a feel for the dataset's features.
For example, the head() method displays the first five rows of the dataset.
health_survey.head()
| | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021 | Female | 3.40 | 0.00 | 3.40 | 72.2 |
| 1 | 2 | 2012 | Male | 3.19 | 17.50 | 3.19 | 76.8 |
| 2 | 3 | 2019 | Female | 3.30 | 1.91 | 3.30 | 78.3 |
| 3 | 4 | 2009 | Male | 3.30 | 0.00 | 3.30 | 71.2 |
| 4 | 5 | 2010 | Female | 3.20 | 22.04 | 3.20 | 71.2 |
Our dataset provides details on the year the data was collected, the sex of the individuals, their average daily consumption of fruits and vegetables, their average weekly alcohol consumption, their mental wellbeing scores, and their self-assessed general health status.
The tail() method displays the last five rows of the dataset:
health_survey.tail()
| | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|---|---|---|---|---|---|---|---|
| 626 | 627 | 2011 | Female | 3.18 | 9.9 | 3.18 | 71.5 |
| 627 | 628 | 2019 | Female | 3.20 | 14.8 | 3.20 | 70.4 |
| 628 | 629 | 2010 | Female | 3.21 | 8.6 | 3.21 | 72.4 |
| 629 | 630 | 2012 | Female | 3.15 | 16.9 | 3.15 | 70.0 |
| 630 | 11 | 2008 | Male | 3.16 | 20.3 | 3.16 | 77.0 |
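Both head() and tail() accept an optional number of rows. A small sketch on a made-up DataFrame (the values are placeholders, not survey data):

```python
import pandas as pd

# Hypothetical data with ten rows
sample = pd.DataFrame({
    "id": list(range(1, 11)),
    "year": [2008 + i for i in range(10)],
})

first_three = sample.head(3)   # first three rows
last_two = sample.tail(2)      # last two rows

print(first_three["id"].tolist())  # [1, 2, 3]
print(last_two["id"].tolist())     # [9, 10]
```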
Use the shape attribute to find out the dimensions of the dataset:
health_survey.shape
(631, 7)
This tells us that the dataset has 631 rows and 7 columns.
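Because shape is a tuple, you can unpack it into separate variables. A sketch on a small made-up DataFrame, where n_rows and n_cols are illustrative names:

```python
import pandas as pd

# Hypothetical data with three rows and two columns
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "sex": ["Female", "Male", "Female"],
})

n_rows, n_cols = sample.shape
print(n_rows, n_cols)  # 3 2

# len() on a DataFrame gives the number of rows
print(len(sample))     # 3
```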
Use the info() method to get a concise summary of the dataset, including the data type of each column:
health_survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631 entries, 0 to 630
Data columns (total 7 columns):
 #   Column                                           Non-Null Count  Dtype
---  ------                                           --------------  -----
 0   id                                               631 non-null    int64
 1   year                                             631 non-null    int64
 2   sex                                              631 non-null    object
 3   fruit_vegetable_consumption_mean_daily_portions  629 non-null    float64
 4   alcohol_consumption_mean_weekly_units            629 non-null    float64
 5   mental_wellbeing                                 626 non-null    float64
 6   self_assessed_health                             628 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 34.6+ KB
The columns "fruit_vegetable_consumption_mean_daily_portions", "alcohol_consumption_mean_weekly_units", "mental_wellbeing", and "self_assessed_health" are all floating-point data types (float64), "id" and "year" are integer data types (int64), and "sex" is a string data type (object). The non-null counts also show that the four floating-point columns have some missing values. We will address these later.
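To count missing values directly rather than reading them off the info() output, one common approach is to chain isna() and sum(). The DataFrame below is invented for illustration and uses shortened column names:

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values
sample = pd.DataFrame({
    "year": [2021, 2012, 2019],
    "weekly_units": [0.0, np.nan, 1.91],
    "wellbeing": [np.nan, 3.19, 3.30],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = sample.isna().sum()

# Convert to plain Python integers for a tidy printout
missing_counts = {column: int(count) for column, count in missing_per_column.items()}
print(missing_counts)  # {'year': 0, 'weekly_units': 1, 'wellbeing': 1}
```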
Finally, use the describe() method to get summary statistics for each numeric column in the dataset:
health_survey.describe()
| | id | year | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|---|---|---|---|---|---|---|
| count | 631.000000 | 631.000000 | 629.000000 | 629.000000 | 626.000000 | 628.000000 |
| mean | 315.017433 | 2014.632330 | 3.209300 | 15.095246 | 3.209265 | 73.832643 |
| std | 182.268644 | 4.261064 | 0.087552 | 13.291717 | 0.087702 | 2.558671 |
| min | 1.000000 | 2008.000000 | 2.700000 | 0.000000 | 2.700000 | 67.000000 |
| 25% | 157.500000 | 2011.000000 | 3.180000 | 9.300000 | 3.180000 | 72.000000 |
| 50% | 315.000000 | 2014.000000 | 3.200000 | 13.000000 | 3.200000 | 74.000000 |
| 75% | 472.500000 | 2018.000000 | 3.240000 | 16.700000 | 3.240000 | 75.700000 |
| max | 630.000000 | 2022.000000 | 3.600000 | 126.900000 | 3.600000 | 79.500000 |
For numeric columns (both integer and floating-point), this summary provides the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum values of each column. By default, non-numeric columns such as "sex" are left out.
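Text columns can be summarised too. As a sketch, calling describe() on a single text column reports the count, the number of unique values, the most frequent value and its frequency, while value_counts() gives the full breakdown. The data here is made up:

```python
import pandas as pd

# Hypothetical data, not taken from the survey
sample = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Female"],
    "year": [2021, 2012, 2019, 2009],
})

# Summary of one text column: count, unique, top, freq
text_summary = sample["sex"].describe()
print(text_summary)

# Number of rows in each category
counts = sample["sex"].value_counts()
print(counts["Female"], counts["Male"])  # 3 1
```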