Exercise 1: Introduction to Python

In this first exercise of Data Handling in Python, we will cover basic Python operations, installing and importing necessary libraries, loading data into a pandas DataFrame, and performing initial data exploration of the Scottish Health Survey (SHeS) dataset.

Getting started

It is important to keep all files organised for efficient data management and project workflow. To get started, create a new folder named "data-handling-in-python" as your working directory for this tutorial.

Next, create a new notebook using Jupyter Notebook, naming it "exercise-1". This will be where you perform the exercises for this tutorial.

Arithmetic Operations in Python

You can use Python to do basic operations that you would do on a calculator. For example, open up "exercise-1.ipynb" in Jupyter Notebook and type the following lines of code:

In [1]:
5 + 5  # addition
Out[1]:
10
In [2]:
4 * 3  # multiplication
Out[2]:
12
In [3]:
10 / 2  # division
Out[3]:
5.0
In [4]:
2 ** 5  # exponent
Out[4]:
32
In [5]:
13 % 4  # modulus
Out[5]:
1
In [6]:
10 // 3  # floor (integer) division
Out[6]:
3
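
The floor division and modulus operators fit together: for whole numbers a and b, (a // b) * b + a % b gives back a. The short sketch below checks this with the built-in divmod() function. The variable names are chosen only for illustration.

```python
a, b = 13, 4

quotient = a // b    # floor division: how many whole times b fits into a
remainder = a % b    # modulus: what is left over

# divmod() returns both results at once as a pair
print(divmod(a, b))                    # (3, 1)
print(quotient * b + remainder == a)   # True

# Floor division rounds down, which matters for negative numbers
print(-7 // 2)   # -4
```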

Note: Since you are using Jupyter Notebook, which is an interactive environment, you can simply type an expression and see the result. However, if you run the code in a script or another non-interactive environment, you will need to use print() to see the output, for example:

In [7]:
print(2 + 7)
9
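
A related point is that a notebook cell echoes only its last bare expression, so print() is also handy when one cell should show several values. A minimal sketch, with variable names made up for the example:

```python
total = 2 + 7
product = 4 * 3

# Without print(), only the final expression in a cell would be displayed.
print(total)     # 9
print(product)   # 12
```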

Try the following examples:

In [8]:
7 * (4 - 2)
Out[8]:
14
In [9]:
abs(-6)
Out[9]:
6
In [10]:
abs(7)
Out[10]:
7

In addition to these built-in operators and functions, Python provides a math library that offers a wide range of mathematical functions. First, you need to import the math library:

In [11]:
import math

Once imported, you can use functions from the math library to perform advanced mathematical operations. Here are some examples:

In [12]:
math.sqrt(16)  # square root
Out[12]:
4.0
In [13]:
math.factorial(5) # factorial
Out[13]:
120
In [14]:
math.pi / 2  # half of pi
Out[14]:
1.5707963267948966
In [15]:
math.log(1) # natural logarithm
Out[15]:
0.0
In [16]:
math.log10(100) # logarithm with base 10
Out[16]:
2.0
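
A few more functions from the math library are sketched below as a supplement to the examples above. They are a small selection and not a complete list of what the library offers.

```python
import math

print(math.floor(3.7))    # 3, rounds down to the nearest whole number
print(math.ceil(3.2))     # 4, rounds up to the nearest whole number
print(math.log2(8))       # 3.0, logarithm with base 2
print(math.hypot(3, 4))   # 5.0, hypotenuse of a right triangle with sides 3 and 4
```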

Assignment Operators in Python

In Python, you can use variables to store information. We use the assignment operator = to assign values to variables. Try out the following:

In [17]:
x = 30 / 5

Here, we are assigning the variable x the value of the result of the operation 30 / 5.

We can see the actual value by printing the variable x:

In [18]:
print(x)
6.0

You can then use the variable to do subsequent computations, e.g.:

In [19]:
x * 11
Out[19]:
66.0

If you assign a different value to an existing variable name, the new value replaces the old one, and the old value is lost. So, be careful in naming your variables!
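
The effect of reassignment can be seen in a small sketch. The names portions, weekly_units and daily_units are hypothetical and are used here so that the variable x from the cells above is left unchanged.

```python
portions = 3.4
print(portions)   # 3.4

portions = 5      # the old value 3.4 is replaced and can no longer be reached
print(portions)   # 5

# Storing each result under its own descriptive name avoids accidental overwriting
weekly_units = 14
daily_units = weekly_units / 7
print(daily_units)   # 2.0
```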

Note: Python is case sensitive, so the variable x is not the same as the variable X. You will get an error if you use the wrong case:

In [20]:
print(X)  # This will raise a NameError since 'X' is not defined
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 print(X)

NameError: name 'X' is not defined

You can assign variables a value of any type, not just numbers. For example, you can store a string of characters by enclosing it in quotation marks:

In [21]:
course = "Data Handling in Python"
print(course)
Data Handling in Python

However, you can't combine incompatible types, such as a number and a string, in an arithmetic operation:

In [22]:
x + course  # This will raise a TypeError
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 x + course

TypeError: unsupported operand type(s) for +: 'float' and 'str'
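
One common way around this error is to convert one of the values first, using the built-in str() and float() functions. The sketch below assumes x and course hold the values assigned earlier; label and units are names introduced only for this example.

```python
x = 30 / 5
course = "Data Handling in Python"

# Convert the number to text before joining it to another string
label = course + ", week " + str(x)
print(label)   # Data Handling in Python, week 6.0

# Convert text that holds a number before doing arithmetic with it
units = "17.5"
print(float(units) + x)   # 23.5
```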

You can ask Python what type a certain variable's value is by using the type() function:

In [23]:
type(x)
Out[23]:
float
In [24]:
type(course)
Out[24]:
str
In [25]:
type(math.sqrt)
Out[25]:
builtin_function_or_method
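
When you want a True/False answer about a type instead of its name, the built-in isinstance() function is an alternative to type(). A brief sketch, assuming x and course hold the values assigned earlier:

```python
x = 30 / 5
course = "Data Handling in Python"

print(isinstance(x, float))               # True
print(isinstance(course, str))            # True

# A tuple of types checks for any one of them
print(isinstance(x, (int, float)))        # True
print(isinstance(course, (int, float)))   # False
```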

Comparison Operators in Python

Comparison operators are used to compare values and return boolean results (True or False). Try the following examples in Python:

In [26]:
5 > 2     # greater than
Out[26]:
True
In [27]:
6 < 4     # less than
Out[27]:
False
In [28]:
11 >= 15  # greater than or equal to
Out[28]:
False
In [29]:
10 <= 10  # less than or equal to
Out[29]:
True
In [30]:
2**3 == 8 # equal to
Out[30]:
True
In [31]:
6/2 != 4  # not equal to
Out[31]:
True

You can assign comparison results to a variable:

In [32]:
op = (2**3 == 8)
type(op)
Out[32]:
bool

As you can see, the outputs of these operations are all True/False (boolean) values. In Python, these objects are of class bool.

If you try to perform arithmetic operations on booleans, True becomes 1 and False becomes 0.

In [33]:
True + 10
Out[33]:
11
In [34]:
False + 2
Out[34]:
2
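
Because True counts as 1 and False as 0, adding up comparison results is a quick way to count how many values meet a condition. The list and the threshold of 14 below are made-up values for illustration only.

```python
weekly_units = [0.0, 17.5, 1.91, 0.0, 22.04]   # illustrative values

# Each comparison yields True or False; sum() treats them as 1 or 0
n_above = sum(units > 14 for units in weekly_units)
print(n_above)                       # 2

# Dividing by the number of values gives the proportion
print(n_above / len(weekly_units))   # 0.4
```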

Logical Operators in Python

You can use logical operators to combine conditional statements.

For example, x and y returns True if both x and y are True. If either x or y is False, the operation will return False. This operator and is called the logical AND operator.

In contrast, x or y returns True if either x or y is True. Therefore, the operation will only return False if both x and y are False. This operator or is called the logical OR operator.

In [35]:
x = True
y = False
In [36]:
x and y  # logical AND operator
Out[36]:
False
In [37]:
x or y   # logical OR operator
Out[37]:
True
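
Python also has a third logical operator, not, which reverses a boolean value. The sketch below shows it alongside and/or applied to comparisons. The variable year and its value are assumptions made for the example.

```python
x = True
y = False

print(not x)   # False
print(not y)   # True

# Logical operators are most often used to combine comparisons
year = 2012
print(year >= 2008 and year <= 2022)   # True
print(2008 <= year <= 2022)            # True, a chained form of the line above
print(year < 2008 or year > 2022)      # False
```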

Save the "exercise-1" notebook and return to the Jupyter dashboard. Create a new notebook, naming it "scottish_health_survey". The rest of this tutorial will be completed within this notebook.

Install and Import Libraries

To work effectively with data in Python, you will need some additional libraries. The most commonly used ones are pandas for data manipulation, numpy for numerical operations, and matplotlib for data visualization. Follow these steps to install and import these libraries.

Step 1: Install the Libraries

You can use pip (the Python package installer) to install these libraries. Open your terminal or command prompt and run the following commands:

In [ ]:
pip install pandas
pip install numpy
pip install matplotlib

These commands will download and install the libraries from the Python Package Index (PyPI), where most Python packages can be found.

Step 2: Import the Libraries

Once the libraries are installed, you can import them into the "scottish_health_survey" notebook using the following commands:

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We use the as keyword to assign aliases to libraries, such as pd for pandas, np for numpy, and plt for matplotlib.pyplot. This practice shortens the code when referencing these libraries and enhances readability.
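
If an import fails with a ModuleNotFoundError, the library is probably not installed in the environment the notebook is running in. One possible way to check, sketched here with the standard importlib module, is to ask Python whether it can locate each package. The status dictionary is a name introduced for this example.

```python
import importlib.util

# find_spec() returns None when a top-level package cannot be located
status = {}
for name in ["pandas", "numpy", "matplotlib"]:
    status[name] = importlib.util.find_spec(name) is not None

for name, installed in status.items():
    print(name, "is installed" if installed else "is NOT installed")
```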

Import the dataset

For this tutorial, we will be using a dataset collected by the Scottish Health Survey, containing data on various health indicators, behaviors, and outcomes for the Scottish population.

Download the CSV file "scottish_health_data.csv" from the MANTRA website and move it to your working directory.

Next, import the dataset using the read_csv() function from the pandas library, naming it health_survey.

In [39]:
health_survey = pd.read_csv("scottish_health_data.csv")
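
As a rough illustration of what read_csv() does, the sketch below parses a few lines of CSV text held in memory instead of a file. The rows are hand-typed illustrative values, and the comma separator and the names csv_text and sample are assumptions made for the example, not details of the actual file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file: a header line followed by three rows
csv_text = """id,year,sex
1,2021,Female
2,2012,Male
3,2019,Female
"""

# read_csv() uses the first line as column names and infers each column's type
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```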

The health_survey DataFrame should now be loaded into your Python environment.

pandas provides several methods and attributes for initial exploration of a DataFrame, which help you get a feel for the dataset's features.

For example, the head() method displays the first five rows of the dataset.

In [40]:
health_survey.head()
Out[40]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
0 1 2021 Female 3.40 0.00 3.40 72.2
1 2 2012 Male 3.19 17.50 3.19 76.8
2 3 2019 Female 3.30 1.91 3.30 78.3
3 4 2009 Male 3.30 0.00 3.30 71.2
4 5 2010 Female 3.20 22.04 3.20 71.2

Our dataset provides details on the year the data was collected, the sex of the individuals, their average daily consumption of fruits and vegetables, their average weekly alcohol consumption, their mental wellbeing scores, and their self-assessed general health status.

The tail() method displays the last five rows of the dataset:

In [41]:
health_survey.tail()
Out[41]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
626 627 2011 Female 3.18 9.9 3.18 71.5
627 628 2019 Female 3.20 14.8 3.20 70.4
628 629 2010 Female 3.21 8.6 3.21 72.4
629 630 2012 Female 3.15 16.9 3.15 70.0
630 11 2008 Male 3.16 20.3 3.16 77.0
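
Both head() and tail() accept an optional number of rows, so you are not limited to five. The sketch below uses a small hand-built DataFrame named toy (a hypothetical stand-in so the example runs without the CSV file).

```python
import pandas as pd

# Small illustrative DataFrame, not the full survey data
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "year": [2021, 2012, 2019, 2009, 2010],
    "sex": ["Female", "Male", "Female", "Male", "Female"],
})

print(toy.head(2))   # first two rows
print(toy.tail(3))   # last three rows
```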

Use the shape attribute to find out the dimensions of the dataset:

In [42]:
health_survey.shape
Out[42]:
(631, 7)

This tells us that the dataset has 631 rows and 7 columns.
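
Since shape is a tuple of (rows, columns), its two parts can be stored in separate variables for later use. A minimal sketch on a hypothetical three-row DataFrame named toy:

```python
import pandas as pd

# Small illustrative DataFrame, not the full survey data
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2021, 2012, 2019],
})

n_rows, n_cols = toy.shape   # unpack the (rows, columns) tuple
print(n_rows)     # 3
print(n_cols)     # 2
print(len(toy))   # 3, len() counts rows only
```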

Use the info() method to get a concise summary of the dataset, including the number of non-null values and the data type of each column:

In [43]:
health_survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631 entries, 0 to 630
Data columns (total 7 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   id                                               631 non-null    int64  
 1   year                                             631 non-null    int64  
 2   sex                                              631 non-null    object 
 3   fruit_vegetable_consumption_mean_daily_portions  629 non-null    float64
 4   alcohol_consumption_mean_weekly_units            629 non-null    float64
 5   mental_wellbeing                                 626 non-null    float64
 6   self_assessed_health                             628 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 34.6+ KB

The columns "fruit_vegetable_consumption_mean_daily_portions", "alcohol_consumption_mean_weekly_units", "mental_wellbeing", and "self_assessed_health" are all floating-point data types (float64), "id" and "year" are integer data types (int64), and "sex" is a string data type (object). We can also see that there are some missing values in the numeric columns. We will address these later.
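
The missing values can also be counted directly instead of being worked out from the non-null counts. The sketch below combines isna() and sum() on a small made-up DataFrame. The column names and values are hypothetical, and np.nan marks a missing entry.

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame with a few missing values
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "alcohol_units": [0.0, np.nan, 1.91, 17.5],
    "wellbeing": [3.4, 3.19, np.nan, np.nan],
})

# isna() flags each missing cell as True; sum() then counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same two calls on health_survey should give counts that agree with the info() output, for example 631 - 629 = 2 missing values in a column with 629 non-null entries.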

Finally, use the describe() method to get summary statistics for each column in the dataset:

In [44]:
health_survey.describe()
Out[44]:
id year fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
count 631.000000 631.000000 629.000000 629.000000 626.000000 628.000000
mean 315.017433 2014.632330 3.209300 15.095246 3.209265 73.832643
std 182.268644 4.261064 0.087552 13.291717 0.087702 2.558671
min 1.000000 2008.000000 2.700000 0.000000 2.700000 67.000000
25% 157.500000 2011.000000 3.180000 9.300000 3.180000 72.000000
50% 315.000000 2014.000000 3.200000 13.000000 3.200000 74.000000
75% 472.500000 2018.000000 3.240000 16.700000 3.240000 75.700000
max 630.000000 2022.000000 3.600000 126.900000 3.600000 79.500000

For numeric columns (both integer and floating-point), this summary provides the count, mean, standard deviation, minimum, 25th percentile, median (the 50% row), 75th percentile, and maximum of each column. Non-numeric columns such as "sex" are left out by default.
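
Two small follow-ups are sketched below on a hypothetical five-row DataFrame named toy: confirming that the 50% row matches the median, and summarising a text column with value_counts(), which describe() skips by default.

```python
import pandas as pd

# Small illustrative DataFrame, not the full survey data
toy = pd.DataFrame({
    "year": [2021, 2012, 2019, 2009, 2010],
    "sex": ["Female", "Male", "Female", "Male", "Female"],
})

summary = toy.describe()            # numeric columns only, so just "year"
print(summary.loc["50%", "year"])   # 2012.0
print(toy["year"].median())         # 2012.0

# value_counts() tallies how often each category appears
sex_counts = toy["sex"].value_counts()
print(sex_counts)
```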