Exercise 1: Introduction to Python

In this first exercise of Data Handling in Python, we will cover basic Python operations, installing and importing necessary libraries, loading data into a pandas DataFrame, and performing initial data exploration of the Scottish Health Survey (SHeS) dataset.

Getting started

It is important to keep all files organised for efficient data management and project workflow. To get started, create a new folder named "data-handling-in-python" as your working directory for this tutorial.

Next, create a new notebook using Jupyter Notebook, naming it "exercise-1". This will be where you perform the exercises for this tutorial.

Arithmetic Operations in Python

You can use Python to do basic operations that you would do on a calculator. For example, open up "exercise-1.ipynb" in Jupyter Notebook and type the following lines of code:

In [1]:

5 + 5  # addition

Out[1]:

10

In [2]:

4 * 3  # multiplication

Out[2]:

12

In [3]:

10 / 2  # division

Out[3]:

5.0

In [4]:

2 ** 5  # exponent

Out[4]:

32

In [5]:

13 % 4  # modulus

Out[5]:

1

In [6]:

10 // 3  # integer division

Out[6]:

3
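As a brief aside, integer division and modulus fit together: the quotient from // and the remainder from % can always be combined to rebuild the original number. The sketch below uses illustrative variable names that are not part of the tutorial. It also shows that // rounds down (towards negative infinity) rather than towards zero, which matters for negative numbers.

```python
# Illustrative values; the names are chosen for this sketch only
dividend = 10
divisor = 3

quotient = dividend // divisor   # 3
remainder = dividend % divisor   # 1

# The two results always rebuild the original number
print(quotient * divisor + remainder == dividend)  # True

# divmod() returns both results at once as a tuple
print(divmod(dividend, divisor))  # (3, 1)

# Integer division rounds down (towards negative infinity)
negative_quotient = -7 // 2
print(negative_quotient)  # -4
```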

Note: Because Jupyter Notebook is an interactive environment, you can simply type an expression and see its result. However, if you run the code in a script or another non-interactive environment, you will need to use print() to see the output, for example:

In [7]:

print(2 + 7)

9

Try the following examples:

In [8]:

7 * (4 - 2)

Out[8]:

14

In [9]:

abs(-6)

Out[9]:

6

In [10]:

abs(7)

Out[10]:

7

In addition to these built-in operators and functions, Python provides a math library that offers a wide range of mathematical functions. First, you need to import the math library:

In [11]:

import math

Once imported, you can use functions from the math library to perform advanced mathematical operations. Here are some examples:

In [12]:

math.sqrt(16)  # square root

Out[12]:

4.0

In [13]:

math.factorial(5)  # factorial

Out[13]:

120

In [14]:

math.pi / 2  # pi divided by 2

Out[14]:

1.5707963267948966

In [15]:

math.log(1) # natural logarithm

Out[15]:

0.0

In [16]:

math.log10(100) # logarithm with base 10

Out[16]:

2.0
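For bases other than e and 10, math.log() accepts an optional second argument giving the base. The following is a small sketch with illustrative variable names. Because floating-point results can differ in their last digits, it checks the values with math.isclose() instead of testing for exact equality.

```python
import math

# math.log() takes an optional second argument: the base of the logarithm
log_base_2 = math.log(8, 2)      # logarithm of 8 with base 2
log_base_10 = math.log(100, 10)  # should agree with math.log10(100)

# Compare within a small tolerance rather than with ==
print(math.isclose(log_base_2, 3.0))                # True
print(math.isclose(log_base_10, math.log10(100)))   # True
```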

Assignment Operators in Python

In Python, you can use variables to store information. We use the assignment operator = to assign values to variables. Try out the following:

In [17]:

x = 30 / 5

Here, we are assigning the variable x the value of the result of the operation 30 / 5.

We can see the actual value by printing the variable x:

In [18]:

print(x)

6.0

You can then use the variable to do subsequent computations, e.g.:

In [19]:

x * 11

Out[19]:

66.0

If you assign a different value to the same variable name, the original value is overwritten and lost. So, be careful when naming your variables!
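A minimal sketch of this behaviour, using hypothetical variable names: copying the value to a second name before reassigning is one simple way to keep it.

```python
score = 30 / 5        # score holds 6.0
saved_score = score   # keep the value under a second name

score = score + 4     # reassignment: score now holds 10.0
print(score)          # 10.0
print(saved_score)    # 6.0, unaffected by the reassignment
```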

Note: Python is case sensitive, so the variable x is not the same as the variable X. You will get an error if you use the wrong case:

In [20]:

print(X)  # This will raise a NameError since 'X' is not defined

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 print(X)

NameError: name 'X' is not defined

You can assign variables a value of any type, not just numbers. For example, you can store a string of characters by enclosing it in quotation marks:

In [21]:

course = "Data Handling in Python"
print(course)

Data Handling in Python

However, you can't combine incompatible types, such as a number and a string, in an arithmetic operation:

In [22]:

x + course  # This will raise a TypeError

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 x + course

TypeError: unsupported operand type(s) for +: 'float' and 'str'
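One way around this error is to convert one of the values explicitly so that both have compatible types. The sketch below recreates x and course so that it runs on its own; units_text is a made-up value used only for illustration.

```python
x = 30 / 5
course = "Data Handling in Python"

# Convert the number to a string before joining it to other text
label = course + ": exercise " + str(x)
print(label)  # Data Handling in Python: exercise 6.0

# Convert a numeric string to a number before doing arithmetic
units_text = "2.5"  # hypothetical value, e.g. read from a text file
total = x + float(units_text)
print(total)  # 8.5
```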

You can ask Python what type a certain variable's value is by using the type() function:

In [23]:

type(x)

Out[23]:

float

In [24]:

type(course)

Out[24]:

str

In [25]:

type(math.sqrt)

Out[25]:

builtin_function_or_method
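The type() function also explains why 10 / 2 displayed 5.0 earlier rather than 5. The short sketch below compares the two division operators and introduces isinstance(), a built-in function not used elsewhere in this exercise, which checks a value against a type. Note that print(type(...)) shows the type as <class '...'>, whereas a notebook cell output shows only the name.

```python
# Division with / always produces a float, even when the result is whole
print(type(10 / 2))   # <class 'float'>

# Integer division with // on two integers produces an int
print(type(10 // 3))  # <class 'int'>

# isinstance() checks a value against a type and returns a boolean
is_float = isinstance(10 / 2, float)
is_int = isinstance(10 // 3, int)
print(is_float, is_int)  # True True
```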

Comparison Operators in Python

Comparison operators are used to compare values and return boolean results (True or False). Try the following examples in Python:

In [26]:

5 > 2     # greater than

Out[26]:

True

In [27]:

6 < 4     # less than

Out[27]:

False

In [28]:

11 >= 15  # greater than or equal to

Out[28]:

False

In [29]:

10 <= 10  # less than or equal to

Out[29]:

True

In [30]:

2**3 == 8 # equal to

Out[30]:

True

In [31]:

6/2 != 4  # not equal to

Out[31]:

True
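One caveat when using == with decimal numbers: floats are stored in binary, so some results are very slightly off and an exact comparison can fail unexpectedly. The sketch below shows the classic case and one possible alternative, math.isclose(), which compares within a small tolerance.

```python
import math

total = 0.1 + 0.2

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum is not exactly 0.3
exactly_equal = (total == 0.3)
print(exactly_equal)  # False

# math.isclose() compares within a small tolerance instead
close_enough = math.isclose(total, 0.3)
print(close_enough)   # True
```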

You can assign comparison results to a variable:

In [32]:

op = (2**3 == 8)
type(op)

Out[32]:

bool

As you can see, the outputs of these operations are all True/False (boolean) values. In Python, these objects are of class bool.

If you perform arithmetic operations on booleans, True is treated as 1 and False as 0.

In [33]:

True + 10

Out[33]:

11

In [34]:

False + 2

Out[34]:

2
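This behaviour is useful for counting how many values satisfy a condition. The sketch below uses a short list of made-up numbers (not taken from the survey data) and the built-in sum() function.

```python
# Hypothetical weekly alcohol units for five people (made-up values)
weekly_units = [0.0, 17.5, 1.9, 0.0, 22.0]

# Each comparison gives True or False
over_14 = [units > 14 for units in weekly_units]
print(over_14)       # [False, True, False, False, True]

# sum() treats True as 1 and False as 0, so it counts the True values
print(sum(over_14))  # 2
```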

Logical Operators in Python

You can use logical operators to combine conditional statements.

For example, x and y returns True if both x and y are True. If either x or y is False, the operation returns False. The and operator is called the logical AND operator.

In contrast, x or y returns True if either x or y is True. The operation therefore returns False only if both x and y are False. The or operator is called the logical OR operator.

In [35]:

x = True
y = False

In [36]:

x and y  # logical AND operator

Out[36]:

False

In [37]:

x or y   # logical OR operator

Out[37]:

True
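In practice, logical operators are most often used to combine comparisons. The following sketch uses an illustrative year value and also shows the not operator, which is not covered above and simply flips a boolean value.

```python
year = 2012  # illustrative value

# and: both comparisons must be True
in_range = (year >= 2008) and (year <= 2022)
print(in_range)       # True

# or: at least one comparison must be True
first_or_last = (year == 2008) or (year == 2022)
print(first_or_last)  # False

# not flips a boolean value
outside_range = not in_range
print(outside_range)  # False
```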

Save the "exercise-1" notebook and return to the Jupyter dashboard. Create a new notebook, naming it "scottish-health-survey". The rest of this tutorial will be completed within this notebook.

Install and Import Libraries

To work effectively with data in Python, you will need some additional libraries. The most commonly used ones are pandas for data manipulation, numpy for numerical operations, and matplotlib for data visualization. Follow these steps to install and import these libraries.

Step 1: Install the Libraries

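You can use pip (the Python package installer) to install these libraries. Open your terminal or command prompt and run the following commands. If you prefer to install from inside a notebook cell, use the %pip magic instead (for example, %pip install pandas):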

In [ ]:

pip install pandas
pip install numpy
pip install matplotlib

These commands will download and install the libraries from the Python Package Index (PyPI), where most Python packages can be found.
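To confirm that the installation worked, you can check whether Python is able to locate each library. The sketch below is one possible approach using only the standard library; is_installed is a hypothetical helper written for this example, not part of pip or pandas.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return find_spec(package_name) is not None


for name in ["pandas", "numpy", "matplotlib"]:
    status = "installed" if is_installed(name) else "not found"
    print(f"{name}: {status}")
```

If a library is reported as not found, re-run the matching install command and restart the notebook kernel before importing it.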

Step 2: Import the Libraries

Once the libraries are installed, you can import them into the "scottish-health-survey" notebook using the following commands:

In [38]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We use the as keyword to assign aliases to libraries: pd for pandas, np for numpy, and plt for matplotlib.pyplot. These conventional aliases shorten the code that references the libraries and make it easier to read.

Import the dataset

For this tutorial, we will be using a dataset collected by the Scottish Health Survey, containing data on various health indicators, behaviors, and outcomes for the Scottish population.

Download the CSV file "scottish_health_data.csv" from the MANTRA website and move it to your working directory.

Next, import the dataset using the read_csv() function from the pandas library and store the result in a variable named health_survey.

In [39]:

health_survey = pd.read_csv("scottish_health_data.csv")

The health_survey DataFrame should now be loaded into your Python environment.
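If read_csv() raises a FileNotFoundError instead, the most likely cause is that the CSV file is not in the notebook's working directory. The sketch below is one way to check; check_data_file is a hypothetical helper written for this example and relies only on the standard library.

```python
from pathlib import Path


def check_data_file(filename, folder="."):
    """Report whether a file exists in the given folder (hypothetical helper)."""
    path = Path(folder) / filename
    if path.is_file():
        return f"Found {path.name}"
    return f"{filename} not found in {Path(folder).resolve()}"


# "." means the current working directory
print(check_data_file("scottish_health_data.csv"))
```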

pandas provides several methods and attributes that allow you to do some initial exploration of the data and get a feel for the dataset's features.

For example, the head() method displays the first five rows of the dataset.

In [40]:

health_survey.head()

Out[40]:

	id	year	sex	fruit_vegetable_consumption_mean_daily_portions	alcohol_consumption_mean_weekly_units	mental_wellbeing	self_assessed_health
0	1	2021	Female	3.40	0.00	3.40	72.2
1	2	2012	Male	3.19	17.50	3.19	76.8
2	3	2019	Female	3.30	1.91	3.30	78.3
3	4	2009	Male	3.30	0.00	3.30	71.2
4	5	2010	Female	3.20	22.04	3.20	71.2

Our dataset provides details on the year the data was collected, the sex of the individuals, their average daily consumption of fruits and vegetables, their average weekly alcohol consumption, their mental wellbeing scores, and their self-assessed general health status.

The tail() method displays the last five rows of the dataset:

In [41]:

health_survey.tail()

Out[41]:

	id	year	sex	fruit_vegetable_consumption_mean_daily_portions	alcohol_consumption_mean_weekly_units	mental_wellbeing	self_assessed_health
626	627	2011	Female	3.18	9.9	3.18	71.5
627	628	2019	Female	3.20	14.8	3.20	70.4
628	629	2010	Female	3.21	8.6	3.21	72.4
629	630	2012	Female	3.15	16.9	3.15	70.0
630	11	2008	Male	3.16	20.3	3.16	77.0
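Both head() and tail() accept an optional number of rows to display. The sketch below illustrates this on a small DataFrame rebuilt by hand from four of the columns in the head() output shown earlier, so that it runs without the CSV file; the name sample is an assumption made for this example.

```python
import pandas as pd

# The first five rows from the head() output, rebuilt by hand for illustration
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "year": [2021, 2012, 2019, 2009, 2010],
    "sex": ["Female", "Male", "Female", "Male", "Female"],
    "alcohol_consumption_mean_weekly_units": [0.00, 17.50, 1.91, 0.00, 22.04],
})

first_three = sample.head(3)  # first three rows
last_two = sample.tail(2)     # last two rows

print(first_three)
print(last_two)
```

With the full dataset loaded, the equivalent calls would be health_survey.head(3) and health_survey.tail(2).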

Use the shape attribute to find out the dimensions of the dataset:

In [42]:

health_survey.shape

Out[42]:

(631, 7)

This tells us that the dataset has 631 rows and 7 columns.
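Because shape is an ordinary Python tuple, its two values can be unpacked into separate variables. The sketch below works from the tuple shown above rather than the DataFrame itself, so it runs on its own; the variable names are illustrative.

```python
# The tuple returned by health_survey.shape above
shape = (631, 7)

# Unpack the tuple into two named values
n_rows, n_columns = shape
print(f"{n_rows} rows and {n_columns} columns")  # 631 rows and 7 columns

# Individual elements can also be accessed by position
print(shape[0])  # 631
```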

Use the info() method to get a concise summary of the dataset, including the data type of each column:

In [43]:

health_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631 entries, 0 to 630
Data columns (total 7 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   id                                               631 non-null    int64  
 1   year                                             631 non-null    int64  
 2   sex                                              631 non-null    object 
 3   fruit_vegetable_consumption_mean_daily_portions  629 non-null    float64
 4   alcohol_consumption_mean_weekly_units            629 non-null    float64
 5   mental_wellbeing                                 626 non-null    float64
 6   self_assessed_health                             628 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 34.6+ KB

The columns "fruit_vegetable_consumption_mean_daily_portions", "alcohol_consumption_mean_weekly_units", "mental_wellbeing", and "self_assessed_health" are all floating-point data types (float64), "id" and "year" are integer data types (int64), and "sex" is a string data type (object). We can also see that there are some missing values in the floating-point columns. We will address these later.
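The number of missing values in each column follows directly from the info() output: it is the total number of rows minus the non-null count. The sketch below does this arithmetic with the figures copied from the output above. With the DataFrame loaded, health_survey.isna().sum() should report the same per-column counts.

```python
total_rows = 631  # from "RangeIndex: 631 entries"

# Non-null counts copied from the info() output above
non_null_counts = {
    "fruit_vegetable_consumption_mean_daily_portions": 629,
    "alcohol_consumption_mean_weekly_units": 629,
    "mental_wellbeing": 626,
    "self_assessed_health": 628,
}

# Missing values = total rows minus non-null values
missing_counts = {
    column: total_rows - count for column, count in non_null_counts.items()
}

for column, missing in missing_counts.items():
    print(f"{column}: {missing} missing")
```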

Finally, use the describe() method to get summary statistics for each column in the dataset:

In [44]:

health_survey.describe()

Out[44]:

	id	year	fruit_vegetable_consumption_mean_daily_portions	alcohol_consumption_mean_weekly_units	mental_wellbeing	self_assessed_health
count	631.000000	631.000000	629.000000	629.000000	626.000000	628.000000
mean	315.017433	2014.632330	3.209300	15.095246	3.209265	73.832643
std	182.268644	4.261064	0.087552	13.291717	0.087702	2.558671
min	1.000000	2008.000000	2.700000	0.000000	2.700000	67.000000
25%	157.500000	2011.000000	3.180000	9.300000	3.180000	72.000000
50%	315.000000	2014.000000	3.200000	13.000000	3.200000	74.000000
75%	472.500000	2018.000000	3.240000	16.700000	3.240000	75.700000
max	630.000000	2022.000000	3.600000	126.900000	3.600000	79.500000

For numeric columns (both integer and floating-point), this summary provides the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values of each column. Non-numeric columns such as sex are left out by default.
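These statistics can already hint at unusual values. As a rough sketch, the code below applies a common rule of thumb (Tukey's rule, which is not part of the original exercise) to the alcohol_consumption_mean_weekly_units figures copied from the table above: values more than 1.5 interquartile ranges above the 75th percentile are flagged as potential outliers.

```python
# Values copied from the alcohol_consumption_mean_weekly_units column above
q1 = 9.3         # 25th percentile
q3 = 16.7        # 75th percentile
maximum = 126.9  # largest value

# Interquartile range: the spread of the middle 50% of values
iqr = q3 - q1

# Tukey's rule flags values above Q3 + 1.5 * IQR
upper_fence = q3 + 1.5 * iqr

print(round(iqr, 2))          # 7.4
print(round(upper_fence, 2))  # 27.8
print(maximum > upper_fence)  # True
```

The maximum of 126.9 units lies far above this threshold, so it may be worth a closer look before any analysis.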