Exercise 4: Data Visualization and Plotting in Python¶

In this final part of the tutorial, we will learn how to create plots to visualize our data. We will be using matplotlib, a powerful plotting library in Python that we have already imported in Exercise 1. This section will guide you through creating basic and advanced plots, customizing them, and saving them in various formats. Let's get started!

Basic Plots¶

Let's start by creating some basic plots using the plot() function in matplotlib. We want to see how mean fruit and vegetable consumption has changed over the years. To visualize this, we will create a line plot with year on the x-axis and mean_fruit_vegetables from the health_survey_summary DataFrame on the y-axis.
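
The plotting cells in this exercise read from a DataFrame named health_survey_summary, presumably built in an earlier exercise. If it is not in memory, a small stand-in like the one sketched here lets you run the steps that follow. The numbers are invented purely for illustration; only the column names (year, mean_fruit_vegetables, mean_alcohol) are taken from the cells in this exercise.

```python
import pandas as pd

# Stand-in for health_survey_summary. The values are made up for illustration;
# only the column names come from the plotting cells in this exercise.
# Skip this cell if you already have the real summary in memory.
health_survey_summary = pd.DataFrame({
    'year': [2016, 2017, 2018, 2019, 2020, 2021, 2022],
    'mean_fruit_vegetables': [3.2, 3.3, 3.1, 3.4, 3.5, 3.3, 3.2],
    'mean_alcohol': [12.5, 12.1, 11.8, 11.9, 12.4, 11.6, 11.2],
})

print(health_survey_summary.head())
```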

Step 1: Set the Figure Size¶

First, let's set the figure size to determine how large the plot should be on the page:

In [ ]:
plt.figure(figsize=(10, 5))

By setting figsize=(10, 5), we specify that the plot should be 10 inches wide and 5 inches tall.
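
As a quick check of what figsize does, the sketch below creates an empty figure and reads its size back. The size is given in inches; the size in pixels also depends on the figure's dpi (dots per inch) setting, which varies between environments.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 5))   # (width, height) in inches
width, height = (float(v) for v in fig.get_size_inches())
print(width, height)                # 10.0 5.0
print(fig.dpi)                      # dots per inch; depends on your settings
plt.close(fig)                      # discard the empty figure
```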

Step 2: Create a Line Plot¶

Then, use the plot() function to generate the line plot. It is called as plot(x, y), with the x-axis values first and the y-axis values second. Use plt.show() to display the graph:

In [2]:
plt.figure(figsize=(10, 5))
plt.plot(health_survey_summary['year'], health_survey_summary['mean_fruit_vegetables'])
plt.show()

To add circular markers at each data point for better visibility, we can add an additional parameter marker='o':

In [3]:
plt.figure(figsize=(10, 5))
plt.plot(health_survey_summary['year'], health_survey_summary['mean_fruit_vegetables'], marker='o')
plt.show()

Step 3: Add Title and Labels¶

Add a title using plt.title() and labels using plt.xlabel() and plt.ylabel():

In [4]:
plt.figure(figsize=(10, 5))
plt.plot(health_survey_summary['year'], health_survey_summary['mean_fruit_vegetables'], marker='o')
plt.title('Mean Fruit & Vegetable Consumption Over Years')
plt.xlabel('Year')
plt.ylabel('Mean Fruit & Vegetable Consumption')
plt.show()

Step 4: Plot Multiple Lines¶

We can draw multiple lines on the same figure by calling plot() more than once before plt.show(). Let's try this by adding a line for mean alcohol consumption, giving it the marker x:

In [5]:
plt.figure(figsize=(10, 5))
plt.plot(health_survey_summary['year'], health_survey_summary['mean_fruit_vegetables'], marker='o')
plt.plot(health_survey_summary['year'], health_survey_summary['mean_alcohol'], marker='x')
plt.title('Mean Fruit & Vegetable vs Alcohol Consumption Over Years')
plt.xlabel('Year')
plt.ylabel('Mean Consumption')  # the y-axis now covers both series
plt.show()

Step 5: Add a Legend¶

Let's add a legend so we know which colour corresponds to which plot. To do this, we need to add a parameter label= to the plot() functions and then call plt.legend():

In [6]:
plt.figure(figsize=(10, 5))
plt.plot(health_survey_summary['year'], health_survey_summary['mean_fruit_vegetables'], marker='o', label='Fruit and Vegetables')
plt.plot(health_survey_summary['year'], health_survey_summary['mean_alcohol'], marker='x', label='Alcohol')
plt.title('Mean Fruit & Vegetable vs Alcohol Consumption Over Years')
plt.xlabel('Year')
plt.ylabel('Mean Consumption')  # the y-axis now covers both series
plt.legend()
plt.show()
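
The following self-contained sketch repeats the legend step with a few invented values, so it runs without the survey data. It also passes the optional loc and title arguments to plt.legend() to show how the legend's position and heading can be controlled.

```python
import matplotlib.pyplot as plt

# Invented values, for illustration only.
years = [2018, 2019, 2020, 2021]
fruit_veg = [3.1, 3.4, 3.5, 3.3]
alcohol = [11.8, 11.9, 12.4, 11.6]

plt.figure(figsize=(10, 5))
plt.plot(years, fruit_veg, marker='o', label='Fruit and Vegetables')
plt.plot(years, alcohol, marker='x', label='Alcohol')
# loc and title are optional; without loc, matplotlib picks a position itself.
legend = plt.legend(loc='upper left', title='Mean consumption')

# The legend entries come from the label= arguments, in plotting order.
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)  # ['Fruit and Vegetables', 'Alcohol']
```

In a notebook the figure renders at the end of the cell; in a script, finish with plt.show() as in the cells above.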

Step 6: Set Axis Limits¶

You can set the axis limits to customize the range of values displayed on the x-axis and y-axis using the plt.xlim() and plt.ylim() functions. Let's limit the x-axis to the period between 2016 and 2022:

In [7]:
plt.figure(figsize=(10, 5))
plt.plot(health_survey_summary['year'], health_survey_summary['mean_fruit_vegetables'], marker='o', label='Fruit and Vegetables')
plt.plot(health_survey_summary['year'], health_survey_summary['mean_alcohol'], marker='x', label='Alcohol')
plt.title('Mean Fruit & Vegetable vs Alcohol Consumption Over Years')
plt.xlabel('Year')
plt.ylabel('Mean Consumption')  # the y-axis now covers both series
plt.xlim(2016, 2022)  # Set x-axis limits
plt.legend()
plt.show()
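
Here is a small sketch of both limit functions, using invented values. Setting a limit only changes the visible range; the plotted data itself is untouched. When called without arguments, plt.xlim() and plt.ylim() return the current limits, which is a handy way to check them.

```python
import matplotlib.pyplot as plt

# Invented values, for illustration only.
years = [2014, 2016, 2018, 2020, 2022, 2024]
values = [3.0, 3.2, 3.1, 3.4, 3.3, 3.5]

plt.figure(figsize=(10, 5))
plt.plot(years, values, marker='o')
plt.xlim(2016, 2022)   # show only 2016 to 2022 on the x-axis
plt.ylim(0, 5)         # start the y-axis at zero

# Called without arguments, both functions return the current limits.
x_limits = plt.xlim()
y_limits = plt.ylim()
print(x_limits, y_limits)
```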

Step 7: Change the Colours¶

Finally, you can customize the colours of your plots to make your visualizations more appealing. You can specify the colour of each line in a plot using the color parameter in plot:

In [8]:
plt.figure(figsize=(10, 5))
plt.plot(health_survey_summary['year'], health_survey_summary['mean_fruit_vegetables'], marker='o', label='Fruit and Vegetables', color='green')
plt.plot(health_survey_summary['year'], health_survey_summary['mean_alcohol'], marker='x', label='Alcohol', color='red')
plt.title('Mean Fruit & Vegetable vs Alcohol Consumption Over Years')
plt.xlabel('Year')
plt.ylabel('Mean Consumption')  # the y-axis now covers both series
plt.legend()
plt.show()
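
Besides colour names such as 'green', the color parameter also accepts hex codes. The sketch below, which uses invented values, draws one line with each form and uses matplotlib.colors.to_hex to confirm that both resolve to the same colour.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

# Invented values, for illustration only.
years = [2018, 2019, 2020, 2021]
fruit_veg = [3.1, 3.4, 3.5, 3.3]
shifted = [v + 1 for v in fruit_veg]

plt.figure(figsize=(10, 5))
named_line, = plt.plot(years, fruit_veg, color='green')    # colour name
hex_line, = plt.plot(years, shifted, color='#008000')      # hex code

# Both lines end up with the same colour.
named_hex = to_hex(named_line.get_color())
code_hex = to_hex(hex_line.get_color())
print(named_hex, code_hex)  # #008000 #008000
```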

In addition to line plots, matplotlib offers a wide variety of plot types to visualize data in different ways. Here are some common types of plots:

  • Bar Plot: bar(x, height)
  • Scatter Plot: scatter(x, y)
  • Histogram: hist(x)
  • Box Plot: boxplot(x)
  • Pie Chart: pie(x)

Each type of plot serves a different purpose and can provide unique insights into your data. For more detailed information and examples of the different types of plots available in matplotlib, you can refer to the official documentation: https://matplotlib.org/stable/plot_types/index.html
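
As a brief illustration of two of the plot types listed above, the sketch below draws a bar plot and a histogram side by side. The data is invented, and the example uses plt.subplots, which returns Axes objects whose bar() and hist() methods take the same arguments as the pyplot functions of the same name.

```python
import matplotlib.pyplot as plt

# Invented values, for illustration only.
categories = ['A', 'B', 'C']
heights = [4, 7, 5]
samples = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# One row, two columns of plots in a single figure.
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 5))

ax_bar.bar(categories, heights)                        # bar(x, height)
ax_bar.set_title('Bar Plot')

counts, bin_edges, _ = ax_hist.hist(samples, bins=4)   # hist(x)
ax_hist.set_title('Histogram')

# Every sample falls into exactly one bin, so the counts add up to 10.
print(counts.sum())
```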

Scatter Plots¶

matplotlib allows us to create advanced plots that can provide deeper insights into our data. Let's start by creating a scatter plot to visualize the relationship between mean alcohol consumption and mental wellbeing.

To create a scatter plot, follow the steps for creating a basic plot, but use the scatter() function instead of plot():

In [9]:
plt.figure(figsize=(10, 5))
plt.scatter(
    health_survey_clean['alcohol_consumption_mean_weekly_units'],
    health_survey_clean['mental_wellbeing'],
    label='Data Points')
plt.title('Scatter Plot of Mean Alcohol Consumption vs Mental Wellbeing')
plt.xlabel('Mean Alcohol Consumption (Weekly Units)')
plt.ylabel('Mean Mental Wellbeing')
plt.show()

To enhance the visualization, we can add colour to the plot by applying a colourmap. The c= parameter allows us to specify which variable to use for colouring the points, while the cmap= parameter lets us choose the colourmap. This will add a gradient of colours to the points, providing an additional dimension to the data representation.

We can use the colourmap viridis to add a gradient of colours to the scatter plot based on the values of mental_wellbeing:

In [10]:
plt.figure(figsize=(10, 5))
plt.scatter(
    health_survey_clean['alcohol_consumption_mean_weekly_units'],
    health_survey_clean['mental_wellbeing'],
    c=health_survey_clean['mental_wellbeing'],
    cmap='viridis',
    label='Data Points'
)
plt.title('Scatter Plot of Mean Alcohol Consumption vs Mental Wellbeing')
plt.xlabel('Mean Alcohol Consumption (Weekly Units)')
plt.ylabel('Mean Mental Wellbeing')
plt.show()
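
A colour gradient is easier to read with a scale next to it. One possible addition, sketched here with invented values, is plt.colorbar(), which takes the object returned by plt.scatter() and draws the mapping from colours back to the values passed as c=.

```python
import matplotlib.pyplot as plt

# Invented values, for illustration only.
alcohol_units = [0, 5, 10, 20, 35, 50]
wellbeing = [3.4, 3.3, 3.3, 3.1, 3.2, 3.0]

plt.figure(figsize=(10, 5))
points = plt.scatter(alcohol_units, wellbeing, c=wellbeing, cmap='viridis')
# plt.colorbar draws the scale that maps colours back to the values of c.
colourbar = plt.colorbar(points, label='Mental Wellbeing')
plt.xlabel('Mean Alcohol Consumption (Weekly Units)')
plt.ylabel('Mean Mental Wellbeing')
```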

Trend Lines¶

To further understand the relationship between alcohol consumption and mental wellbeing, we can add a trend line to the scatter plot using polyfit and poly1d from the NumPy package.

Step 1: Calculate the Coefficients for the Trend Line¶

First, we need to calculate the coefficients for the trend line using polyfit. The polyfit function fits a polynomial of a specified degree to a set of data points. It uses the least squares method to find the coefficients of the polynomial that best fits the data. The function takes the following structure:

np.polyfit(x, y, degree)

  • x: The x-coordinates of the data points.
  • y: The y-coordinates of the data points.
  • degree: The degree of the polynomial to fit (1 for linear, 2 for quadratic, etc.).

The output of polyfit is an array of polynomial coefficients. Let's fit a first-degree (linear) polynomial to the alcohol consumption and mental wellbeing data and see what coefficients it generates:

In [11]:
coef = np.polyfit(
    health_survey_clean['alcohol_consumption_mean_weekly_units'],
    health_survey_clean['mental_wellbeing'], 1)
print(coef)
[-0.0054244   3.29097147]

Since this is a linear fit, polyfit returns an array of two coefficients, ordered from the highest power to the lowest: the slope (about -0.0054) first, then the intercept (about 3.29).
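
To see the coefficient order on a case where the answer is known in advance, the sketch below fits a line to points that lie exactly on y = 2x + 1. The data is constructed for this illustration and is not part of the survey.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0   # points that lie exactly on y = 2x + 1

# For degree 1, polyfit returns [slope, intercept] (highest power first).
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 6), round(intercept, 6))  # 2.0 1.0
```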

Step 2: Generate the Polynomial Function¶

Next, we use these coefficients to create our mathematical equation for the trend line using np.poly1d. This function takes an array of polynomial coefficients and returns a polynomial function.

In [12]:
p = np.poly1d(coef)
print(p)
 
-0.005424 x + 3.291

We can see that the equation which defines our trend line is:

$$y = -0.005424 x + 3.291$$

The result p is a polynomial function that you can use to evaluate the fitted polynomial at any x-value.
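
For instance, p can be called with a single number or with several x-values at once. The sketch below rebuilds p from the coefficients printed by polyfit in Step 1, so it runs without the survey data.

```python
import numpy as np

# Coefficients copied from the printed polyfit output in Step 1.
p = np.poly1d([-0.0054244, 3.29097147])

at_zero = p(0)             # equals the intercept
at_ten = p(10)             # fitted wellbeing at 10 weekly units
at_several = p([0, 10, 20])   # several x-values at once

print(at_zero, at_ten)
print(at_several)
```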

Step 3: Generate a Smooth Range of X-Values¶

Now that we have our polynomial function p, we can plot the line of best fit on the graph. To ensure a smooth curve, we need to generate a range of x-values and evaluate the polynomial at these points.

Use np.linspace to generate evenly spaced x-values from the minimum to the maximum alcohol consumption value. When the number of points is not specified, np.linspace returns 50:

In [13]:
x_range = np.linspace(
    health_survey_clean['alcohol_consumption_mean_weekly_units'].min(),
    health_survey_clean['alcohol_consumption_mean_weekly_units'].max()
)

By generating these x-values, we can then use our polynomial function p to compute the corresponding y-values, creating a smooth line that represents the trend across the entire range of data.
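
A small sketch of np.linspace on its own, using arbitrary endpoints: the num argument sets how many evenly spaced points are returned, including both endpoints, and it defaults to 50 when omitted.

```python
import numpy as np

five_points = np.linspace(0, 10, num=5)
print(five_points.tolist())   # [0.0, 2.5, 5.0, 7.5, 10.0]

default_points = np.linspace(0, 10)   # num defaults to 50
print(len(default_points))            # 50
```

For a straight line two points would be enough, but a denser range matters once the fitted polynomial is curved.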

Step 4: Plot the Trend Line¶

Finally, use plt.plot to plot the trend line on top of the scatter plot, using x_range and the evaluated values of the polynomial function p(x_range) as arguments:

In [14]:
plt.figure(figsize=(10, 5))

plt.scatter(
    health_survey_clean['alcohol_consumption_mean_weekly_units'],
    health_survey_clean['mental_wellbeing'],
    c=health_survey_clean['mental_wellbeing'],
    cmap='viridis',
    label='Data Points'
)

# Plot the trendline
plt.plot(
    x_range,
    p(x_range),
    color='red', linestyle='--', label='Trendline'
)

plt.title('Scatter Plot with Trendline of Mean Alcohol Consumption vs Mental Wellbeing')
plt.xlabel('Mean Alcohol Consumption (Weekly Units)')
plt.ylabel('Mean Mental Wellbeing')
plt.legend()
plt.show()

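The trendline slopes downward, indicating a negative relationship between alcohol consumption and mental wellbeing: mean mental wellbeing decreases as alcohol consumption increases. The slope is small, at about -0.0054 per weekly unit, so the fitted line falls by roughly 0.05 for every 10 additional weekly units.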

Step 5: Fit a Second-Degree Polynomial¶

In the previous step, we added a linear trend line to the scatter plot. However, the relationship between alcohol consumption and mental wellbeing might not be strictly linear. To better capture this relationship, we can fit a higher-degree polynomial, such as a second-degree polynomial (quadratic).

To fit a second-degree polynomial, we can use np.polyfit with a degree of 2:

In [15]:
coef_quad = np.polyfit(
    health_survey_clean['alcohol_consumption_mean_weekly_units'],
    health_survey_clean['mental_wellbeing'], 2)
print(coef_quad)
[ 3.49456328e-05 -8.32386556e-03  3.32052856e+00]

Next, create the polynomial function using np.poly1d:

In [16]:
p_quad = np.poly1d(coef_quad)
print(p_quad)
           2
3.495e-05 x - 0.008324 x + 3.321

The quadratic equation which defines the new line of best fit is:

$$y = 0.00003495 x^2 - 0.008324 x + 3.321$$
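
To get a feel for how much the quadratic term changes the fit, one option is to evaluate both fitted polynomials at a few x-values. The sketch below rebuilds them from the coefficients printed in this exercise; the x-values 0, 20 and 40 are arbitrary choices for illustration.

```python
import numpy as np

# Coefficients copied from the two printed polyfit outputs in this exercise.
p_lin = np.poly1d([-0.0054244, 3.29097147])
p_quad = np.poly1d([3.49456328e-05, -8.32386556e-03, 3.32052856e+00])

# Compare the linear and quadratic fits at a few arbitrary x-values.
for units in (0, 20, 40):
    linear_value = float(p_lin(units))
    quadratic_value = float(p_quad(units))
    print(units, round(linear_value, 3), round(quadratic_value, 3))
```

At these points the two fits differ by roughly 0.03 or less, which suggests the quadratic term adds only a slight curvature.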

Finally, plot the quadratic trend line on top of the scatter plot:

In [17]:
plt.figure(figsize=(10, 5))
plt.scatter(
    health_survey_clean['alcohol_consumption_mean_weekly_units'],
    health_survey_clean['mental_wellbeing'],
    c=health_survey_clean['mental_wellbeing'],
    cmap='viridis',
    label='Data Points'
)

plt.plot(
    x_range,
    p_quad(x_range),
    color='red', linestyle='--', label='Quadratic Trendline'
)

plt.title('Scatter Plot with Quadratic Trendline of Mean Alcohol Consumption vs Mental Wellbeing')
plt.xlabel('Mean Alcohol Consumption (Weekly Units)')
plt.ylabel('Mean Mental Wellbeing')
plt.legend()
plt.show()
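
The introduction to this exercise mentions saving plots in various formats. A minimal sketch of one way to do this uses plt.savefig(), where the file extension selects the format. The file names and the plotted values here are placeholders chosen for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 5))
plt.plot([2018, 2019, 2020], [3.1, 3.4, 3.5], marker='o')   # invented values
plt.title('Example Plot')

# A temporary folder is used so the sketch does not write into your working
# directory; in practice, pass your own path instead.
output_dir = tempfile.mkdtemp()
png_path = os.path.join(output_dir, 'example_plot.png')
pdf_path = os.path.join(output_dir, 'example_plot.pdf')

# The file extension selects the output format.
plt.savefig(png_path, dpi=300, bbox_inches='tight')   # raster image
plt.savefig(pdf_path, bbox_inches='tight')            # vector format
```

Call plt.savefig() before plt.show(): in a notebook the figure is typically closed once it has been shown, so saving afterwards may produce an empty image.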

Summary¶

In this tutorial, we covered essential data handling operations in Python, including importing and exporting data, cleaning and preprocessing datasets, performing data transformations and manipulations, and creating visualizations using numpy, pandas, and matplotlib. I hope you found the tutorial both enjoyable and useful. Good luck and happy coding!

Library documentation¶

  • Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html#user-guide
  • NumPy user guide: https://numpy.org/doc/stable/user/index.html#user
  • Matplotlib user guide: https://matplotlib.org/stable/users/index.html

Further Resources¶

  • https://www.kaggle.com/learn/
  • https://www.datacamp.com/courses/
  • https://realpython.com/python-for-data-analysis/