import pandas as pd
import plotly.express as px
Some important imports
To work with a pandas data frame, we need to import the pandas library. We also need a library for visualization in Python, such as plotly.
We are working with the Palmer Penguins data set, which we load from its URL.
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguins.head()
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
| 3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
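The preview above already contains a row (index 3) with no measurements at all, and a row with missing coordinates cannot be drawn as a point. As a minimal sketch, not part of the original workflow, the following counts missing values in the two culmen columns and keeps only the complete rows. The small `sample` frame is an assumption made so the example runs on its own; it copies the five culmen values shown in the table.

```python
import numpy as np
import pandas as pd

# The five rows shown above, restricted to the two culmen columns.
# (A stand-in for the full `penguins` data frame, so the example is self-contained.)
sample = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 39.5, 40.3, np.nan, 36.7],
    "Culmen Depth (mm)": [18.7, 17.4, 18.0, np.nan, 19.3],
})

cols = ["Culmen Length (mm)", "Culmen Depth (mm)"]

# Number of missing values in each of the two columns.
missing_per_column = sample[cols].isna().sum()
print(missing_per_column)

# Keep only rows where both measurements are present.
complete = sample.dropna(subset=cols)
print(len(complete), "of", len(sample), "rows have both measurements")
```

The same two calls can be applied to `penguins` directly to see how many rows of the full data set lack culmen measurements.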
Data Visualization
We will use the express module from the plotly library to visualize culmen length against culmen depth, with one panel per island. In our PIC16A project we found that these features were the best candidates based on their linear coefficients. We pass the data frame and the x and y coordinates to the px.scatter function; x and y are the names of columns in our data frame. Since we want to predict species from these features, we give each species a distinct color by setting color to the "Species" column. We set width and height to keep the figure readable, and opacity to 0.5 so that overlapping points remain visible. facet_col creates one subplot for each island, and title sets the title of the figure.
fig = px.scatter(data_frame = penguins,
                 x = "Culmen Length (mm)",
                 y = "Culmen Depth (mm)",
                 color = "Species",
                 width = 1000,
                 height = 300,
                 opacity = 0.5,
                 facet_col = "Island",
                 title = "Culmen length vs Culmen depth in each island"
                )
# reduce whitespace
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
# show the plot
fig.show()
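The legend of this figure uses the full strings from the "Species" column, which are long for a figure that is only 300 pixels high. One optional tweak, not part of the original code, is to add a column holding just the first word of each species name and color by that column instead. The sketch below uses a tiny stand-in data frame containing the three species labels of the data set; the column name "Species (short)" is hypothetical.

```python
import pandas as pd

# Full species labels as they appear in the "Species" column.
# (A stand-in for `penguins`, so the example is self-contained.)
species = pd.DataFrame({
    "Species": [
        "Adelie Penguin (Pygoscelis adeliae)",
        "Chinstrap penguin (Pygoscelis antarctica)",
        "Gentoo penguin (Pygoscelis papua)",
    ]
})

# Split each label on whitespace and keep the first word.
# "Species (short)" is a hypothetical column name chosen for this sketch.
species["Species (short)"] = species["Species"].str.split().str.get(0)

short_labels = species["Species (short)"].tolist()
print(short_labels)
```

Applying the same line to `penguins` and passing color = "Species (short)" to px.scatter should give a more compact legend without changing the plotted points.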