First, i trust everyone is safe. Second, i hope people are recovering somewhat from the SVB situation. We are at the end of an era, cycle or epoch; take your pick. Third, i felt like picking a Python function that is simple in nature but very helpful.
The function is
pandas.DataFrame.describe(). i’ve previously written about other introspection libraries like DABL; however, this one is rather simple and built in. Actually, i had never used it before. i was working on some other code as a hobby in the area of transfer learning, was playing around with some data, and decided to use the breast cancer data from the sklearn library, which is much like the iris data used for canonical modeling and comparison. Most machine learning is data cleansing and feature selection, so let's start with something we know.
Breast cancer is the most commonly diagnosed cancer in women worldwide, with an estimated 2.3 million new cases in 2020. Early detection is key to improving survival rates, and machine learning algorithms can aid in diagnosing and treating breast cancer. In this blog, we will explore how to load and analyze the breast cancer dataset using the
scikit-learn library in Python.
The breast cancer dataset is included in
scikit-learn's datasets module, which contains a variety of well-known datasets for machine learning. The features describe the characteristics of the cell nuclei present in the image. We can load the dataset using the
load_breast_cancer function, which returns a dictionary-like object containing the data and metadata about the dataset.
It has been surmised that machine learning is mostly data exploration and data cleaning.
from sklearn.datasets import load_breast_cancer
import pandas as pd
#Load the breast cancer dataset
data = load_breast_cancer()
The data object returned by
load_breast_cancer contains the feature data and the target variable. The feature data holds 30 numeric features, such as radius, texture, and symmetry, computed from digitized images of fine needle aspirates (FNA) of breast masses. The target variable is binary, with a value of 0 indicating a malignant tumor and a value of 1 indicating a benign tumor.
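A quick way to check what the returned object carries, and which integer stands for which diagnosis, is to look at its keys, the shape of the feature matrix, and target_names. This is a minimal sketch; the variable name label_map is just for illustration.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# The returned Bunch behaves like a dictionary; list what it carries
print(sorted(data.keys()))

# Feature matrix: one row per sample, one column per feature
print(data.data.shape)

# target_names[i] is the label for target value i
label_map = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_map)
```

Reading the label mapping from the dataset itself avoids guessing which class is coded as 0.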
We can convert the feature data and target variable into a pandas dataframe using the
DataFrame constructor from the pandas library. We also add a column to the dataframe containing the target variable.
#Convert the data to a pandas dataframe
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
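As an alternative sketch, recent scikit-learn releases (0.23 or newer, assuming pandas is installed) can hand back a ready-made dataframe through the as_frame argument, with the target already attached as the last column. The names bunch and df_alt are only illustrative.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to build pandas objects for us
bunch = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a 'target' column
df_alt = bunch.frame

print(df_alt.shape)
print(df_alt.columns[-1])
```

Either route ends with the same 569 rows; building the dataframe by hand just makes each step visible.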
Finally, we can use the
describe method of the pandas dataframe to get a summary of the dataset. The
describe method returns a table containing the count, mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum for each numeric column, which here includes the target variable.
#Use the describe() method to get a summary of the dataset
print(df.describe())
The output of the
describe method is as follows:
       mean radius  mean texture  ...  worst symmetry      target
count   569.000000    569.000000  ...      569.000000  569.000000
mean     14.127292     19.289649  ...        0.290076    0.627417
std       3.524049      4.301036  ...        0.061867    0.483918
min       6.981000      9.710000  ...        0.156500    0.000000
25%      11.700000     16.170000  ...        0.250400    0.000000
50%      13.370000     18.840000  ...        0.282200    1.000000
75%      15.780000     21.800000  ...        0.317900    1.000000
max      28.110000     39.280000  ...        0.663800    1.000000
[8 rows x 31 columns]
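With 31 columns, the printed summary hides most of them behind the "..." marker. One way around that, sketched here, is to transpose the summary so each feature becomes a row. The percentiles chosen (5%, 50%, 95%) are only an example of the percentiles parameter, not a recommendation.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Custom percentiles replace the default 25%/50%/75%;
# .T flips the table so each feature is a row
summary = df.describe(percentiles=[0.05, 0.5, 0.95]).T

# Temporarily lift the row limit so every feature is printed
with pd.option_context('display.max_rows', None, 'display.width', 120):
    print(summary)
```

The transposed table has one row per column of the dataframe, which is usually easier to scan than a wide, truncated one.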
From the summary statistics, we can see that the features sit on very different scales: the mean radius feature ranges from 6.981 to 28.11, mean texture ranges from 9.71 to 39.28, and worst symmetry stays between 0.1565 and 0.6638. We can also see that the classes are only moderately imbalanced: the target mean of 0.627 means 62.7% of the tumors carry the label 1, which in this dataset is benign.
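To check the class balance directly instead of reading it off the target mean, we can count the labels and then run describe per class. This is a rough sketch; the diagnosis column and the names dictionary are additions for illustration and are not part of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Build the code -> label mapping from the dataset itself
names = dict(enumerate(data.target_names.tolist()))
df['diagnosis'] = df['target'].map(names)

# Share of each class in the data
class_share = df['diagnosis'].value_counts(normalize=True)
print(class_share)

# describe() also works per group, here for a single feature
per_class = df.groupby('diagnosis')['mean radius'].describe()
print(per_class)
```

Grouping before describing shows how a feature such as mean radius differs between the two diagnoses, which the overall summary cannot show.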
Pretty nice utility.
Then again in looking at this data one would think we could get to first principles engineering and root causes and make it go away? This directly affects motherhood which i still believe is the hardest job in humanity. Makes you wonder where all the money goes?
#iwishyouwater <- Free Diver Steph who is also a mom hunting pelagics on #onebreath
Muzak To Blog By: Peter Gabriel’s “Peter Gabriel 3: Melt” (remastered). He is coming out with a new album. Games Without Frontiers and Intruder are timeless. i applied long ago to work at Real World Studios and received the nicest rejection letter.