First i trust everyone is safe. Second i hope people are recovering somewhat from the SVB situation. We are at the end of an era, cycle or epoch; take your pick. Third i felt like picking a Python function that was simple in nature but very helpful.
The function is pandas.describe(). i’ve previously written about other introspection libraries like DABL; however, this one is rather simple and works in place. Actually i had never utilized it before. i was working on some other code as a hobby in the areas of transfer learning and was playing around with some data and decided to use the breast cancer data from the sklearn library, which is much like the iris data used for canonical modeling and comparison. Most machine learning is data cleansing and feature selection, so let’s start with something we know.
Breast cancer is the second most common cancer in women worldwide, with an estimated 2.3 million new cases in 2020. Early detection is key to improving survival rates, and machine learning algorithms can aid in diagnosing and treating breast cancer. In this blog, we will explore how to load and analyze the breast cancer dataset using the scikit-learn library in Python.
The breast cancer dataset is included in scikit-learn's datasets module, which contains a variety of well-known datasets for machine learning. The features describe the characteristics of the cell nuclei present in the image. We can load the dataset using the load_breast_cancer function, which returns a dictionary-like object containing the data and metadata about the dataset.
It has been surmised that machine learning is mostly data exploration and data cleaning.
from sklearn.datasets import load_breast_cancer
import pandas as pd
#Load the breast cancer dataset
data = load_breast_cancer()
The data object returned by load_breast_cancer contains the feature data and the target variable. The feature data contains measurements of 30 different features, such as radius, texture, and symmetry, extracted from digitized images of fine needle aspirate (FNA) of breast mass. The target variable is binary, with a value of 0 indicating a malignant tumor and a value of 1 indicating a benign tumor.
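If you want to double-check that encoding yourself, a quick sketch against the object we just loaded prints the first few feature names and the target labels:
#peek at the metadata to confirm the encoding
print(data.feature_names[:5])
print(data.target_names) #['malignant' 'benign'] so 0 = malignant, 1 = benign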
We can convert the feature data and target variable into a pandas dataframe using the DataFrame constructor from the pandas library. We also add a column to the dataframe containing the target variable.
#Convert the data to a pandas dataframe
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
Finally, we can use the describe method of the pandas dataframe to get a summary of the dataset. The describe method returns a table containing the count, mean, standard deviation, minimum, quartiles, and maximum values for each feature column, as well as the same statistics for the target variable.
#Use the describe() method to get a summary of the dataset
print(df.describe())
From the summary statistics, we can see that the values of the features vary widely, with the mean radius ranging from 6.981 to 28.11 and the mean texture ranging from 9.71 to 39.28. We can also see that the target variable is roughly balanced, with 62.7% of the tumors being benign.
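As a quick sanity check on that balance, a one-line sketch over the dataframe we just built:
#proportion of each class (0 = malignant, 1 = benign)
print(df['target'].value_counts(normalize=True))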
Pretty nice utility.
Then again, in looking at this data, one would think we could get to first-principles engineering and root causes and make it go away? This directly affects motherhood, which i still believe is the hardest job in humanity. Makes you wonder where all the money goes?
Until then,
#iwishyouwater <- Free Diver Steph who is also a mom hunting pelagics on #onebreath
Muzak To Blog By: Peter Gabriel’s “Peter Gabriel 3: Melt” (remastered). He is coming out with a new album. Games Without Frontiers and Intruder are timeless. i applied long ago to work at Real World Studios and received the nicest rejection letter.
It enables us to dabble in vicarious vice and to sit in smug judgment on the result.
Online Quote Generator
First, i hope everyone is safe. Second, i haven’t written a Snake_Byte [ ] in quite some time, so here goes. This is a library i ran across late last night, and for what it achieves, even just for data exploration, it is well worth the pip install dabl cost of it all.
Data analysis is an essential task in the field of machine learning and artificial intelligence. However, it can be a challenging and time-consuming task, especially for those who are not familiar with programming. That’s where the dabl library comes into play.
dabl, short for Data Analysis Baseline Library, is a high-level data analysis library in python, designed to make data analysis as easy and effortless as possible. It is an open-source library, developed and maintained by the scikit-learn community.
The library provides a collection of simple and intuitive functions for exploring, cleaning, transforming, and visualizing data. With dabl, users can perform various data analysis tasks such as regression, classification, clustering, anomaly detection, and more, with just a few lines of code.
One of the main benefits of dabl is that it helps users get started quickly by providing a set of default actions for each task. For example, to perform a regression analysis, users can simply call the “regression” function and pass in their data, and dabl will take care of the rest.
Another advantage of dabl is that it provides easy-to-understand visualizations of the results, allowing users to quickly understand the results of their analysis and make informed decisions based on the data. This is particularly useful for non-technical users who may not be familiar with complex mathematical models or graphs.
dabl also integrates well with other popular data analysis libraries such as pandas, numpy, and matplotlib, making it a convenient tool for those already familiar with these libraries.
So let us jump into the code shall we?
This code uses the dabl library to perform a classification analysis on the Titanic dataset. The dataset is loaded using the pandas library, cleaned with dabl.clean, and passed to the dabl.SimpleClassifier for analysis. The fit method is used to fit the model to the data (a score method is also available to evaluate performance), and the dabl.plot function is used to visualize the dataset against the target.
import dabl
import pandas as pd
import matplotlib.pyplot as plt
# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
#check shape columns etc
titanic.shape
titanic.head()
#all that is good, tons of stuff going on here, but now let us ask dabl what's up:
titanic_clean = dabl.clean(titanic, verbose=1)
#a cool call to detect types
types = dabl.detect_types(titanic_clean)
print (types)
#lets do some eye candy
dabl.plot(titanic, 'survived')
#lets check the distribution
plt.show()
#let us try a simple classifier, if it works it works
# Perform classification analysis
fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)
Ok so let's break this down a little.
We load the data set: (make sure the target directory is the same)
# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
Of note we loaded this into a pandas dataframe. Assuming we can use python and load a comma-separated values file, let's now do some exploration:
pclass survived name \
0 1 1 Allen, Miss. Elisabeth Walton
1 1 1 Allison, Master. Hudson Trevor
2 1 0 Allison, Miss. Helen Loraine
3 1 0 Allison, Mr. Hudson Joshua Creighton
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
... ... ... ...
1304 3 0 Zabour, Miss. Hileni
1305 3 0 Zabour, Miss. Thamine
1306 3 0 Zakarian, Mr. Mapriededer
1307 3 0 Zakarian, Mr. Ortin
1308 3 0 Zimmerman, Mr. Leo
sex age sibsp parch ticket fare cabin embarked boat \
0 female 29 0 0 24160 211.3375 B5 S 2
1 male 0.9167 1 2 113781 151.55 C22 C26 S 11
2 female 2 1 2 113781 151.55 C22 C26 S ?
3 male 30 1 2 113781 151.55 C22 C26 S ?
4 female 25 1 2 113781 151.55 C22 C26 S ?
... ... ... ... ... ... ... ... ... ...
1304 female 14.5 1 0 2665 14.4542 ? C ?
1305 female ? 1 0 2665 14.4542 ? C ?
1306 male 26.5 0 0 2656 7.225 ? C ?
1307 male 27 0 0 2670 7.225 ? C ?
1308 male 29 0 0 315082 7.875 ? S ?
body home.dest
0 ? St Louis, MO
1 ? Montreal, PQ / Chesterville, ON
2 ? Montreal, PQ / Chesterville, ON
3 135 Montreal, PQ / Chesterville, ON
4 ? Montreal, PQ / Chesterville, ON
... ... ...
1304 328 ?
1305 ? ?
1306 304 ?
1307 ? ?
1308 ? ?
Wow tons of stuff going on here, and really this is cool data from an awful disaster. Ok, let's let dabl exercise some muscle here and ask it to clean it up a bit:
Ah sweet! So data science, machine learning or data mining is 80% cleaning up the data. Take what you can get and go with it folks. dabl even informs us that the target column looks like a classification problem. As the name suggests, Classification means classifying the data on some grounds. It is a type of Supervised learning. In classification, the target column should be a Categorical column. If the target has only two categories like the one in the dataset above (survived / did not survive), it’s called a Binary Classification Problem. When there are more than 2 categories, it’s a Multi-class Classification Problem. The “target” column is also called a “Class” in the Classification problem.
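To confirm that the target really is binary, a tiny sketch against the cleaned frame from above:
#count the categories in the target column
print(titanic_clean['survived'].value_counts())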
Now let's do some analysis. Yep, we are just getting to some statistics. There are univariate and bivariate analyses in this case.
Bivariate analysis is the simultaneous analysis of two variables. It explores the relationship between two variables: whether there exists an association and the strength of this association, or whether there are differences between the two variables and the significance of these differences.
The main three types we will see here are:
Categorical vs. Numerical
Numerical vs. Numerical
Categorical vs. Categorical data
Also of note Linear Discriminant Analysis or LDA is a dimensionality reduction technique. It is used as a pre-processing step in machine learning. The goal of LDA is to project the features in higher dimensional space onto a lower-dimensional space in order to avoid the curse of dimensionality and also reduce resources and dimensional costs. The original technique was developed in the year 1936 by Ronald A. Fisher and was named Linear Discriminant or Fisher’s Discriminant Analysis.
(NOTE: there is another LDA, Latent Dirichlet Allocation, which is used in semantic engineering and is quite different.)
dabl.plot(titanic, 'survived')
Among the plots that auto-magically happen are continuous feature plots for discriminant analysis.
Continuous Feature PairPlots
In the plots you will also see PCA (Principal Component Analysis). PCA was invented in 1901 by Karl Pearson, as an analog of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, and proper orthogonal decomposition (POD) in mechanical engineering. PCA is used extensively in many fields, and my first usage of it was in 1993 for three-dimensional rendering of sound.
Discriminating PCA Directions
What is old is new again.
The main difference is that Linear Discriminant Analysis is a supervised dimensionality reduction technique that also achieves classification of the data simultaneously. LDA focuses on finding a feature subspace that maximizes the separability between the groups. Principal Component Analysis, on the other hand, is an unsupervised dimensionality reduction technique; it ignores the class label. PCA focuses on capturing the direction of maximum variation in the data set.
LDA
Both reduce the dimensionality of the dataset and make it more computationally resourceful. LDA and PCA both form a new set of components.
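If you want to see the supervised versus unsupervised distinction in raw sklearn rather than through dabl, here is a minimal sketch; the iris dataset and the two-component setting are just my choices for illustration:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
#PCA ignores the labels entirely
X_pca = PCA(n_components=2).fit_transform(X)
#LDA uses the labels to maximize class separability
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_pca.shape, X_lda.shape)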
The last plot is categorical versus target.
So now let's try, as dabl suggested, a SimpleClassifier and then fit the data. (hey, some machine learning!)
fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)
This should produce the following outputs with accuracy metrics:
This actually calls the sklearn routines in aggregate. Looks like our old friend logistic regression works. keep it simple sam it ain’t gotta be complicated.
In conclusion, dabl is a highly recommended library for those looking to simplify their data analysis tasks. With its intuitive functions and visualizations, it provides a quick and easy way to perform data analysis, making it an ideal tool for both technical and non-technical users. Again, the real strength of dabl is in providing simple interfaces for data exploration. For more information:
#iwishyouwater <- hold your breath on a dive with my comrade at arms @corepaddleboards. great video and the clarity was astounding.
Muzak To Blog By: “Ballads For Two”, Chet Baker and Wolfgang Lackerschmid, trumpet meets vibraphone sparsity. The space between the notes is where all of the action lives.
First, as always, i hope everyone is safe. Second, as i mentioned in my last Snake_Byte [], let us do something a little more technical and scientific. For context, the catalyst for this was a surprising discussion that came from how current machine learning interviews are being conducted and how the basics of the distance between two vectors have been overlooked. So this is a basic example, and in the following Snake_Byte [] i promise to get into something a little more, say, carnivore.
With that let us move to some linear algebra. For those that don’t know what linear algebra is, i will refer you to the best book on the subject, Professor Gilbert Strang’s Linear Algebra and its Applications.
i am biased here; however, i do believe the two most important areas of machine learning and data science are linear algebra and probability, with optimization techniques coming in a close third.
So dear reader, please bear with me here. We will review a little math; maybe for some, this will be new, and for those that already know this, you can rest your glass-balls.
We denote $\mathbf{x} \in \mathbb{R}^n$ to be an $n$-dimensional vector taking real numbers as its entries. For example:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \in \mathbb{R}^3,$$

where $x_1, x_2, x_3$ are the entries and $1, 2, 3$ are the indices respectively. In this case $n = 3$.
An $m$-by-$n$ matrix is denoted as $\mathbf{A} \in \mathbb{R}^{m \times n}$. The transpose of a matrix is denoted as $\mathbf{A}^T$. A matrix can be viewed according to its columns and its rows:

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_n \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_m^T \end{bmatrix},$$

where $i \in [m]$ and $j \in [n]$ are the row and column indices.
An array is a data structure in python programming that holds a fixed number of elements, and these elements should be of the same data type. The main idea behind an array is storing multiple elements of the same type. Most data structures make use of an array to implement their algorithms. There are two important parts of an array:
Element: Each item stored in the array is called an element.
Index: Every element in the array has its own numerical value to identify the element.
Think of programming a loop, tuple, list, array, range or matrix:
from math import exp
x, y = 1.0, 2.0        # placeholder numbers so the "variables" below are concrete
x1, x2, x3 = 3, 4, 5   # placeholder numbers for the tuple of variables
v1 = [x, y]            # list of variables
v2 = (-1, 2)           # tuple of numbers
v3 = (x1, x2, x3)      # tuple of variables
v4 = [exp(-i*0.1) for i in range(150)] #ye ole range loop
and check this out for a matrix:
import numpy as np
a = np.matrix('0 1; 2 3')  # rows are separated by a semicolon
print (a)
output: [[0 1]
[2 3]]
which, folks, is why we like the Snake Language. Really that is about it for vectors and matrices. The theory is where you get into proofs and derivations, which can save you a ton of time on optimizations.
So now let’s double click on some things that will make you sound cool at the parties or meetups.
A vector can be multiplied by a number. This number is usually called a scalar and is denoted $\alpha \in \mathbb{R}$.
Now given this, one of the most fundamental operations in all of machine learning is the inner product, also called the dot product or scalar product, of two vectors, which is a number. Most machine learning algorithms have some form of a dot product somewhere within the depths of all the mathz. Nvidia GPUs are optimized for (you guessed it) dot products.
So how do we set this up? Multiplication of a scalar $\alpha$ and a vector $\mathbf{x}$ yields:

$$\alpha\mathbf{x} = \begin{bmatrix} \alpha x_1 \\ \alpha x_2 \\ \vdots \\ \alpha x_n \end{bmatrix}.$$
Ok good so far.
The inner or dot product of two $n$-vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as:

$$\mathbf{x}^T\mathbf{y} = \sum_{i=1}^{n} x_i y_i,$$

which, if you are paying attention, yields:

$$\mathbf{x} \cdot \mathbf{y} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \qquad (1)$$

Geometrically, the dot product of $\mathbf{x}$ and $\mathbf{y}$ equals the length of $\mathbf{x}$ times the length of $\mathbf{y}$ times the cosine of the angle between them:

$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|\,\|\mathbf{y}\|\cos\theta.$$
ok so big deal huh? yea, but check this out in the Snake_Language:
# dot product of two vectors
# Importing numpy module
import numpy as np
# Taking two scalar values
a = 5
b = 7
# Calculating dot product using dot()
print(np.dot(a, b))
output: 35
hey now!
# Importing numpy module
import numpy as np
# Taking two 2D array
# For 2-D arrays it is the matrix product
a = [[2, 1], [0, 3]]
b = [[1, 1], [3, 2]]
# Calculating dot product using dot()
print(np.dot(a, b))
output:[[5 4]
[9 6]]
Mathematically speaking, the inner product is a generalization of a dot product. As we said, constructing a vector is done using the command np.array. Inside this command, one needs to enter the array. For a column vector, we write [[1],[2],[3]], with an outer [] and three inner [] for each entry. If the vector is a row vector, one can omit the inner []’s by just calling np.array([1, 2, 3]).
Given two column vectors x and y, the inner product is computed via np.dot(x.T,y), where np.dot is the command for inner product, and x.T returns the transpose of x. One can also call np.transpose(x), which is the same as x.T.
# Python code to perform an inner product with transposition
import numpy as np
x = np.array([[1],[0],[-1]])
y = np.array([[3],[2],[0]])
z = np.dot(np.transpose(x),y)
print (z)
Yes, dear reader, you can now impress your friends with your linear algebra and python prowess.
Note: In this case, the dot product is not scale independent; for actual purposes of real computation you must do something called taking the norm of a vector. i won’t go into the mechanics of this unless asked for further explanations on the mechanics of linear algebra. i will gladly go into pythonic examples if so asked and will be happy to write about said subject. Feel free to inquire in the comments below.
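For completeness, here is a minimal sketch of what taking a norm looks like in numpy, reusing the column vectors from the snippet above; folding the norms into a cosine similarity is my own illustrative assumption of the usual next step:
#L2 norm (Euclidean length) of each vector
norm_x = np.linalg.norm(x)
norm_y = np.linalg.norm(y)
#scale-independent version of the dot product: cosine similarity
cosine = np.dot(np.transpose(x), y) / (norm_x * norm_y)
print(norm_x, norm_y, cosine)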
Until then,
#iwishyouwater <- Nathan Florence with Kelly Slater at the Box. Watch.
tctjr.
Muzak to Blog By: INXS. i had forgotten how good of a band they were and the catalog. Michael Hutchence, the lead singer, hung himself in a hotel room. Check out the songs “By My Side”, “Don’t Change”, “Never Tear Us Apart” and “To Look At You”. They weren’t afraid to take production chances.
Note[2]: i resurrected some very old content from a previous site i owned and imported the older blogs. Some hilarious. Some sad. Some infuriating. i’m shining them up. Feel free to look back in time.
Complexity control is the central problem of writing software in the real world.
Eric S. Raymond
AI-Generated Software Architecture Diagram
Hello dear readers! First i hope everyone is safe. Secondly, it is the monday-iest WEDNESDAY ever! Ergo it's time for a Snake_Byte!
Grabbing a tome off the bookshelf, we randomly open it, and the subject matter today is Module Packages. So there will not be much if any code, but more discussion as it were on the explanations thereof.
Module imports are the mainstay of the snake language.
A Python module is a file that has a .py extension, and a Python package is any folder that has modules inside it (or, if you're still in Python 2, a folder that contains an __init__.py file).
What happens when you have code in one module that needs to access code in another module or package? You import it!
In python a directory of modules is said to be a package, thus such imports are known as package imports. What happens in an import is that a dotted path is resolved into a directory on your local machine (your come-pooter) or on that cloud thing everyone talks about these days, and the code there is loaded.
It turns out that hierarchy simplifies search path complexity by organizing files and tends toward simpler search path settings.
Absolute imports are preferred because they are direct. It is easy to tell exactly where the imported resource is located and what it is just by looking at the statement. Additionally, absolute imports remain valid even if the current location of the import statement changes. In addition, PEP 8 explicitly recommends absolute imports. However, sometimes they get so complicated you want to use relative imports.
So how do imports work?
import dir1.dir2.mod
from dir1.dir2.mod import x
Note the “dotted path” in these statements is assumed to correspond to the path through the directories on the machine you are developing on. In this case it leads to mod.py: directory dir1 has a subdirectory dir2, which contains the module mod.py. Historically the dotted path syntax was created for platform neutrality, and from a technical standpoint paths in import statements become object paths.
In general the leftmost name in the dotted path must be found in a directory on the module search path, unless it is a top-level file in the home directory, and that is exactly where the search for the file begins.
In Python 3.x package imports changed slightly, and the changes only apply to imports within files located in package directories. The changes include:
Modification of the module import search path semantics to skip the package’s own directory by default; these imports are essentially absolute imports.
Extension of the syntax of from statements to allow them to explicitly request that imports search the package’s directory only; this is the relative import mentioned above.
so for instance:
from . import spam  #relative to this package
Instructs Python to import a module named spam located in the same package directory as the file in which this statement appears.
Similarly:
from .spam import name
states from a module named spam located in the same package as the file that contains this statement import the variable name.
Something to remember is that an import without a leading dot always causes Python to skip the relative components of the module import search path and looks instead in absolute directories that sys.path contains. You can only force the dot nomenclature with relative imports with the from statement.
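To make the absolute versus relative distinction concrete, here is a hypothetical package layout; the names mypkg, helpers, core, and load are mine, purely for illustration:
# hypothetical layout:
#   mypkg/
#       __init__.py
#       helpers.py   (defines a function load)
#       core.py      (the file doing the importing)
#
# inside mypkg/core.py:
from mypkg.helpers import load   # absolute import, searched via sys.path
from .helpers import load        # relative import, searched in this package only
from . import helpers            # relative import of the whole sibling module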
Packages are standard now in Python 3.x. It is now very common to see very large third-party extensions deployed as sets of package directories rather than flat lists of modules. Also, caveat emptor, using the relative import form judiciously can save memory. Read the documentation. Many times importing AllTheThings results in major memory usage, an issue when you are going to production with highly optimized python.
There is much more to this import stuff. Transitive Module Reloads, Managing other programs with Modules (meta-programming), Data Hiding etc. i urge you to go into the LazyWebTM and poke around.
MUZAK TO BLOG BY: NIN – “The Downward Spiral (Deluxe Edition)”. A truly phenomenal piece of work. NIN’s second album; Trent Reznor told Jimmy Iovine upon delivering the concept album, “I’m sorry I had to…”. In 1992, Reznor moved to 10050 Cielo Drive in Benedict Canyon, Los Angeles, where actress Sharon Tate formerly lived and where he made the record. i believe it changed the entire concept of music and created a new genre. From an engineering point of view, Digidesign‘s TurboSynth and Pro Tools were used extensively.
First i trust everyone is safe. Second it's WEDNESDAY so we got us a Snake_Byte! Today i wanted to keep this simple and fun and return to a set of fun methods that are included in the defacto standard for plotting in python, which is Matplotlib. The method(s) are called XKCD Style plotting via plt.xkcd().
If you don’t know what this is referencing, it is xkcd, sometimes styled XKCD, which is a webcomic created in 2005 by American author Randall Munroe. The comic’s tagline describes it as “a webcomic of romance, sarcasm, math, and language”. Munroe states on the comic’s website that the name of the comic is not an initialism but “just a word with no phonetic pronunciation”. i personally have read it since its inception in 2005. The creativity is astounding.
Which brings us to the current Snake_Byte. If you want to have some fun and creativity in your marchitecture[1] and spend fewer hours on those power points, bust out some plt.xkcd() style plots!
So really that is all there with all the bells and whistles that matplotlib has to offer.
The following script was based on Randall Munroe’s Stove Ownership.
(Some will get the inside industry joke.)
import matplotlib.pyplot as plt
import numpy as np

with plt.xkcd():
    # Based on "Stove Ownership" from XKCD by Randall Munroe
    # https://xkcd.com/418/
    fig = plt.figure()
    ax = fig.add_axes((0.1, 0.2, 0.8, 0.7))
    ax.spines.right.set_color('none')
    ax.spines.top.set_color('none')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_ylim([-30, 10])
    data = np.ones(100)
    data[70:] -= np.arange(30)
    ax.annotate(
        'THE DAY I TRIED TO CREATE \nAN INTEROPERABLE SOLUTION\nIN HEALTH IT',
        xy=(70, 1), arrowprops=dict(arrowstyle='->'), xytext=(15, -10))
    ax.plot(data)
    ax.set_xlabel('time')
    ax.set_ylabel('MY OVERALL MENTAL SANITY')
    fig.text(
        0.5, 0.05,
        '"Stove Ownership" from xkcd by Randall Munroe',
        ha='center')

plt.show()
Interoperability In Health IT
So dear readers there it is, an oldie but goodie, and it is so flexible! Add it to your slideware or marchitecture or just add it because it's cool.
Until Then,
#iwishyouwater <- Mentawis surfing paradise. At least someone is living.
Muzak To Blog By: NULL
[1] Marchitecture is a portmanteau of the words marketing and architecture. The term is applied to any form of electronic architecture perceived to have been produced purely for marketing reasons and has in many companies replaced actual software creation.
First, i trust everyone is safe. Second: Hey Now! Wednesday is already here again! What did Willy Wonka say? “So Much Time And So Little To Do!” Or better yet “Time Is Fun When You Are Having Flies!” Snake_Byte[8] Time!
This is a serendipitous one because i stumbled onto a library that uses a library that i mentioned in my last Snake_Bytes which was pandas. It’s called MitoSheets and it auto-generates code for your data wrangling needs and also allows you to configure and graph within your Jupyter_Lab_Notebooks. i was skeptical.
So we will start at the beginning which is where most things start:
i am making the assumption you are either using a venv or conda etc. i use a venv so here are the installation steps:
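For reference, the two-step process at the time of writing looked something like the following; this is from memory, so treat it as an assumption and check the Mito documentation for the current incantation:
python -m pip install mitoinstaller
python -m mitoinstaller install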
Note the two-step process; you need both commands to instantiate the entire library.
Next crank up ye ole Jupyter Lab:
import mitosheet
mitosheet.sheet()
It throws up a wonky splash screen to grab your digits and email to push you information on the Pro_version i imagine.
Then you can select a file. i went with the nba.csv file from the last blog Snake_Bytes[7] Pandas Not The Animal. Find it here:
Then lo and behold it spit out the following code:
from mitosheet import *; register_analysis("id-ydobpddcec");
# Imported nba.csv
import pandas as pd
nba = pd.read_csv(r'nba.csv')
register_analysis("id-ydobpddcec") is locked to the respective file.
So how easy is it to graph? Well, it was trivial. Select graph then X & Y axis:
Team Members vs Team Graph
Graph Configuration
So naturally i wanted to change the graph to purple and add some grid lines with a legend to test the export and here was the result:
They gotcha!
As Henry Ford said, you can have any color car as long as it is black. In this case you are stuck with the above graph, which while useful is not going to catch anyone's eye.
Then i tried to create a pivot table and it spit out the following code:
from mitosheet import *; register_analysis("id-ydobpddcec");
# Imported nba.csv
import pandas as pd
nba = pd.read_csv(r'nba.csv')
# Pivoted into nba
tmp_df = nba[['Team', 'Position', 'Number']]
pivot_table = tmp_df.pivot_table(
index=['Team'],
columns=['Number'],
values=['Position'],
aggfunc={'Position': ['count']}
)
pivot_table.set_axis([flatten_column_header(col) for col in pivot_table.keys()], axis=1, inplace=True)
nba_pivot = pivot_table.reset_index()
Note the judicious use of our friend the pandas library.
Changing the datatype is easy:
from salary to datetime_ascending
from mitosheet import *; register_analysis("id-ydobpddcec");
# Imported nba.csv
import pandas as pd
nba = pd.read_csv(r'nba.csv')
# Changed Salary to dtype datetime
import pandas as pd
nba['Salary'] = pd.to_datetime(nba['Salary'], unit='s', errors='coerce')
It also lets you clear the current analysis:
Modal Dialog
So i started experimenting with the filtering:
Player Weight < 180.0 lbs
from mitosheet import *; register_analysis("id-ydobpddcec");
# Imported nba.csv
import pandas as pd
nba = pd.read_csv(r'nba.csv')
# Filtered Weight
nba = nba[nba['Weight'] < 180]
The views for modification are on the right side of the layout of the table which is very convenient. The automatic statistics and visualizations are helpful as well:
Unique Ascending Values
Weight Frequencies < 180.0 lbs
The max,min,median, and std are very useful and thoughtful:
Rule Based Summary Statistics
The following in and of itself could be enough to pip install the library:
DataFrame Gymnastics
You can even have multiple dataframes that can be merged. Between those items and the summary stats, for those that are experienced this could be worth the price of entry to pip install and then install the library. For those that really don't know how to code, this allows you to copypasta code and learn some pretty basic yet very powerful immediate insights into data. Also, if you are a business analyst, a developer could get you going in no time with this library.
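Under the hood a merge like that comes out as plain pandas. Here is a minimal hand-written sketch of the same idea; the second dataframe and the join on College are purely my assumptions for illustration:
import pandas as pd

nba = pd.read_csv(r'nba.csv')
#hypothetical second dataframe keyed on College
conferences = pd.DataFrame({'College': ['Texas', 'Georgia State'],
                            'Conference': ['Big 12', 'Sun Belt']})
#left merge keeps every player and attaches the conference where it matches
merged = nba.merge(conferences, on='College', how='left')
print(merged[['Name', 'College', 'Conference']].head())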
i don’t particularly like the lockouts on the paywall for features. In today’s age of open-source humans will get around that issue and just use something else, especially the experienced folks. However, what caught my attention was the formatting and immediate results with a code base that is useful elsewhere, so i think the Mito developer team is headed in the right direction. i really can see this library evolving and adding sklearn and who knows Github Copilot. Good on them.
Give it a test drive.
Until Then,
#iwishyouwater <- #OuterKnown Tahiti Pro 2022 – Best Waves
Muzak To Blog By: Tracks from “Joe’s Garage” by Frank Zappa. “A Little Green Rosetta” is hilarious as well as a testament to Zappa’s ability to put together truly astounding musicians. i love the central scrutinizer, and “Watermelon in Easter Hay” i believe is one of the best guitar pieces of all time. Even Zappa said it was one of his best pieces and to this day Dweezil Zappa is the only person allowed to play it. One of my readers, when i reviewed the Zappa documentary, called the piece “intoxicating”. Another exciting aspect of this album is that he used live guitar solos and dubbed them into the studio work, except for “Watermelon in Easter Hay”. The other Muzak was by a band that put Atlanta on the map: Outkast. Speakerboxxx is phenomenal and Andre3000 is an amazing musician. “Prototype” and “Pink & Blue”. Wew.
DISCLAIMER: This blog was written some time ago. Software breaks once in a while and there was a ghost in my LazyWebTM machine. We are back to our regularly scheduled program. Read on Dear Reader, and your humble narrator apologizes.
The other day i was talking to someone about file manipulations and whatnot and mentioned how awesome Pandas and the magic of the df.DoWhatEverYaWant(my_data_object) via a dataframe was, and they weren’t really familiar with Pandas. So being that no one knows everything, i figured i would write a Snake_Byte[] about Pandas. i believe i met the author of pandas – Wes McKinney – at a PyData conference years ago at Facebook. Really nice human who has created one of the most used libraries for data wrangling.
One of the most nagging issues with machine learning, in general, is the access of high integrity canonical training sets or even just high integrity data sets writ large.
By my estimate over the years having performed various types of learning systems and algorithm development, machine learning is 80% data preparation, 10% data piping, 5% training, and 5% banging your head against the keyboard. Caveat Emptor – variable rates apply, depending on the industry vertical.
It is well known that there are basically three main attributes to the integrity of the data: complete, atomic, and well-annotated.
Complete data sets mean analytical results for all required influent and effluent constituents as specified in the effluent standard for a specific site on a specific date.
Atomic data sets are data elements that represent the lowest level of detail. For example, in a daily sales report, the individual items that are sold are atomic data, whereas roll-ups such as invoices and summary totals from invoices are aggregate data.
Well-annotated data sets are the categorization and labeling of data for ML applications. Training data must be properly categorized and annotated for a specific use case. With high-quality, human-powered data annotation, companies can build and improve ML implementations. This is where we get into issues such as Gold Standard Sets and Provenance of Data.
Installing Pandas:
Note: Before you install Pandas, you must bear in mind that it supports only Python versions 3.7, 3.8, and 3.9.
I am also assuming you are using some type of virtual environment.
As per the usual installation packages for most Python libraries:
pip install pandas
You can also choose to use a package manager in which case it’s probably already included.
#import pandas pd is the industry shorthand
import pandas as pd
#check the version
pd.__version__
[2]: '1.4.3'
Ok we have it set up correctly.
So what is pandas?
Glad you asked. i have always thought of pandas as enhancing numpy, as pandas is built on numpy. numpy is the fundamental library of python used to perform scientific computing. It provides high-performance multidimensional arrays and tools to deal with them. A numpy array is a grid of values (of the same type) indexed by a tuple of positive integers; numpy arrays are fast, easy to understand, and give users the right to perform calculations across arrays. pandas, on the other hand, provides high-performance, fast, easy-to-use data structures and data analysis tools for manipulating numeric data and, most importantly, time series manipulation.
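To make the "pandas enhances numpy" point concrete, a tiny sketch contrasting a bare numpy array with the same data wrapped in a labeled pandas Series; the label names are mine, for illustration:
import numpy as np
import pandas as pd

arr = np.array([0.25, 0.5, 0.75])              # positions 0, 1, 2 only
ser = pd.Series([0.25, 0.5, 0.75],
                index=['low', 'mid', 'high'])  # labeled index on top of the numpy data
print(arr[1])      # 0.5, positional access
print(ser['mid'])  # 0.5, label-based access
print(ser.values)  # the underlying numpy array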
So let's start with the pandas Series object, which is a one-dimensional array of indexed data that can be created from a list or an array:
data = pd.Series([0.1,0.2,0.3,0.4, 0.5])
data
[5]: 0 0.1
1 0.2
2 0.3
3 0.4
4 0.5
dtype: float64
The cool thing about this output is that Series creates and wraps both a sequence and the related indices; ergo we can access both the values and index attributes. To double-check this we can access the values:
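Something like the following sketch, using the series we just built:
print(data.values)  # array([0.1, 0.2, 0.3, 0.4, 0.5])
print(data.index)   # RangeIndex(start=0, stop=5, step=1)
print(data[1])      # 0.2, square-bracket access just like numpy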
You can access the associated values via the [ ] square brackets just like numpy; however, pandas.Series is much more flexible than the numpy counterpart that it emulates. They say imitation is the highest form of flattery.
Let's go grab some data from the LazyWebTM:
If one really thinks about the aspects of pandas.Series, it is really a specialized version of a python dictionary. For those unfamiliar, a dictionary (dict) is a python structure that maps arbitrary keys to a set of arbitrary values. Super powerful for data manipulation and data wrangling. Taking this a step further, pandas.Series is a structure that maps typed keys to a set of typed values. The typing is very important: just as the type-specific compiled code within numpy arrays makes them much more efficient than a python list, in the same vein pandas.Series is much more efficient than python dictionaries. pandas.Series also has an insane amount of commands.
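To see the dictionary analogy in action, a minimal sketch; the state names and numbers are made up for illustration:
import pandas as pd

population = {'Texas': 29_500_000, 'Georgia': 10_700_000, 'Oregon': 4_200_000}
pop = pd.Series(population)
print(pop['Georgia'])           # dictionary-style key access
print(pop['Georgia':'Oregon'])  # array-style label slicing a plain dict cannot do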
Next, we move to what i consider the most powerful aspect of pandas the DataFrame. A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.
# Python code to demonstrate creating a
# DataFrame from a dict of narrays / lists
# By default the integer index is assigned automatically
import pandas as pd
# initialise data of lists
data = {'Name':['Bob', 'Carol', 'Alice', ''],
'Age':[18, 20, 22, 24]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
[8]:Name Age
0 Bob 18
1 Carol 20
2 Alice 22
3 24
Let's grab some data. nba.csv is a flat file of NBA statistics of players:
i don’t watch or follow sports so i don’t know what is in this file. Just did a google search for csv statistics and this file came up.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
[9]:
Team Boston Celtics
Number 0.0
Position PG
Age 25.0
Height 6-2
Weight 180.0
College Texas
Salary 7730337.0
Name: Avery Bradley, dtype: object
Team Boston Celtics
Number 28.0
Position SG
Age 22.0
Height 6-5
Weight 185.0
College Georgia State
Salary 1148640.0
Name: R.J. Hunter, dtype: object
How nice is this? Easy Peasy. It seems almost too easy.
Just to show how far reaching pandas is now in the data science world for all of you who think you may need to use Spark there is a package called PySpark. In PySpark A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions. Once created, it can be manipulated using the various domain-specific-language (DSL) functions much like your beloved SQL.
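Just as a taste, and purely a sketch on my part assuming pyspark is installed, converting a slice of the pandas frame from above into a Spark DataFrame might look something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nba").getOrCreate()
#build a Spark DataFrame from a clean slice of the pandas DataFrame loaded earlier
pdf = data.reset_index()[["Name", "Team", "Position"]].dropna()
sdf = spark.createDataFrame(pdf)
sdf.filter(sdf.Team == "Boston Celtics").show(5)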
Which might be another Snake_Byte in the future.
i also found pandas being used in ye ole #HealthIT #FHIR for, as we started this off with, csv manipulation. Think of this Snake_Byte as an Ouroboros.
This github repo converts csv2fhir (can haz interoperability?):
with pd.read_csv(file_path, **csv_reader_params) as buffer:
    for chunk in buffer:
        chunk: DataFrame = execute(chunk_tasks, chunk)
        # increment the source row number for the next chunk/buffer processed
        # add_row_num is the first task in the list
        starting_row_num = chunk["rowNum"].max() + 1
        chunk_tasks[0] = Task(name="add_row_num", params={"starting_index": starting_row_num})
        chunk: Series = chunk.apply(_convert_row_to_fhir, axis=1)
        for processing_exception, group_by_key, fhir_resources in chunk:
            yield processing_exception, group_by_key, fhir_resources
So this brings us to the end of this Snake_Byte. Hope this gave you a little taste of a great python library that is used throughout the industry.
One other mention on the Muzak To Blog By must go to the fantastic Greek Composer, Evángelos Odysséas Papathanassíou (aka Vangelis) who recently passed away. We must not let the music be lost like tears in the rain, Vangelis’ music will live forever. Rest In Power, Maestro Vangelis. i have spent many countless hours listening to your muzak and now the sheep are truly dreaming. Listen here -> Memories Of Green.
Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius and a lot of courage to move in the opposite direction.
E.F. Schumacher
First, i hope everyone is safe.
Second, i had meant this for reading over Thanksgiving, but transparently i was having technical difficulties with LaTeX rendering, and it appears that both MathJax and native LaTeX are not working on my site. For those interested, i even injected the MathJax code into my .php header. Hence i had to rewrite a bunch of stuff, alas with no equations. Although for some reason unbeknownst to me my table worked.
Third, Hey its time for a Snake_Byte [] !
In this installment, i will be discussing Algorithm Complexity and will be using a Python method that i previously wrote about in Snake_Byte[5]: Range.
So what is algorithm complexity? Well, you may remember in your mathematics or computer science classes “Big Oh” notation. For those that don’t know, this involves both space and time complexity, not to be confused with Space-Time Continuums.
Let’s hit the LazyWeb and particularly Wikipedia:
“Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. It is a member of a family of notations invented by Paul Bachmann, Edmund Landau, and others collectively called Bachmann–Landau notation or asymptotic notation.”
— Wikipedia’s definition of Big O notation
Hmmm. Let’s try to parse that a little better shall we?
So you want to figure out how slow or hopefully how fast your code is using fancy algebraic terms and terminology. So you want to measure the algorithmic behavior as a function of two variables: time complexity and space complexity. Time is both the throughput as well as how fast from t(0) to t(n-1) the algorithm operates. Then we have space complexity, which is literally how much memory (either in memory or persistent memory) the algorithm requires as a function of the input. As an added bonus you can throw around the word asymptotic:
From Dictionary.com
/ (ˌæsɪmˈtɒtɪk) / adjective. of or referring to an asymptote. (of a function, series, formula, etc) approaching a given value or condition, as a variable or an expression containing a variable approaches a limit, usually infinity.
Ergo asymptotic analysis means how the algorithm responds “to” or “with” values that approach ∞.
So “Hey what’s the asymptotic response of the algorithm?”
Hence we need a language that will allow us to say that the computing time, as a function of (n), grows ‘on the order of n^3,’ or ‘at most as fast as n^3,’ or ‘at least as fast as n log n,’ etc.
There are five symbols that are used in the language of comparing the rates of growth of functions they are the following five: ‘o’ (read ‘is little oh of’), O (read ‘is big oh of’), ‘θ’ (read ‘is theta of’), ‘∼’ (read ‘is asymptotically equal to’ or, irreverently, as ‘twiddles’), and Ω (read ‘is omega of’). It is interesting to note there are discrepancies amongst the ranks of computer science and mathematics as to the accuracy and validity of each. We will just keep it simple and say Big-Oh.
So let f(x) and g(x) be two functions of x, where each of the five symbols above is intended to compare the rapidity of growth of f and g. If we say that f(x) = o(g(x)), then informally we are saying that f grows more slowly than g does when x is very large.
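A quick numeric sketch of that "grows more slowly" statement, using f(n) = n**2 and g(n) = n**3 as my example functions:
# the ratio f(n)/g(n) heads to zero as n grows, i.e. f(n) = o(g(n))
for n in [10, 100, 1_000, 10_000]:
    print(n, (n**2) / (n**3))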
Let’s address the time complexity piece. i don’t want to get philosophical on What is Time? So for now and for this blog i will make the bounds just like an arrow: t(0) – t(n-1).
That said, the analysis of the algorithm is for an order of magnitude, not the actual running time. There is a python module called time that we can use to do an exact analysis of the running time. Remember this is to save you time upfront to gain an understanding of the time complexity before and while you are designing said algorithm.
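If you do want wall-clock numbers, a minimal sketch with the time module; perf_counter is my choice here, and timeit is the more rigorous tool:
import time

n = 1_000_000
start = time.perf_counter()
total = 0
for i in range(n):  # linear amount of work
    total += i
elapsed = time.perf_counter() - start
print(f"looped over {n} items in {elapsed:.4f} seconds")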
Most arithmetic operations are constant time; multiplication usually takes longer than addition and subtraction, and division takes even longer, but these run times don’t depend on the magnitude of the operands. Very large integers are an exception; in that case, the run time increases with the number of digits.
So for Indexing operations whether reading or writing elements in a sequence or dictionary are also constant time, regardless of the size of the data structure.
A for loop that traverses a sequence or dictionary is usually linear, as long as all of the operations in the body of the loop are constant time.
The built-in function sum is also linear because it does the same thing, but it tends to be faster because it is a more efficient implementation; in the language of algorithmic analysis, it has a smaller leading coefficient.
If you use the same loop to “add” a list of strings, the run time is quadratic because string concatenation is linear.
The string method join is usually faster because it is linear in the total length of the strings.
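To see the quadratic-versus-linear claim for strings in practice, a small sketch; the sizes are arbitrary, and on a modern CPython the += case is partially optimized, so treat it as illustrative:
words = ["snake"] * 50_000

# quadratic-ish: each += may copy the whole accumulated string
s = ""
for w in words:
    s += w

# linear: join allocates once based on the total length
t = "".join(words)

print(len(s) == len(t))  # True, same result either way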
So let’s look at an example using the previous aforementioned range built-in function:
So this is much like the linear example above: The lowest complexity is O(1). When we have a loop:
n, m = 4, 3  # placeholder sizes so the loop actually runs
k = 0
for i in range(n):
    for j in range(m):
        print(i)
        k = k + 1
In this case, for nested loops, we multiply the time complexities, thus O(n*m). It also works the same when a loop with time complexity O(n) calls a function with time complexity O(m). When calculating complexity we omit the constant, regardless of whether it executes 5 or 100 times.
When you are performing an analysis look for worst-case boundary conditions or examples.
Linear O(n):
def has_zero(t, n):
    for i in range(n):
        if t[i] == 0:
            return 0
    return 1
Quadratic O(n**2):
def count_pairs(n, m):
    res = 0
    for i in range(n):
        for j in range(m):
            res += 1
    return res
There are other types of time complexity like exponential time and factorial time. Exponential Time is O(2**n) and Factorial Time is O(n!).
For space complexity, memory has a limit, especially if you have ever chased down a heap allocation or garbage collection bug. Like we said earlier, there is no free lunch: you either trade space for time or time for space. Data-driven architectures respond to the input size of the data, thus the dimensionality of the input space needs to be addressed. If you have a constant number of variables: O(1). If you need to declare an array, for instance using numpy with (n) elements, then you have linear space complexity O(n). Remember these are independent of the size of the problem.
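A tiny sketch of that space distinction; the array size is arbitrary:
import numpy as np

def constant_space(n):
    total = 0              # O(1) space: a fixed number of variables no matter how big n is
    for i in range(n):
        total += i
    return total

def linear_space(n):
    scratch = np.zeros(n)  # O(n) space: the buffer grows with the input size
    return scratch.sum()

print(constant_space(1_000), linear_space(1_000))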
For a great book on Algorithm Design and Analysis i highly recommend:
It goes in-depth into growth rates and dominance relations etc. as they relate to graph algorithms, search and sorting, as well as cryptographic functions.
There is also a trilogy of sorts: Algorithms Unlocked by Cormen and Algorithms Illuminated by Roughgarden, which are great and less mathematically rigorous if that is not your forte.
Well, i hope this gave you a taste. i had meant this to be a much longer and more in-depth blog however i need to fix this latex issue so i can properly address the matters at hand.
Until then,
#iwishyouwater <- Alexey Molchanov new world freedive record. He is a really awesome human.
Muzak To Blog By: Maddalena (Original Motion Picture Soundtrack) by the Maestro Ennio Morricone – Rest in Power Maestro i have spent many hours listening to your works.
Second, i will be moving the frequency of Snake_Bytes [] to every other Wednesday. This is to provide higher quality information and also to allow me space and time to write other blogs. i trust, dear reader, y’all do not mind.
Third, i noticed i was remiss in explaining a function i used in a previous Snake_Byte [ ] that of the Python built-in function called range.
Range is a very useful function for, well, creating iterations on variables and loops.
# lets see how this works:
list(range(4))
[0, 1, 2, 3]
How easy can that be?
Four items were returned. Now we can create a range or a for loop over that list – very meta huh?
Please note in the above example the list starts off with 0. So what if you want your range function to start with 1 base index instead of 0? You can specify that in the range function:
# Start with 1 for the initial index
list(range(1, 4))
[1, 2, 3]
Note that the last number, the stop value, is not included; to be inclusive of the entire index you must go one past the last value you want.
Lets try something a little more advanced with some eye candy:
%matplotlib inline
import matplotlib.pyplot as plt

x_cords = range(-50, 50)
y_cords = [x*x for x in x_cords]
plt.plot(x_cords, y_cords)
plt.show()
X^2 Function aka Parabola
We passed a computation into the loop to compute over the indices of range x in this case.
In one of the previous Snake_Bytes[] i utilized a for loop and range which is extremely powerful to iterate over sequences:
for i in range (3):
    print(i,"Pythons")
0 Pythons
1 Pythons
2 Pythons
For those that really need power when it comes to indexing, sequencing and iteration you can change the list for instance, as we move across it. For example:
L = [1,2,3,4,5,6]
# now add one to each row
# or L[i] = L[i] + 1 used all
# the time in matrix operations
for i in range(len(L)):
    L[i] += 1
print (L)
[2,3,4,5,6,7]
Note there is a more “slick” way to do this with a list comprehension without changing the original list in place. However, that’s outside the scope, if you will, of this Snake_Byte[]. Maybe i should do that for the next one?
Well, i hope you have a slight idea of the power of range.
Also, i think this was more “byte-able” and not tl;dr. Let me know!
Expose yourself to as much randomness as possible.
~ Ben Casnocha
A Visualization Of Randomness
First i trust everyone is safe.
Second it is WEDNESDAY and that must mean a Snake_Byte or you are working in a startup because every day is WEDNESDAY in a startup!
i almost didn’t get this one done because well life happens but i want to remain true to the goals herewith to the best of my ability.
So in today’s Snake_Byte we are going to cover Random and PseudoRandom Numbers. i really liked this one because it was more in line with scientific computing and numerical optimization.
The random module in Python generates what is called pseudorandom numbers. It is in the vernacular a pseudorandom number generator (PRNG). This generation includes different types of distributions for said numbers.
So what is a pseudorandom number:
“A pseudorandom number generator (PRNG), also known as a deterministic random bit generator, is an algorithm for generating a sequence of numbers whose properties approximate the properties of sequences of random numbers.” ~ Wikipedia
The important aspect here is: the properties approximate sequences of random numbers. So this means that it is statistically random even though it was generated by a deterministic process.
While i have used the random module and have even generated various random number algorithms, i learned something new in this blog. The pseudorandom number generator in Python uses an algorithm called the Mersenne Twister algorithm. The period of said algorithm is 2**19937-1 for the 32-bit version, and there is also a 64-bit version. The underlying implementation in C is both fast and thread-safe. The Mersenne Twister is one of the most extensively tested random number generators in existence. One issue, though, is that due to the deterministic nature of the algorithm it is not suitable for cryptographic methods.
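Because the generator is deterministic, seeding it makes runs reproducible, and when you do need cryptographic-strength randomness the standard library's secrets module is the usual answer. A minimal sketch of both:
import random
import secrets

random.seed(42)  # same seed -> same "random" sequence every run
print([random.randint(0, 9) for _ in range(5)])

random.seed(42)
print([random.randint(0, 9) for _ in range(5)])  # identical to the line above

print(secrets.token_hex(8))  # cryptographically strong, not reproducible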
Let us delve down into some code into the various random module offerings, shall we?
i like using %system in Jupyter Lab to create an interactive session. First we import random. Let's look at random.random(), which returns a float drawn from a uniform distribution over [0, 1); when multiplied by an integer it is bounded within that range:
%system
import random

for i in range (5):
    x = random.random() * 100
    print (x)
Next let us look at random.randrange(start, stop[, step]) which returns a randomly selected element from range(start, stop, step). This is equivalent to choice(range(start, stop, step)) but doesn’t actually build a range object.
random.randrange parameters:
start – Optional. An integer specifying at which position to start. Default 0.
stop – Required. An integer specifying at which position to end.
step – Optional. An integer specifying the incrementation. Default 1.
for i in range (5):
    print(random.randrange(10, 100, 1))
84
21
94
91
87
Now let us move on to some calls that you would use in signal processing, statistics or machine learning. The first one is gauss(). gauss() returns a gaussian distribution using the following mathematics:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Gaussian distribution (also known as normal distribution) is a bell-shaped curve (aka the bell curve), and it is assumed that during any measurement values will follow a normal distribution with an equal number of measurements above and below the mean value.
gauss() parameters:
mu – the mean
sigma – the standard deviation
returns – a random gaussian distribution floating point number
# import the required libraries
import random
import matplotlib.pyplot as plt
#set the inline magic
%matplotlib inline
# store the random numbers in a list
nums = []
mu = 100
sigma = 50
for i in range(100000):
    temp = random.gauss(mu, sigma)
    nums.append(temp)
# plot the distribution
plt.hist(nums, bins = 500, ec="red")
plt.show()
Gaussian Distribution in Red
There are several more parameters in the random module, setter functions, seed functions and very complex statistical functions. Hit stack overflow and give it a try! Also it doesn’t hurt if you dust off that probability and statistics textbook!
As a last thought, which came first: the framework of entropy or the framework of randomness? As well as, is everything truly random? i would love to hear your thoughts in the comments!
M. Matsumoto and T. Nishimura, “Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator”, ACM Transactions on Modeling and Computer Simulation Vol. 8, No. 1, January pp.3–30 1998
Muzak To Blog By: Black Sabbath – The End: Live In Birmingham