Snake_Byte[15] Fourier, Discrete and Fast Transformers

The frequency domain of mind (a mind, it must be stressed, is an unextended, massless, immaterial singularity) can produce an extended, spacetime domain of matter via ontological Fourier mathematics, and the two domains interact via inverse and forward Fourier transforms.

~ Dr. Cody Newman, The Ontological Self: The Ontological Mathematics of Consciousness

I am Optimus Transformer Ruler Of The AutoCorrelation Bots

First i trust everyone is safe. i haven't written a technical blog in a while, so i figured i would write a Snake_Byte on one of my favorite equations, The Fourier Transform:

    \[\hat{f} (\xi)=\int_{-\infty}^{\infty}f(x)e^{-2\pi ix\xi}dx\]

More specifically we will be dealing with the Fast Fourier Transform, which is an implementation of The Discrete Fourier Transform. The Fourier Transform operates on continuous signals, and while i do believe we will have analog computing devices (again) in the future, we have to operate on 0's and 1's at this juncture, thus we have a discrete version thereof. The discrete version:

    \[f[k] = \sum_{j=0}^{N-1} x[j]\left(e^{-2\pi i k/N}\right)^j, \qquad 0 \leq k < N\]

which the FFT evaluates recursively by splitting the sum into even- and odd-indexed halves:

    \[f[k] = f_e[k] + e^{-2\pi i k/N}f_o[k]\]
    \[f[k+N/2] = f_e[k] - e^{-2\pi i k/N}f_o[k]\]

The Discrete Fourier Transform (DFT) is a mathematical operation. The Fast Fourier Transform (FFT) is an efficient algorithm for the evaluation of that operation (actually, a family of such algorithms). However, it is easy to get these two confused. Often, one may see a phrase like “take the FFT of this sequence”, which really means to take the DFT of that sequence using the FFT algorithm to do it efficiently.
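To make that distinction concrete, here is a direct O(N^2) evaluation of the DFT sum above — a toy sketch of mine, not production code — that computes exactly what the FFT computes, just slowly:

import numpy as np

def naive_dft(x):
    # Direct O(N^2) evaluation of the DFT sum; the FFT computes the same thing efficiently.
    N = len(x)
    k = np.arange(N).reshape(-1, 1)   # column of output indices
    j = np.arange(N)                  # row of input indices
    return (np.asarray(x) * np.exp(-2j * np.pi * k * j / N)).sum(axis=1)

x = np.random.rand(16)
print(np.allclose(naive_dft(x), np.fft.fft(x)))  # True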

The Fourier basis is one kernel among any number of transform kernels, and ideally the kernel is matched to the signal. The Fourier Transform kernel is built from \sin(\theta) and \cos(\theta) terms, which makes it really useful for audio and radar analysis.

The FFT computes the sequence in O(n\log{}n) operations instead of the O(n^2) required by the direct DFT, and as one would imagine this is a substantial gain. The most commonly used FFT algorithm is the Cooley-Tukey algorithm, named after J. W. Cooley and John Tukey. It is a divide and conquer algorithm for the machine calculation of complex Fourier series: it breaks the DFT into smaller DFTs. Other FFT algorithms include Rader's algorithm, the Winograd Fourier transform algorithm, the Chirp Z-transform algorithm, etc. The only rub comes as a function of the delay and throughput of the particular implementation.
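To make the divide and conquer idea concrete, here is a minimal sketch of the radix-2 Cooley-Tukey recursion (again my own toy version, assuming the input length is a power of two); it is exactly the even/odd split written above:

import numpy as np

def radix2_fft(x):
    # Minimal recursive radix-2 Cooley-Tukey FFT; assumes len(x) is a power of two.
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    f_e = radix2_fft(x[0::2])   # DFT of the even-indexed samples
    f_o = radix2_fft(x[1::2])   # DFT of the odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([f_e + twiddle * f_o,
                           f_e - twiddle * f_o])

x = np.random.rand(8)
print(np.allclose(radix2_fft(x), np.fft.fft(x)))  # True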

There have been amazing textbooks written on this subject and i will list them at the end of the blarg [1,2,3].

So let's get on with some code. First we do the usual housekeeping on import libraries as well as doing some majik for inline display if you are using Jupyter Notebooks. Of note is fftpack, which wraps FFTPACK, a package of Fortran subroutines for the fast Fourier transform. It includes complex, real, sine, cosine, and quarter-wave transforms. It was developed by Paul Swarztrauber of the National Center for Atmospheric Research, and is included in the general-purpose mathematical library SLATEC.

# House keeping libraries imports and inline plots:
import numpy as np
from scipy import fftpack
%matplotlib inline
import matplotlib.pyplot as plt

We now set up a signal: a sinusoid with a given frequency and sampling rate. We use linspace to set up the time axis over the signal length.

#frequency in cycles per second or Hertz
#(concert A would be 440 Hz; 20 Hz just keeps the plot readable)

Frequency = 20
# Sampling rate or the number of measurements per second
# (CD-quality digital audio samples at 44.1 kHz; 100 Hz keeps the example small)

Sampling_Frequency = 100

# set up the signal space:
time = np.linspace(0,2,2 * Sampling_Frequency, endpoint = False)
signal = np.sin(Frequency * 2 * np.pi * time)

Next we plot the sinusoid under consideration:

# plot the signal:
fig, ax = plt.subplots()
ax.plot(time, signal)
ax.set_xlabel('Time [seconds]')
ax.set_ylabel('Signal Amplitude')

Next we apply the Fast Fourier Transform and transform into the frequency domain:

X_Hat = fftpack.fft(signal)
Frequency_Component = fftpack.fftfreq(len(signal)) * Sampling_Frequency

We now plot the transformed sinusoid depicting the frequencies we generated:

# plot frequency components of the signal:
fig, ax = plt.subplots()
ax.stem(Frequency_Component, np.abs(X_Hat)) # absolute value of spectrum
ax.set_xlabel ('Frequency in Hertz [HZ] Of Transformed Signal')
ax.set_ylabel ('Frequency Domain (Spectrum) Magnitude')
ax.set_xlim(-Sampling_Frequency / 2, Sampling_Frequency / 2)
ax.set_ylim(-5,110)

To note, you will see two frequency components; this is because the transform of a real-valued signal has both positive and negative frequency components (it is conjugate symmetric), which is what we see using the stem plots as expected. This is because the kernel, as mentioned before, contains both \sin(\theta) and \cos(\theta).

So something really cool happens when using the FFT. It is called the convolution theorem, as well as Dual Domain Theory. Convolution in the time domain yields multiplication in the frequency domain. Mathematically, the convolution theorem states that under suitable conditions the Fourier transform of a convolution of two functions (or signals) is the point-wise (Hadamard) product of their Fourier transforms. More generally, convolution in one domain (e.g., the time domain) equals point-wise multiplication in the other domain (e.g., the frequency domain).

Where:

    \[x(t)*h(t) = y(t)\]

    \[X(f)\,H(f) = Y(f)\]
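Since we already have fftpack imported above, a quick numerical sanity check (my own toy example) confirms the discrete version of the theorem: the DFT of a circular convolution equals the point-wise product of the individual DFTs.

import numpy as np
from scipy import fftpack

N = 64
x = np.random.rand(N)
h = np.random.rand(N)

# circular convolution computed the slow way, directly in the time domain
y = np.zeros(N)
for k in range(N):
    for j in range(N):
        y[k] += x[j] * h[(k - j) % N]

# convolution theorem: transform of the convolution == point-wise product of transforms
print(np.allclose(fftpack.fft(y), fftpack.fft(x) * fftpack.fft(h)))  # True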

So there you have it. A little taster on the powerful Fourier Transform.

Until Then,

#iwishyouwater <- Cloudbreak this past year

Muzak To Blarg by: Voyager Essential Max Richter. Phenomenal. November is truly staggering.

References:

[1] The Fourier Transform and Its Applications by Dr Ronald N Bracewell. i had the honor of taking the actual class at Stanford University from Professor Bracewell.

[2] The Fast Fourier Transform and Its Applications by E. Oran Brigham. Great book on butterfly and overlap-add derivations thereof.

[3] Adaptive Digital Signal Processing by Dr. Claude Lindquist. A phenomenal book on frequency domain signal processing and kernel analysis. A book ahead of its time. Professor Lindquist was a mentor and had a direct effect and affect on my career and the way i approach information theory.

Snake_Byte:[14] Coding In Philosophical Frameworks

DALL-E Generated Philosopher

Your vision will only become clear when you can look into your heart. Who looks outside, dreams; who looks inside, awakes. Knowing your own darkness is the best method for dealing with the darknesses of other people. We cannot change anything until we accept it.

~ C. Jung

(Caveat Emptor: This blog is rather long in the snake's tooth and actually more like a CHOMP instead of a BYTE. tl;dr)

First, Oh Dear Reader, i trust everyone is safe. Second, it sure feels like we are living in an age of Deus Ex Machina, doesn't it? Third, with this in mind i wanted to write a Snake_Byte that i have been "thoughting" about for quite some time but never really knew how to approach, if truth be told. i can't take full credit for this ideation nor do i actually want to claim any ideation. Jay Sales and i were talking for a long time after i believe i gave a presentation on creating Belief Systems using BeliefNetworks or some such nonsense.

The net of the discussion was we both believed that in the future we will code in philosophical frameworks.

Maybe we are here?

So how would one go about coding an agent-based distributed system that allowed one to create an agent or a piece of evolutionary code to exhibit said behaviors of a philosophical framework?

Well we must first attempt to define a philosophy and ensconce it into a quantized explanation.

Stoicism seemed to me at least the best first mover here as it appeared to be the tersest by definition.

So first, for those not familiar with said philosophy, Marcus Aurelius was probably the most famous practitioner of Stoicism. i have put some references that i have read at the end of this blog [1].

Stoicism is a philosophical school that emphasizes rationality, self-control, and inner peace in the face of adversity. In thinking about this, i figured that to build an agent-based software system that embodies Stoicism, we would need to consider several key aspects of this philosophy.

  • Stoics believe in living in accordance with nature and the natural order of things. This could be represented in an agent-based system through a set of rules or constraints that guide the behavior of the agents, encouraging them to act in a way that is in harmony with their environment and circumstances.
  • Stoics believe in the importance of self-control and emotional regulation. This could be represented in an agent-based system through the use of decision-making algorithms that take into account the agent’s emotional state and prioritize rational, level-headed responses to stimuli.
  • Stoics believe in the concept of the “inner citadel,” or the idea that the mind is the only thing we truly have control over. This could be represented in an agent-based system through a focus on internal states and self-reflection, encouraging agents to take responsibility for their own thoughts and feelings and strive to cultivate a sense of inner calm and balance.
  • Stoics believe in the importance of living a virtuous life and acting with moral purpose. This could be represented in an agent-based system through the use of reward structures and incentives that encourage agents to act in accordance with Stoic values such as courage, wisdom, and justice.

So given a definition of Stoicism, we then need to create a quantized or discrete model of those behaviors that encompass a "Stoic Individual". i figured we could use the evolutionary library called DEAP (Distributed Evolutionary Algorithms in Python). DEAP contains genetic algorithm and genetic programming utilities as well as evolutionary strategy methods for this type of programming.

Genetic algorithms and genetic programming are both techniques used in artificial intelligence and optimization, but they have some key differences.

This is important as people confuse the two.

Genetic algorithms are a type of optimization algorithm that use principles of natural selection to find the best solution to a problem. In a genetic algorithm, a population of potential solutions is generated and then evaluated based on their fitness. The fittest solutions are then selected for reproduction, and their genetic information is combined to create new offspring solutions. This process of selection and reproduction continues until a satisfactory solution is found.

On the other hand, genetic programming is a form of machine learning that involves the use of genetic algorithms to automatically create computer programs. Instead of searching for a single solution to a problem, genetic programming evolves a population of computer programs, which are represented as strings of code. The programs are evaluated based on their ability to solve a specific task, and the most successful programs are selected for reproduction, combining their genetic material to create new programs. This process continues until a program is evolved that solves the problem to a satisfactory level.

So the key difference between genetic algorithms and genetic programming is that genetic algorithms search for a solution to a specific problem, while genetic programming searches for a computer program that can solve the problem. Genetic programming is therefore a more general approach, as it can be used to solve a wide range of problems, but it can also be more computationally intensive due to the complexity of evolving computer programs [2].
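For contrast with the flat list-of-attributes genetic algorithm we will use below, here is a minimal sketch of what a genetic programming individual looks like in DEAP — a program tree built from a primitive set. This is illustrative only (my own toy primitives) and is not wired into the Stoic example:

import operator
from deap import gp

# a primitive set with one input argument and a few arithmetic building blocks
pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(operator.neg, 1)
pset.addTerminal(1.0)
pset.renameArguments(ARG0="x")

# generate a random full tree of depth 1 to 3 and compile it into a callable
expr = gp.genFull(pset, min_=1, max_=3)
tree = gp.PrimitiveTree(expr)
print(str(tree))             # e.g. mul(add(x, 1.0), neg(x))
func = gp.compile(tree, pset)
print(func(2.0))             # evaluate the evolved program at x = 2.0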

So returning back to the main() function as it were, we need to create a genetic program that models Stoic behavior using the DEAP library.

First we need to define the problem and the relevant fitness function. This is where the quantized part comes into play. Since Stoic behavior involves a combination of rationality, self-control, and moral purpose, we could define a fitness function that measures an individual's ability to balance these traits and act in accordance with Stoic values.

So lets get to the code.

To create a genetic program that models Stoic behavior using the DEAP library in a Jupyter Notebook, we first need to install the DEAP library. We can do this by running the following command in a code cell:

pip install deap

Next, we can import the necessary modules and functions:

import random
import operator
import numpy as np
from deap import algorithms, base, creator, tools

We can then define the problem and the relevant fitness function as described above.

Here’s an example of how we might define a “fitness function” for this problem:

# Define the fitness function.  NOTE: i am open to other ways of defining this and other models.
# The definition of what constitutes a behavior needs to be quantized or discretized, and
# trying to do that yields a lossy function most times. It is also self-referential.

def fitness_function(individual):
    # Calculate the fitness based on how closely the individual's behavior matches stoic principles
    fitness = 0
    # Add points for self-control, rationality, focus, resilience, and adaptability can haz Stoic?
    fitness += individual[0]  # self-control
    fitness += individual[1]  # rationality
    fitness += individual[2]  # focus
    fitness += individual[3]  # resilience
    fitness += individual[4]  # adaptability
    return fitness,

# Define the genetic programming problem
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

# Initialize the genetic algorithm toolbox
toolbox = base.Toolbox()

# Define the genetic operators
toolbox.register("attribute", random.uniform, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attribute, n=5)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", fitness_function)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=0.1, indpb=0.1)
toolbox.register("select", tools.selTournament, tournsize=3)

# Run the genetic algorithm
population = toolbox.population(n=10)
for generation in range(20):
    offspring = algorithms.varAnd(population, toolbox, cxpb=0.5, mutpb=0.1)
    fits = toolbox.map(toolbox.evaluate, offspring)
    for fit, ind in zip(fits, offspring):
        ind.fitness.values = fit
    population = toolbox.select(offspring, k=len(population))
    
# Print the best individual found
best_individual = tools.selBest(population, k=1)[0]

print ("Best Individual:", best_individual)
 

Here, we define the genetic programming parameters (i.e., the traits that we're optimizing for) using the toolbox.register function. We also define the evaluation function (fitness_function), genetic operators (mate and mutate), and selection operator (select) using DEAP's built-in functions.

We then define the fitness function that the genetic algorithm will optimize. This function takes an “individual” (represented as a list of five attributes) as input, and calculates the fitness based on how closely the individual’s behavior matches stoic principles.

We then define the genetic programming problem via the quantized attributes, and initialize the genetic algorithm toolbox with the necessary genetic operators.

Finally, we run the genetic algorithm for 20 generations, and print the best individual found. The selBest function is used to select the top individual fitness agent or a “behavior” if you will for that generation based on the iterations or epochs. This individual represents an agent that mimics the philosophy of stoicism in software, with behavior that is self-controlled, rational, focused, resilient, and adaptable.

Best Individual: [0.8150247518866958, 0.9678037028949047, 0.8844195735244268, 0.3970642186025506, 1.2091810770505023]

This denotes the best individual with those best balanced attributes, or in this case the Most Stoic.

As i noted, this is a first attempt at this problem; i think there is a better way with a full GP solution as well as a tunable fitness function. In a larger distributed system you would then use this agent as a framework amongst other agents you would define.
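For instance, one way to make the fitness tunable — a sketch of my own, with made-up weights that are in no way canonical Stoic values — is to weight each trait and clamp the attributes back into [0, 1], so Gaussian mutation cannot push a trait above 1 the way the adaptability score drifted above:

# hypothetical weights for self-control, rationality, focus, resilience, adaptability
TRAIT_WEIGHTS = [0.3, 0.3, 0.1, 0.2, 0.1]

def weighted_fitness(individual):
    # clamp each trait into [0, 1] so mutation cannot push it out of range
    individual[:] = [min(max(trait, 0.0), 1.0) for trait in individual]
    return sum(w * t for w, t in zip(TRAIT_WEIGHTS, individual)),

# assumes the toolbox defined in the listing above; swap in the new evaluator
toolbox.register("evaluate", weighted_fitness)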

i at least got this out of my head.

until then,

#iwishyouwater <- Alexey Molchanov and Dan Bilzerian at Deep Dive Dubai

Muzak To Blog By: Phil Lynott "The Philip Lynott Album". If you don't know who this is, there is a statue of him in Ireland that i walked a long way with my co-founder, Lisa Maki, a long time ago to pay homage to the great Irish singer of the amazing band Thin Lizzy. Alas, they took Phil to be cleaned that day. At least we got to walk and talk and i'll never forget that day. This is one of his solo efforts and i believe he is one of the best artists of all time. The first track is deeply emotional.

References:

[1] A list of books on Stoicism -> click HERE.

[2] Genetic Programming (On the Programming of Computers by Means of Natural Selection), by Professor John R. Koza. There are multiple volumes, i think four, and i have all of them, but this is a great place to start along with the DEAP documentation. Just optimizing a transcendental function and seeing what GP comes out with using basic arithmetic is mind blowing.

Snake_Byte:[13] The Describe Function.

DALLE-2 Draws Describe

First i trust everyone is safe. Second i hope people are recovering somewhat from the SVB situation. We are at the end of an era, cycle or epoch; take your pick. Third i felt like picking a Python function that was simple in nature but very helpful.

The function is pandas.DataFrame.describe(). i've previously written about other introspection libraries like DABL; however, this one is rather simple and in place. Actually i had never utilized it before. i was working on some other code as a hobby in the areas of transfer learning and was playing around with some data and decided to use the breast cancer data from the sklearn library, which is much like the iris data used for canonical modeling and comparison. Most machine learning is data cleansing and feature selection, so let's start with something we know.

Breast cancer is the second most common cancer in women worldwide, with an estimated 2.3 million new cases in 2020. Early detection is key to improving survival rates, and machine learning algorithms can aid in diagnosing and treating breast cancer. In this blog, we will explore how to load and analyze the breast cancer dataset using the scikit-learn library in Python.

The breast cancer dataset is included in scikit-learn's datasets module, which contains a variety of well-known datasets for machine learning. The features describe the characteristics of the cell nuclei present in the image. We can load the dataset using the load_breast_cancer function, which returns a dictionary-like object containing the data and metadata about the dataset.

It has been surmised that machine learning is mostly data exploration and data cleaning.

from sklearn.datasets import load_breast_cancer
import pandas as pd

#Load the breast cancer dataset
data = load_breast_cancer()

The data object returned by load_breast_cancer contains the feature data and the target variable. The feature data contains measurements of 30 different features, such as radius, texture, and symmetry, extracted from digitized images of fine needle aspirate (FNA) of breast mass. The target variable is binary, with a value of 0 indicating a malignant tumor and a value of 1 indicating a benign tumor.

We can convert the feature data and target variable into a pandas dataframe using the DataFrame constructor from the pandas library. We also add a column to the dataframe containing the target variable.

#Convert the data to a pandas dataframe
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)

Finally, we can use the describe method of the pandas dataframe to get a summary of the dataset. The describe method returns a table containing the count, mean, standard deviation, minimum, quartile, and maximum values for each feature, as well as for the target variable.

#Use the describe() method to get a summary of the dataset
print(df.describe())

The output of the describe method is as follows:

mean radius  mean texture  ...  worst symmetry      target
count   569.000000    569.000000  ...      569.000000  569.000000
mean     14.127292     19.289649  ...        0.290076    0.627417
std       3.524049      4.301036  ...        0.061867    0.483918
min       6.981000      9.710000  ...        0.156500    0.000000
25%      11.700000     16.170000  ...        0.250400    0.000000
50%      13.370000     18.840000  ...        0.282200    1.000000
75%      15.780000     21.800000  ...        0.317900    1.000000
max      28.110000     39.280000  ...        0.663800    1.000000

[8 rows x 31 columns]

From the summary statistics, we can see that the ranges of the features vary widely, with the mean radius ranging from 6.981 to 28.11 and the mean texture ranging from 9.71 to 39.28. We can also see that the target variable is roughly balanced, with about 62.7% of the tumors in the dataset being benign (target = 1).
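If you want to confirm that split directly rather than reading it off the mean of the target column, a one-liner on the dataframe above does it (a small addition of mine, plain pandas):

# class proportions: 1 = benign, 0 = malignant
print(df['target'].value_counts(normalize=True))  # roughly 0.63 benign vs 0.37 malignant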

Pretty nice utility.

Then again in looking at this data one would think we could get to first principles engineering and root causes and make it go away? This directly affects motherhood which i still believe is the hardest job in humanity. Makes you wonder where all the money goes?

Until then,

#iwishyouwater <- Free Diver Steph who is also a mom hunting pelagics on #onebreath

Muzak To Blog By: Peter Gabriel's "Peter Gabriel 3: Melt" (remastered). He is coming out with a new album. Games Without Frontiers and Intruder are timeless. i applied long ago to work at Real World Studios and received the nicest rejection letter.

Snake_Byte[12]: Dabl A High-Level Data Analysis Library in Python

Not To Be Confused With The Game

It enables us to dabble in vicarious vice and to sit in smug judgment on the result.

Online Quote Generator

First, i hope everyone is safe. Second, i haven't written a Snake_Byte [ ] in quite some time, so here goes. This is a library i ran across late last night, and for what it achieves, even just for data exploration, it is well worth the pip install dabl cost of it all.

Data analysis is an essential task in the field of machine learning and artificial intelligence. However, it can be a challenging and time-consuming task, especially for those who are not familiar with programming. That’s where the dabl library comes into play.

dabl, short for Data Analysis Baseline Library, is a high-level data analysis library in python, designed to make data analysis as easy and effortless as possible. It is an open-source library, developed and maintained by the scikit-learn community.

The library provides a collection of simple and intuitive functions for exploring, cleaning, transforming, and visualizing data. With dabl, users can perform data analysis tasks such as classification, regression, and general data exploration with just a few lines of code.

One of the main benefits of dabl is that it helps users get started quickly by providing a set of default actions for each task. For example, to perform a regression analysis, users can simply hand their data to the SimpleRegressor and dabl will take care of the rest.

Another advantage of dabl is that it provides easy-to-understand visualizations of the results, allowing users to quickly understand the results of their analysis and make informed decisions based on the data. This is particularly useful for non-technical users who may not be familiar with complex mathematical models or graphs.

dabl also integrates well with other popular data analysis libraries such as pandas, numpy, and matplotlib, making it a convenient tool for those already familiar with these libraries.

So let us jump into the code shall we?

This code uses the dabl library to analyze the Titanic dataset. The dataset is loaded using the pandas library and cleaned with dabl.clean, dabl.detect_types reports the column types it inferred, and dabl.plot visualizes the features against the survived target. Finally, a dabl.SimpleClassifier is fit to the cleaned data, and dabl reports the performance of the candidate models it tries.

import dabl
import pandas as pd
import matplotlib.pyplot as plt

# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
#check shape columns etc
titanic.shape
titanic.head
#all that is good tons of stuff going on here but now let us ask dabl whats up:
titanic_clean = dabl.clean(titanic, verbose=1)

#a cool call to detect types
types = dabl.detect_types(titanic_clean)
print (types)
#lets do some eye candy
dabl.plot(titanic, 'survived')
#lets check the distribution
plt.show()
#let us try a simple classifier if it works it works
# Perform classification analysis
fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)                     

Ok so lets break this down a little.

We load the data set: (make sure the target directory is the same)

# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))

Of note, we loaded this into a pandas dataframe. Assuming we can use python and load a comma-separated values file, let's now do some exploration:

#check shape columns etc
titanic.shape
titanic.head

You should see the following:

(1309, 14) 

Which is [1309 rows x 14 columns]

and then:

pclass  survived                                             name  \
0          1         1                    Allen, Miss. Elisabeth Walton   
1          1         1                   Allison, Master. Hudson Trevor   
2          1         0                     Allison, Miss. Helen Loraine   
3          1         0             Allison, Mr. Hudson Joshua Creighton   
4          1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
...      ...       ...                                              ...   
1304       3         0                             Zabour, Miss. Hileni   
1305       3         0                            Zabour, Miss. Thamine   
1306       3         0                        Zakarian, Mr. Mapriededer   
1307       3         0                              Zakarian, Mr. Ortin   
1308       3         0                               Zimmerman, Mr. Leo   

         sex     age  sibsp  parch  ticket      fare    cabin embarked boat  \
0     female      29      0      0   24160  211.3375       B5        S    2   
1       male  0.9167      1      2  113781    151.55  C22 C26        S   11   
2     female       2      1      2  113781    151.55  C22 C26        S    ?   
3       male      30      1      2  113781    151.55  C22 C26        S    ?   
4     female      25      1      2  113781    151.55  C22 C26        S    ?   
...      ...     ...    ...    ...     ...       ...      ...      ...  ...   
1304  female    14.5      1      0    2665   14.4542        ?        C    ?   
1305  female       ?      1      0    2665   14.4542        ?        C    ?   
1306    male    26.5      0      0    2656     7.225        ?        C    ?   
1307    male      27      0      0    2670     7.225        ?        C    ?   
1308    male      29      0      0  315082     7.875        ?        S    ?   

     body                        home.dest  
0       ?                     St Louis, MO  
1       ?  Montreal, PQ / Chesterville, ON  
2       ?  Montreal, PQ / Chesterville, ON  
3     135  Montreal, PQ / Chesterville, ON  
4       ?  Montreal, PQ / Chesterville, ON  
...   ...                              ...  
1304  328                                ?  
1305    ?                                ?  
1306  304                                ?  
1307    ?                                ?  
1308    ?                                ?  

Wow, tons of stuff going on here and really this is cool data from an awful disaster. Ok, let's let dabl exercise some muscle here and ask it to clean it up a bit:

titanic_clean = dabl.clean(titanic, verbose=1)
types = dabl.detect_types(titanic_clean)
print (types)

i set verbose=1 in this case, and dabl.detect_types() shows the types detected, which i found helpful:

Detected feature types:
continuous      0
dirty_float     3
low_card_int    2
categorical     5
date            0
free_string     4
useless         0
dtype: int64

However, look what dabl did for us:

                      continuous  dirty_float  low_card_int  categorical  \
pclass                     False        False         False         True   
survived                   False        False         False         True   
name                       False        False         False        False   
sex                        False        False         False         True   
sibsp                      False        False          True        False   
parch                      False        False          True        False   
ticket                     False        False         False        False   
cabin                      False        False         False        False   
embarked                   False        False         False         True   
boat                       False        False         False         True   
home.dest                  False        False         False        False   
age_?                      False        False         False         True   
age_dabl_continuous         True        False         False        False   
fare_?                     False        False         False        False   
fare_dabl_continuous        True        False         False        False   
body_?                     False        False         False         True   
body_dabl_continuous        True        False         False        False   

                       date  free_string  useless  
pclass                False        False    False  
survived              False        False    False  
name                  False         True    False  
sex                   False        False    False  
sibsp                 False        False    False  
parch                 False        False    False  
ticket                False         True    False  
cabin                 False         True    False  
embarked              False        False    False  
boat                  False        False    False  
home.dest             False         True    False  
age_?                 False        False    False  
age_dabl_continuous   False        False    False  
fare_?                False        False     True  
fare_dabl_continuous  False        False    False  
body_?                False        False    False  
body_dabl_continuous  False        False    False 
Target looks like classification
Linear Discriminant Analysis training set score: 0.578
 

Ah sweet! So data science, machine learning or data mining is 80% cleaning up the data. Take what you can get and go with it, folks. dabl even informs us that the target looks like a classification problem. As the name suggests, classification means classifying the data on some grounds. It is a type of supervised learning. In classification, the target column should be a categorical column. If the target has only two categories, like the one in the dataset above (survived / did not survive), it's called a binary classification problem. When there are more than 2 categories, it's a multi-class classification problem. The "target" column is also called the "class" in a classification problem.

Now let's do some analysis. Yep, we are just getting to some statistics. There are univariate and bivariate statistics in play here.

Bivariate analysis is the simultaneous analysis of two variables. It explores the concept of the relationship between two variables: whether there exists an association and the strength of this association, or whether there are differences between the two variables and the significance of these differences.

The main three types we will see here are:

  1. Categorical v/s Numerical 
  2. Numerical V/s Numerical
  3. Categorical V/s Categorical data

Also of note Linear Discriminant Analysis or LDA is a dimensionality reduction technique. It is used as a pre-processing step in machine learning. The goal of LDA is to project the features in higher dimensional space onto a lower-dimensional space in order to avoid the curse of dimensionality and also reduce resources and dimensional costs. The original technique was developed in the year 1936 by Ronald A. Fisher and was named Linear Discriminant or Fisher’s Discriminant Analysis. 

(NOTE: there is another LDA, Latent Dirichlet Allocation, which is used in semantic engineering and is quite different.)

dabl.plot(titanic, 'survived')

The following plots happen auto-magically: continuous feature pair plots for the discriminant analysis.

Continuous Feature PairPlots

In the plots you will also see PCA (Principal Component Analysis). PCA was invented in 1901 by Karl Pearson, as an analog of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, and proper orthogonal decomposition (POD) in mechanical engineering. PCA is used extensively in many fields, and my first usage of it was in 1993 for three-dimensional rendering of sound.

Discriminating PCA Directions

What is old is new again.

The main difference is that Linear Discriminant Analysis is a supervised dimensionality reduction technique that also achieves classification of the data simultaneously. LDA focuses on finding a feature subspace that maximizes the separability between the groups. Principal Component Analysis, on the other hand, is an unsupervised dimensionality reduction technique: it ignores the class label and focuses on capturing the direction of maximum variation in the data set.

LDA

Both reduce the dimensionality of the dataset and make it more computationally resourceful. LDA and PCA both form a new set of components.
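To see that difference in code rather than prose, here is a minimal sketch of mine — it uses scikit-learn and its iris data, not the Titanic frame or the dabl output above — projecting the same data with both techniques:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, ignores the class label, keeps directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, uses y to find directions that best separate the classes
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)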

The last plot is categorical versus target.

So now let's try, as dabl suggested, a SimpleClassifier and fit the model to the data. (hey, some machine learning!)

fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y) 

This should produce the following outputs with accuracy metrics:

Running DummyClassifier(random_state=0)
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382
=== new best DummyClassifier(random_state=0) (using recall_macro):
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382

Running GaussianNB()
accuracy: 0.970 average_precision: 0.975 roc_auc: 0.984 recall_macro: 0.964 f1_macro: 0.968
=== new best GaussianNB() (using recall_macro):
accuracy: 0.970 average_precision: 0.975 roc_auc: 0.984 recall_macro: 0.964 f1_macro: 0.968

Running MultinomialNB()
accuracy: 0.964 average_precision: 0.988 roc_auc: 0.990 recall_macro: 0.956 f1_macro: 0.961
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
=== new best DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0) (using recall_macro):
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974

Running DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
accuracy: 0.969 average_precision: 0.965 roc_auc: 0.983 recall_macro: 0.965 f1_macro: 0.967
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01,
                       random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000,
                   random_state=0)
accuracy: 0.974 average_precision: 0.991 roc_auc: 0.993 recall_macro: 0.970 f1_macro: 0.972
Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.975 average_precision: 0.991 roc_auc: 0.994 recall_macro: 0.971 f1_macro: 0.973

Best model:
DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
Best Scores:
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974

This actually calls the sklearn routines in aggregate. Looks like a depth-1 decision tree wins on recall_macro, with our old friend logistic regression right behind it. Keep it simple, Sam, it ain't gotta be complicated.

In conclusion, dabl is a highly recommended library for those looking to simplify their data analysis tasks. With its intuitive functions and visualizations, it provides a quick and easy way to perform data analysis, making it an ideal tool for both technical and non-technical users. Again, the real strength of dabl is in providing simple interfaces for data exploration. For more information:

dabl github. <- click here

Until Then,

#iwishyouwater <- hold your breath on a dive with my comrade at arms @corepaddleboards. great video and the clarity was astounding.

Muzak To Blog By: “Ballads For Two”, Chet Baker and Wolfgang Lackerschmid, trumpet meet vibraphone sparsity. The space between the note is where all of the action lives.

Snake_Byte[11] Linear Algebra, Matrices and Products – Oh My!

Algebra is the metaphysics of arithmetic.

~ John Ray
Looks Hard.

First, as always, i hope everyone is safe. Second, as i mentioned in my last Snake_Byte [], let us do something a little more technical and scientific. For context, the catalyst for this was a surprising discussion that came from how current machine learning interviews are being conducted and how the basics of the distance between two vectors have been overlooked. So this is a basic example, and in the following Snake_Byte [] i promise to get into something a little more, say, carnivore.

With that let us move to some linear algebra. For those that don’t know what linear algebra is, i will refer you to the best book on the subject, Professor Gilbert Strang’s Linear Algebra and its Applications.

i am biased here; however, i do believe the two most important areas of machine learning and data science are linear algebra and probability, with optimization techniques coming in a close third.

So dear reader, please bear with me here. We will review a little math; maybe for some, this will be new, and for those that already know this, you can rest your glass-balls.

We let x\in\mathbb{R}^N denote an N-dimensional vector taking real numbers as its entries. For example:

\begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}

where \{a_i\} are the entries indexed by i. In this case N = 3, i.e., x\in\mathbb{R}^3.

An M-by-N matrix is denoted as X\in\mathbb{R}^{M\times N}. The transpose of a matrix is denoted as X^T. A matrix X can be viewed according to its columns and its rows:

\begin{bmatrix}  0 & 1 & 2 \\ 3 & 4 & 5\\ 6 & 7 & 8 \\ \end{bmatrix}

where \{a_{ij}\} are the entries indexed by row i and column j.

An array is a data structure in python programming that holds a fixed number of elements, and these elements should be of the same data type. The main idea behind an array is storing multiple elements of the same type. Most data structures make use of an array to implement their algorithms. There are two important parts of the array:

  • Element: Each item stored in the array is called an element.
  • Index: Every element in the array has its own numerical value to identify the element.

Think of programming a loop, tuple, list, array, range or matrix:

from math import exp
x, y = 1.0, 2.0        # placeholder values so the containers below are concrete
x1, x2, x3 = 1, 2, 3   # more placeholders
v1 = [x, y]            # list of variables
v2 = (-1, 2)           # tuple of numbers
v3 = (x1, x2, x3)      # tuple of variables

v4 = [exp(-i*0.1) for i in range(150)] #ye ole range loop

and check this out for a matrix:

import numpy as np
a = np.matrix('0 1; 2 3')
print (a)
output: [[0 1]
 [2 3]]

which folks is why we like the Snake Language. Really that is about it for vectors and matrices. The theory is where you get into proofs and derivations which can save you a ton of time on optimizations.

So now let’s double click on some things that will make you sound cool at the parties or meetups.

A vector can be multiplied by a number. This number a is usually denoted as a scalar:

a\cdot (v_1,v_2) = (av_1,av_2)

Now given this, one of the most fundamental operations in all of machine learning is the inner product, also called the dot product or scalar product, of two vectors, which is a number. Almost all machine learning algorithms have some form of a dot product somewhere within the depths of all the mathz. Nvidia GPUs are optimized for (you guessed it) dot products.

So how do we set this up? Multiplication of a scalar a and a vector (v_0,\dots,v_{n-1}) yields:

(av_0,\dots,av_{n-1})

Ok good so far.

The inner or dot product of two n-vectors is defined as:

(u_0,\dots,u_{n-1})\cdot(v_0,\dots,v_{n-1}) = u_0v_0 + \cdots + u_{n-1}v_{n-1}

which, if you are paying attention yields:

\begin{equation*}\mathbf{u}\cdot\mathbf{v} = \sum_{j=0}^{n-1}{u_jv_j}\end{equation*}

Geometrically, the dot product of U and V equals the length of U times the length of V times the cosine of the angle between them:

\textbf{U}\cdot\textbf{V}=|\textbf{U}||\textbf{V}|\cos\theta

ok so big deal huh? yea, but check this out in the Snake_Language:

# dot product of two vectors
 
# Importing numpy module
import numpy as np
 
# Taking two scalar values
a = 5
b = 7
 
# Calculating dot product using dot()
print(np.dot(a, b))
output: 35

hey now!

# Importing numpy module
import numpy as np
 
# Taking two 2D array
# For 2-D arrays it is the matrix product
a = [[2, 1], [0, 3]]
b = [[1, 1], [3, 2]]
 
# Calculating dot product using dot()
print(np.dot(a, b))
output:[[5 4]
       [9 6]]

Mathematically speaking, the inner product is a generalization of the dot product. As we said, constructing a vector is done using the command np.array. Inside this command, one needs to enter the array. For a column vector, we write [[1],[2],[3]], with an outer [] and three inner []'s, one for each entry. If the vector is a row vector, then one can omit the inner []'s by just calling np.array([1, 2, 3]).

Given two column vectors x and y, the inner product is computed via np.dot(x.T,y), where np.dot is the command for inner product, and x.T returns the transpose of x. One can also call np.transpose(x), which is the same as x.T.

# Python code to perform an inner product with transposition
import numpy as np
x = np.array([[1],[0],[-1]])
y = np.array([[3],[2],[0]])
z = np.dot(np.transpose(x),y)
print (z)  # [[3]]


Yes, now dear reader, you can impress your friends with your linear algebra and python prowess.

Note: In this case the raw dot product is scale dependent; for actual purposes of real computation you usually normalize by something called the norm of a vector, and a quick sketch of that is below. i won't go into the deeper mechanics of this unless asked for further explanations on the mechanics of linear algebra. i will gladly go into pythonic examples if so asked and will be happy to write about said subject. Feel free to inquire in the comments below.
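A minimal example of mine with toy numbers — dividing the dot product by the vector norms gives the scale-independent cosine of the angle between them:

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([10.0, 20.0, 30.0])   # same direction, ten times the scale

print(np.dot(u, v))                # 140.0 -- the raw dot product grows with the scale
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cosine)                      # 1.0 (up to floating point) -- same direction regardless of scale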

Until Then,

#iwishyouwater <- Nathan Florence with Kelly Slater at the Box. Watch.

tctjr.

Muzak to Blog By: INXS. i had forgotten how good of a band they were and how deep the catalog is. Michael Hutchence, the lead singer, hanged himself in a hotel room. Check out the songs "By My Side", "Don't Change", "Never Tear Us Apart" and "To Look At You". They weren't afraid to take production chances.

Note[2]: i resurrected some very old content from a previous site i owned and imported the older blogs. Some hilarious. Some sad. Some infuriating. i'm shining them up. Feel free to look back in time.

Snake_Byte[10] – Module Packages

Complexity control is the central problem of writing software in the real world.

Eric S. Raymond
AI-Generated Software Architecture Diagram

Hello dear readers! First i hope everyone is safe. Secondly, it is the monday-iest WEDNESDAY ever! Ergo it's time for a Snake_Byte!

Grabbing a tome off the bookshelf, we randomly open it, and the subject matter today is Module Packages. So there will not be much if any code, but more discussion as it were on the explanations thereof.

Module imports are the mainstay of the snake language.

A Python module is a file that has a .py extension, and a Python package is any folder that has modules inside it (or, if you're still in Python 2, a folder that contains an __init__.py file).

What happens when you have code in one module that needs to access code in another module or package? You import it!

In python a directory is said to be a package, thus these imports are known as package imports. What happens in an import is that the dotted name is mapped onto a directory path on your local machine (your come-pooter) or that cloud thing everyone talks about these days.

It turns out that this hierarchy simplifies the search path complexities that come with organizing files and trends toward simplifying search path settings.

Absolute imports are preferred because they are direct. It is easy to tell exactly where the imported resource is located and what it is just by looking at the statement. Additionally, absolute imports remain valid even if the current location of the import statement changes. In addition, PEP 8 explicitly recommends absolute imports. However, sometimes they get so complicated you want to use relative imports.

So how do imports work?

import dir1.dir2.mod
from dir1.dir2.mod import x

Note the "dotted path" in these statements is assumed to correspond to a path through the directories on the machine you are developing on. In this case it leads to mod.py. So here directory dir1 contains subdirectory dir2, which contains the module mod.py. Historically the dotted path syntax was created for platform neutrality, and from a technical standpoint paths in import statements become object paths.

In general the leftmost name in the dotted path is located on the module search path (unless it is a top-level file in the home directory), and the rest of the path leads to exactly where the file resides.

In Python 3.x package imports changed slightly, and the changes only apply to imports within files located in package directories. The changes include:

  • The module import search path semantics were modified to skip the package's own directory by default; these plain imports are essentially absolute imports.
  • The syntax of from statements was extended to allow them to explicitly request that imports search the package's directory only. This is the relative import mentioned above.

so for instance:

from . import spam  # relative to this package

Instructs Python to import a module named spam located in the same package directory as the file in which this statement appears.

Similarly:

from .spam import name

states: from a module named spam located in the same package as the file that contains this statement, import the variable name.

Something to remember is that an import without a leading dot always causes Python to skip the relative components of the module import search path and look instead in the absolute directories that sys.path contains. You can only force the dot nomenclature, i.e. relative imports, with the from statement.
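To make that concrete, here is a hypothetical package layout (the names pkg, spam, and eggs are mine, purely illustrative) and what the relative and absolute forms look like from inside it:

# hypothetical layout:
#
#   pkg/
#       __init__.py
#       spam.py      # defines a variable called name
#       eggs.py      # the file shown below
#
# inside pkg/eggs.py:

from . import spam         # relative: searches only this package's directory
from .spam import name     # relative: pulls one attribute from a sibling module
import pkg.spam            # absolute: searched via sys.path, PEP 8's preference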

Packages are standard now in Python 3.x. It is now very common to see very large third-party extensions deployed as sets of package directories rather than flat lists of modules. Also, caveat emptor: importing only what you need rather than AllTheThings can save memory. Read the documentation. Many times importing AllTheThings results in major memory usage, an issue when you are going to production with highly optimized python.

There is much more to this import stuff. Transitive Module Reloads, Managing other programs with Modules (meta-programming), Data Hiding etc. i urge you to go into the LazyWebTM and poke around.

in addition a very timely post:

PyPI is running a survey on packages:

Take the survey here -> PyPI Survey on Packages

Here some great comments and suggestions via Y-Combinator News:

Y-Combinator News Commentary on PyPI Packages.

That is all for now. i think next time we are going to delve into some more scientific or mathematical snake language bytes.

Until Then,

#iwishyouwater <- Wedge top 50 wipeouts. Smoookifications!

@tctjr

MUZAK TO BLOG BY: NIN – "The Downward Spiral (Deluxe Edition)". A truly phenomenal piece of work. NIN's second album; Trent Reznor told Jimmy Iovine upon delivering the concept album, "I'm Sorry I had to…". In 1992, Reznor moved to 10050 Cielo Drive in Benedict Canyon, Los Angeles, where actress Sharon Tate formerly lived and where he made the record. i believe it changed the entire concept of music and created a new genre. From an engineering point of view, Digidesign's TurboSynth and Pro Tools were used extensively.

Snake_Byte[9] XKCD PLOTS

An algorithm must be seen to be believed.

~ D. Knuth

First i trust everyone is safe. Second it's WEDNESDAY so we got us a Snake_Byte! Today i wanted to keep this simple, fun and return to a set of fun methods that are included in the de facto standard for plotting in python, which is Matplotlib. The method(s) are called XKCD style plotting via plt.xkcd().

If you don't know what this is referencing, it is xkcd, sometimes styled XKCD, which is a webcomic created in 2005 by American author Randall Munroe. The comic's tagline describes it as "a webcomic of romance, sarcasm, math, and language". Munroe states on the comic's website that the name of the comic is not an initialism but "just a word with no phonetic pronunciation". i personally have read it since its inception in 2005. The creativity is astounding.

Which brings us to the current Snake_Byte. If you want to have some fun and creativity in your marketechure[1] and spend fewer hours on those power points bust out some plt.xkcd() style plots!

First thing is you need to install matplotlib:

pip install matplotlib

in this simple example we need numpy:

pip install numpy
import numpy as np
import matplotlib.pyplot as plt

plt.xkcd()
plt.plot(np.sin(np.linspace(0, 10)))
plt.plot(np.sin(np.linspace(10, 20)))
plt.title('Sorta Lissajous')
Sorta Lissajous

So really that is all there with all the bells and whistles that matplotlib has to offer.

The following script was based on Randall Munroe’s Stove Ownership.

(Some will get the inside industry joke.)

with plt.xkcd():
    # Based on "Stove Ownership" from XKCD by Randall Munroe
    # https://xkcd.com/418/

    fig = plt.figure()
    ax = fig.add_axes((0.1, 0.2, 0.8, 0.7))
    ax.spines.right.set_color('none')
    ax.spines.top.set_color('none')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_ylim([-30, 10])

    data = np.ones(100)
    data[70:] -= np.arange(30)

    ax.annotate(
        'THE DAY I TRIED TO CREATE \nAN INTEROPERABLE SOLUTION\nIN HEALTH IT',
        xy=(70, 1), arrowprops=dict(arrowstyle='->'), xytext=(15, -10))

    ax.plot(data)

    ax.set_xlabel('time')
    ax.set_ylabel('MY OVERALL MENTAL SANITY')
    fig.text(
        0.5, 0.05,
        '"Stove Ownership" from xkcd by Randall Munroe',
        ha='center')
Interoperability In Health IT

So dear readers there it is: an oldie but goodie and it is so flexible! Add it to your slideware or marketechure or just add it because it's cool.

Until Then,

#iwishyouwater <- Mentawis surfing paradise. At least someone is living.

Muzak To Blog By: NULL

[1] Marchitecture is a portmanteau of the words marketing and architecture. The term is applied to any form of electronic architecture perceived to have been produced purely for marketing reasons and has in many companies replaced actual software creation.

Snake_Bytes[8] Intro_To_Mito

Got a Tape Backup Bob?

Software Is The Language Of Automation

Jensen Huang

First, i trust everyone is safe. Second: Hey Now! Wednesday is already here again! What was it Willy Wonka said? "So Much Time And So Little To Do?" Or better yet "Time Is Fun When You Are Having Flies!" Snake_Byte[8] Time!

This is a serendipitous one because i stumbled onto a library that uses a library that i mentioned in my last Snake_Bytes which was pandas. It’s called MitoSheets and it auto-generates code for your data wrangling needs and also allows you to configure and graph within your Jupyter_Lab_Notebooks. i was skeptical.

So we will start at the beginning which is where most things start:

i am making the assumption you are either using a venv or conda etc. i use a venv so here are the installation steps:

pip install mitoinstaller
python -m mitoinstaller install

Note the two-step process; you need both commands to install the entire library.

Next crank up ye ole Jupyter Lab:

import mitosheet
mitosheet.sheet()

It throws up a wonky splash screen to grab your digits and email to push you information on the Pro_version i imagine.

Then you can select a file. i went with the nba.csv file from the last blog, Snake_Bytes[7] Pandas Not The Animal. Find it here:

Then lo and behold it spit out the following code:

from mitosheet import *; register_analysis("id-ydobpddcec");
    
# Imported nba.csv
import pandas as pd
nba = pd.read_csv(r'nba.csv')

register_analysis("id-ydobpddcec") is locked to the respective file.

So how easy is it to graph? Well, it was trivial. Select graph then X & Y axis:

Team Members vs Team Graph
Graph Configuration

So naturally i wanted to change the graph to purple and add some grid lines with a legend to test the export, and here was the result:

They gotcha!

As Henry Ford said, you can have any color car as long as it is black. In this case you are stuck with the above graph, which, while useful, is not going to catch anyone's eye.

Then i tried to create a pivot table and it spit out the following code:

from mitosheet import *; register_analysis("id-ydobpddcec");
    
# Imported nba.csv
import pandas as pd
nba = pd.read_csv(r'nba.csv')

# Pivoted into nba
tmp_df = nba[['Team', 'Position', 'Number']]
pivot_table = tmp_df.pivot_table(
    index=['Team'],
    columns=['Number'],
    values=['Position'],
    aggfunc={'Position': ['count']}
)
pivot_table.set_axis([flatten_column_header(col) for col in pivot_table.keys()], axis=1, inplace=True)
nba_pivot = pivot_table.reset_index()

Note the judicious use of our friend the pandas library.

Changing the datatype is easy:

from Salary to datetime, ascending
from mitosheet import *; register_analysis("id-ydobpddcec");
    
# Imported nba.csv
import pandas as pd
nba = pd.read_csv(r'nba.csv')

# Changed Salary to dtype datetime
import pandas as pd
nba['Salary'] = pd.to_datetime(nba['Salary'], unit='s', errors='coerce')

It also lets you clear the current analysis:

Modal Dialog

So i started experimenting with the filtering:

Player Weight < 180.0 lbs
from mitosheet import *; register_analysis("id-ydobpddcec");
    
# Imported nba.csv
import pandas as pd
nba = pd.read_csv(r'nba.csv')

# Filtered Weight
nba = nba[nba['Weight'] < 180]

The views for modification are on the right side of the layout of the table which is very convenient. The automatic statistics and visualizations are helpful as well:

Unique Ascending Values
Weight Frequencies < 180.0 lbs

The max, min, median, and std are very useful and thoughtful:

Rule Based Summary Statistics

The following in and of itself could be enough to pip install the library:

DataFrame Gymnastics

You can even have multiple dataframes that can be merged. Between those items and the summary stats, for those that are experienced this could be a low enough price of entry to pip install the library. For those that really don't know how to code, this allows you to copypasta code and learn some pretty basic yet very powerful immediate insights into data. Also, if you are a business analyst, a developer could get you going in no time with this library.

i don't particularly like the lockouts on the paywall for features. In today's age of open source, humans will get around that issue and just use something else, especially the experienced folks. However, what caught my attention was the formatting and immediate results with a code base that is useful elsewhere, so i think the Mito developer team is headed in the right direction. i really can see this library evolving and adding sklearn and, who knows, GitHub Copilot. Good on them.

Give it a test drive.

Until Then,

#iwishyouwater <- #OuterKnown Tahiti Pro 2022 – Best Waves

@tctjr

Muzak To Blog By: Tracks from "Joe's Garage" by Frank Zappa. "A Little Green Rosetta" is hilarious as well as a testament to Zappa's ability to put together truly astounding musicians. i love the Central Scrutinizer, and "Watermelon in Easter Hay" i believe is one of the best guitar pieces of all time. Even Zappa said it was one of his best pieces, and to this day Dweezil Zappa is the only person allowed to play it. One of my readers, when i reviewed the Zappa documentary, called the piece "intoxicating". Another exciting aspect of this album is that he used live guitar solos and dubbed them into the studio work, except for "Watermelon in Easter Hay". The other Muzak was by a band that put Atlanta on the map: Outkast. Speakerboxxx is phenomenal and Andre 3000 is an amazing musician. "Prototype" and "Pink & Blue". Wew.

Snake_Byte[7]: Pandas (Not The Animal)

Groupings Of Pandas In A Frame

DISCLAIMER: This blog was written some time ago. Software breaks once in a while and there was a ghost in my LazyWebTM machine. We are back to our regularly scheduled program. Read on Dear Reader, and your humble narrator apologizes.

The other day i was talking to someone about file manipulations and whatnot and mentioned how awesome Pandas and the magic of df.DoWhatEverYaWant(my_data_object) via a dataframe was, and they weren’t really familiar with Pandas. So, being that no one knows everything, i figured i would write a Snake_Byte[] about Pandas. i believe i met the author of pandas, Wes McKinney, at a PyData conference years ago at Facebook. Really nice human who has created one of the most used libraries for data wrangling.

One of the most nagging issues with machine learning, in general, is access to high-integrity canonical training sets, or even just high-integrity data sets writ large.

By my estimate over the years having performed various types of learning systems and algorithm development, machine learning is 80% data preparation, 10% data piping, 5% training, and 5% banging your head against the keyboard. Caveat Emptor – variable rates apply, depending on the industry vertical.

It is well known that there are basically three main attributes to the integrity of the data: complete, atomic, and well-annotated.

Complete data sets mean analytical results for all required influent and effluent constituents as specified in the effluent standard for a specific site on a specific date.

Atomic data sets are data elements that represent the lowest level of detail. For example, in a daily sales report, the individual items that are sold are atomic data, whereas roll-ups such as invoices and summary totals from invoices are aggregate data.

Well-annotated data sets are the categorization and labeling of data for ML applications. Training data must be properly categorized and annotated for a specific use case. With high-quality, human-powered data annotation, companies can build and improve ML implementations. This is where we get into issues such as Gold Standard Sets and Provenance of Data.

Installing Pandas:

Note: Before you install Pandas, you must bear in mind that it supports only Python versions 3.7, 3.8, and 3.9.

i am also assuming you are using some type of virtual environment.

As per the usual installation packages for most Python libraries:

pip install pandas

You can also choose to use a package manager in which case it’s probably already included.
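For example, if conda happens to be your package manager of choice, the explicit install is just as short:

# conda equivalent of the pip install above
conda install pandas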

#import pandas pd is the industry shorthand
import pandas as pd
#check the version
pd.__version__
[2]: '1.4.3'

Ok we have it set up correctly.

So what is pandas?

Glad you asked. i have always thought of pandas as enhancing numpy, as pandas is built on numpy. numpy is the fundamental library for scientific computing in Python. It provides high-performance multidimensional arrays and tools to deal with them. A numpy array is a grid of values (of the same type) indexed by a tuple of non-negative integers; numpy arrays are fast, easy to understand, and let users perform calculations across entire arrays. pandas, on the other hand, provides high-performance, fast, easy-to-use data structures and data analysis tools for manipulating numeric data and, most importantly, time series manipulation.
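To make that concrete, here is a minimal sketch contrasting a bare numpy array with a pandas Series carrying a labeled index (the labels are purely illustrative):

import numpy as np
import pandas as pd

# A plain numpy array: values addressed by position only
arr = np.array([0.1, 0.2, 0.3])

# The same values wrapped in a pandas Series with an explicit, labeled index
s = pd.Series(arr, index=['a', 'b', 'c'])

print(arr[0])    # access by position
print(s['b'])    # access by label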

So let’s start with the pandas Series object, which is a one-dimensional array of indexed data and can be created from a list or an array:

data = pd.Series([0.1,0.2,0.3,0.4, 0.5])
data
[5]: 0    0.1
     1    0.2
     2    0.3
     3    0.4
     4    0.5
     dtype: float64

The cool thing about this output is that Series creates and wraps both a sequence and the related indices; ergo we can access both the values and index attributes. To double check this we can access values:

[6]: data.values
[6]: array([0.1, 0.2, 0.3, 0.4, 0.5])

and the index:

[7]: data.index
[7]: RangeIndex(start=0, stop=5, step=1)

You can access the associated values via the [ ] square brackets just like numpy; however, pandas.Series is much more flexible than the numpy counterpart that it emulates. They say imitation is the highest form of flattery.
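For instance, sticking with the data Series we created above, both label-style access and numpy-style slicing work; a quick sketch:

# Access a single value by its index label
data[2]      # 0.3

# Slice a range of values, numpy style
data[1:4]    # the values at positions 1, 2 and 3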

Before we go grab some data from the LazyWebTM, let’s dig a little deeper into the Series object:

If one really thinks about the aspects of pandas.Series, it is really a specialized version of a python dictionary. For those unfamiliar, a dictionary (dict) is a python structure that maps arbitrary keys to a set of arbitrary values. Super powerful for data manipulation and data wrangling. Taking this a step further, pandas.Series is a structure that maps typed keys to a set of typed values. The typing is very important: the type-specific compiled code within numpy arrays makes them much more efficient than a python list, and in the same vein pandas.Series is much more efficient than a python dictionary. pandas.Series has an insane amount of commands:
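Here is a minimal sketch of that dictionary analogy, building a Series straight from a dict (the cities and numbers are placeholder values, not real data):

import pandas as pd

# A plain python dict of arbitrary keys mapped to values
populations = {'Atlanta': 500_000, 'Austin': 960_000, 'Boston': 650_000}

# Wrap it in a Series: the keys become the index, the values become the data
pop_series = pd.Series(populations)

pop_series['Austin']   # dictionary-style access by key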

Find Series Reference Here.

Next, we move to what i consider the most powerful aspect of pandas the DataFrame. A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

# Python code demonstrating how to create a
# DataFrame from a dict of lists.
# By default rows are addressed by an integer index.

import pandas as pd

# initialise the data as a dict of lists.
data = {'Name':['Bob', 'Carol', 'Alice', ''],
        'Age':[18, 20, 22, 24]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)
 [8]:
    Name  Age
0    Bob   18
1  Carol   20
2  Alice   22
3          24       

Let’s grab some data. nba.csv is a flat file of NBA statistics for players:

Get the NBA data file here.

i don’t watch or follow sports so i don’t know what is in this file. Just did a google search for csv statistics and this file came up.

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
 
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
 
 
print(first, "\n\n\n", second)
[9]:
Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object 


Team        Boston Celtics
Number                28.0
Position                SG
Age                   22.0
Height                 6-5
Weight               185.0
College      Georgia State
Salary           1148640.0
Name: R.J. Hunter, dtype: object

How nice is this? Easy Peasy. It seems almost too easy.
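And since this Snake_Byte is titled Groupings Of Pandas In A Frame, here is a minimal groupby sketch on the same data frame (using the Team and Salary columns shown above):

# Average salary per team, NaNs ignored automatically
team_salary = data.groupby('Team')['Salary'].mean()
print(team_salary.head())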

For reference here is the pandas.DataFrame reference documentation.

Just to show how far-reaching pandas is now in the data science world: for all of you who think you may need to use Spark, there is a package called PySpark. In PySpark, a DataFrame is equivalent to a relational table in Spark SQL and can be created using various functions. Once created, it can be manipulated using the various domain-specific-language (DSL) functions, much like your beloved SQL.
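A minimal sketch of what that looks like in PySpark, assuming a local SparkSession and the same nba.csv, just to show how close the DSL feels to pandas:

from pyspark.sql import SparkSession

# Spin up a local Spark session
spark = SparkSession.builder.appName("nba").getOrCreate()

# Read the same csv into a Spark DataFrame and filter, SQL-style
sdf = spark.read.csv("nba.csv", header=True, inferSchema=True)
sdf.filter(sdf.Weight < 180).select("Name", "Team", "Weight").show(5)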

Which might be another Snake_Byte in the future.

i also found pandas being used in ye ole #HealthIT #FHIR world for, as we started this off with, csv manipulation. Think of this Snake_Byte as an Ouroboros.

This GitHub repo converts csv2fhir (can haz interoperability?):

with pd.read_csv(file_path, **csv_reader_params) as buffer:
    for chunk in buffer:

        chunk: DataFrame = execute(chunk_tasks, chunk)

        # increment the source row number for the next chunk/buffer processed
        # add_row_num is the first task in the list
        starting_row_num = chunk["rowNum"].max() + 1
        chunk_tasks[0] = Task(name="add_row_num", params={"starting_index": starting_row_num})

        chunk: Series = chunk.apply(_convert_row_to_fhir, axis=1)

        for processing_exception, group_by_key, fhir_resources in chunk:
            yield processing_exception, group_by_key, fhir_resources

So this brings us to the end of this Snake_Byte. Hope this gave you a little taste of a great python library that is used throughout the industry.

Muzak To Blog By:

Mike Patton & The Metropole Orchestra – Mondo Cane – June 12th 2008 (Full Show) <- A true genius at work!

One other mention on the Muzak To Blog By must go to the fantastic Greek Composer, Evángelos Odysséas Papathanassíou (aka Vangelis) who recently passed away. We must not let the music be lost like tears in the rain, Vangelis’ music will live forever. Rest In Power, Maestro Vangelis. i have spent many countless hours listening to your muzak and now the sheep are truly dreaming. Listen here -> Memories Of Green.

Snake_Byte[6] Algorithm Complexity

The Lighter Side of Complexity - The Complexity Project
Your software design?

Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius and a lot of courage to move in the opposite direction.

E.F. Schumacher

First, i hope everyone is safe.

Second, i had meant this for reading over Thanksgiving but transparently i was having technical difficulties with LaTeX rendering, and it appears that both MathJax and native LaTeX are not working on my site. For those interested, i even injected the MathJax code into my .php header. Hence i had to rewrite a bunch of stuff, alas with no equations. Although, for some reason unbeknownst to me, my table worked.

Third, hey, it’s time for a Snake_Byte[]!

In this installment, i will be discussing Algorithm Complexity and will be using a Python method that i previously wrote about in Snake_Byte[5]: Range.

So what is algorithm complexity? Well, you may remember “Big Oh” notation from your mathematics or computer science classes. For those that don’t know, this involves both space and time complexity, not to be confused with Space-Time Continuums.

Let’s hit the LazyWeb and particularly Wikipedia:

“Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. It is a member of a family of notations invented by Paul Bachmann, Edmund Landau, and others collectively called Bachmann–Landau notation or asymptotic notation.”

— Wikipedia’s definition of Big O notation

Hmmm.   Let’s try to parse that a little better shall we?

So you want to figure out how slow or, hopefully, how fast your code is, using fancy algebraic terms and terminology. So you want to measure the algorithmic behavior as a function of two variables: time complexity and space complexity. Time is both the throughput as well as how fast the algorithm operates from t(0) to t(n-1). Then we have space complexity, which is literally how much memory (either in memory or persistent memory) the algorithm requires as a function of the input. As an added bonus you can throw around the word asymptotic:

From Dictionary.com

/ (ˌæsɪmˈtɒtɪk) / adjective. of or referring to an asymptote. (of a function, series, formula, etc) approaching a given value or condition, as a variable or an expression containing a variable approaches a limit, usually infinity.

Ergo asymptotic analysis means how the algorithm responds “to” or “with” values that approach ∞.

So “Hey what’s the asymptotic response of the algorithm?”

Hence we need a language that will allow us to say that the computing time, as a function of n, grows ‘on the order of n³,’ or ‘at most as fast as n³,’ or ‘at least as fast as n log n,’ etc.

There are five symbols used in the language of comparing the rates of growth of functions: ‘o’ (read ‘is little oh of’), O (read ‘is big oh of’), ‘θ’ (read ‘is theta of’), ‘∼’ (read ‘is asymptotically equal to’ or, irreverently, as ‘twiddles’), and Ω (read ‘is omega of’). It is interesting to note there are discrepancies amongst the ranks of computer science and mathematics as to the accuracy and validity of each. We will just keep it simple and say Big-Oh.

So let f(x) and g(x) be two functions of x, where each of the five symbols above is intended to compare the rapidity of growth of f and g. If we say that f(x) = o(g(x)), then informally we are saying that f grows more slowly than g does when x is very large.

Let’s address the time complexity piece. i don’t want to get philosophical on What is Time?, so for now and for this blog i will bound it just like an arrow: t(0) – t(n-1).

That said, the analysis of the algorithm is for an order of magnitude, not the actual running time. There is a Python module called time that we can use to do an exact analysis of the running time. Remember, this is to save you time upfront to gain an understanding of the time complexity before and while you are designing said algorithm.
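For example, a minimal sketch of that exact-measurement approach using the standard library’s time module (perf_counter is the usual choice for wall-clock timing):

import time

n = 1_000_000

start = time.perf_counter()
total = sum(range(n))      # the operation we want to measure
elapsed = time.perf_counter() - start

print(f"summing {n} integers took {elapsed:.4f} seconds")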

Most arithmetic operations are constant time; multiplication usually takes longer than addition and subtraction, and division takes even longer, but these run times don’t depend on the magnitude of the operands. Very large integers are an exception; in that case, the run time increases with the number of digits.

So indexing operations, whether reading or writing elements in a sequence or dictionary, are also constant time, regardless of the size of the data structure.

A for loop that traverses a sequence or dictionary is usually linear, as long as all of the operations in the body of the loop are constant time.

The built-in function sum is also linear because it does the same thing, but it tends to be faster because it is a more efficient implementation; in the language of algorithmic analysis, it has a smaller leading coefficient.

If you use the same loop to “add” a list of strings, the run time is quadratic because string concatenation is linear.

The string method join is usually faster because it is linear in the total length of the strings.
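A quick sketch of that difference, concatenating strings in a loop versus joining them once:

words = ["snake"] * 10_000

# Quadratic in the worst case: each += may copy the ever-growing string
s = ""
for w in words:
    s += w

# Linear in the total length of the strings
s2 = "".join(words)

assert s == s2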

So let’s look at an example using the aforementioned range built-in function:

This is much like the linear for loop traversal above. The lowest complexity is O(1), a constant number of operations regardless of the input; things get more interesting when we have nested loops:


k = 0
for i in range(n):
    for j in range(m):
        print(i)
        k=k+1

In this case, for nested loops, we multiply the time complexities, thus O(n*m). It works the same way when a loop with time complexity O(n) calls a function with time complexity O(m). When calculating complexity we omit constant factors, regardless of whether something executes 5 or 100 times.
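The same multiplication rule in code: an O(n) loop calling a helper that is itself O(m), a quick sketch:

def helper(m):
    # O(m) work inside the helper
    for j in range(m):
        pass

def outer(n, m):
    # an O(n) loop calling an O(m) helper => O(n*m) overall
    for i in range(n):
        helper(m)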

When you are performing an analysis look for worst-case boundary conditions or examples.

Linear O(n):

def first_zero(t, n):
    # O(n): a single pass over the sequence
    for i in range(n):
        if t[i] == 0:
            return 0
    return 1

Quadratic O(n**2):

def count_ops(n, m):
    # O(n*m), quadratic when m is about the same size as n
    res = 0
    for i in range(n):
        for j in range(m):
            res += 1
    return res

There are other types of time complexity, like exponential time and factorial time. Exponential time is O(2**n) and factorial time is O(n!).
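The classic illustration of exponential time is the naive recursive Fibonacci, which re-computes the same subproblems over and over; a minimal sketch:

def fib(n):
    # Exponential time: the call tree roughly doubles at each level, O(2**n)
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)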

For space complexity, memory has a limit, especially if you have ever chased down a heap allocation or garbage collection bug. Like we said earlier, there is no free lunch: you either trade space for time or time for space. Data-driven architectures respond to the input size of the data, thus the dimensionality of the input space needs to be addressed. If you have a constant number of variables, you have constant space complexity: O(1). If you need to declare an array, for instance with numpy, with n elements, then you have linear space complexity: O(n). Remember, the constant case is independent of the size of the problem, while the linear case grows with it.
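A quick sketch of that distinction: the first function keeps a constant number of variables no matter the input, while the second allocates an array that grows with it:

import numpy as np

def running_total(n):
    # O(1) space: a single accumulator regardless of n
    total = 0
    for i in range(n):
        total += i
    return total

def squares(n):
    # O(n) space: the array grows with the input size
    return np.arange(n) ** 2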

For a great book on Algorithm Design and Analysis i highly recommend:

The Algorithm Design Manual by Steven S. Skiena (click it takes you to amazon)

It goes in-depth into growth rates and dominance relations, etc., as they relate to graph algorithms, search and sorting, as well as cryptographic functions.

There is also Algorithms Unlocked by Cormen and the Algorithms Illuminated series by Roughgarden, which are great and less mathematically rigorous if that is not your forte.

Well, i hope this gave you a taste. i had meant this to be a much longer and more in-depth blog; however, i need to fix this LaTeX issue so i can properly address the matters at hand.

Until then,

#iwishyouwater <- Alexey Molchanov new world freedive record. He is a really awesome human.

Muzak To Blog By: Maddalena (Original Motion Picture Soundtrack) by the Maestro Ennio Morricone – Rest in Power Maestro i have spent many hours listening to your works.