Snake_Byte[12]: Dabl, A High-Level Data Analysis Library in Python

Not To Be Confused With The Game

It enables us to dabble in vicarious vice and to sit in smug judgment on the result.

Online Quote Generator

First, i hope everyone is safe. Second, i haven't written a Snake_Byte [ ] in quite some time, so here goes. This is a library i ran across late last night, and for what it achieves, even just for data exploration, it is well worth the pip install dabl cost of it all.

Data analysis is an essential task in the field of machine learning and artificial intelligence. However, it can be a challenging and time-consuming task, especially for those who are not familiar with programming. That’s where the dabl library comes into play.

dabl, short for Data Analysis Baseline Library, is a high-level data analysis library in Python, designed to make data analysis as easy and effortless as possible. It is an open-source library created by Andreas Müller of the scikit-learn community.

The library provides a collection of simple and intuitive functions for exploring, cleaning, transforming, and visualizing data. With dabl, users can perform various data analysis tasks such as regression, classification, clustering, anomaly detection, and more, with just a few lines of code.

One of the main benefits of dabl is that it helps users get started quickly by providing a set of sensible defaults for each task. For example, to perform a regression analysis, users can simply fit a dabl.SimpleRegressor on their data, and dabl will take care of the rest.
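A minimal sketch of that idea, mirroring the SimpleClassifier call used later in this post (the file name houses.csv and the target column price are hypothetical placeholders):

import dabl
import pandas as pd

# any tabular dataset will do; the file and column names here are made up
df = pd.read_csv("houses.csv")
reg = dabl.SimpleRegressor()
# dabl handles type detection, cleaning, and a baseline model search internally
reg.fit(df, target_col="price")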

Another advantage of dabl is that it provides easy-to-understand visualizations of the results, allowing users to quickly understand the results of their analysis and make informed decisions based on the data. This is particularly useful for non-technical users who may not be familiar with complex mathematical models or graphs.

dabl also integrates well with other popular data analysis libraries such as pandas, numpy, and matplotlib, making it a convenient tool for those already familiar with these libraries.

So let us jump into the code shall we?

This code uses the dabl library to analyze the Titanic dataset. The dataset is loaded using the pandas library, cleaned with dabl.clean, and its feature types are inspected with dabl.detect_types. The dabl.plot function visualizes the data against the target, and finally a dabl.SimpleClassifier is fit to the cleaned data; its fit method runs a baseline model search and reports accuracy metrics as it goes.

import dabl
import pandas as pd
import matplotlib.pyplot as plt

# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
# check shape, columns, etc.
titanic.shape
titanic.head()
# all that is good, tons of stuff going on here, but now let us ask dabl what's up:
titanic_clean = dabl.clean(titanic, verbose=1)

# a cool call to detect feature types
types = dabl.detect_types(titanic_clean)
print(types)
# let's do some eye candy
dabl.plot(titanic, 'survived')
# let's check the distributions
plt.show()
# let us try a simple classifier as dabl suggested, if it works it works
# perform the classification analysis
fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)                     

Ok, so let's break this down a little.

We load the dataset (dabl.datasets.data_path resolves the on-disk path to the bundled titanic.csv, so make sure the file is where dabl expects it):

# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))

Of note, we loaded this into a pandas DataFrame. Assuming we can use Python and load a comma-separated values file, let's now do some exploration:

# check shape, columns, etc.
titanic.shape
titanic.head()

You should see the following:

(1309, 14) 

Which is [1309 rows x 14 columns]

and then:

pclass  survived                                             name  \
0          1         1                    Allen, Miss. Elisabeth Walton   
1          1         1                   Allison, Master. Hudson Trevor   
2          1         0                     Allison, Miss. Helen Loraine   
3          1         0             Allison, Mr. Hudson Joshua Creighton   
4          1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
...      ...       ...                                              ...   
1304       3         0                             Zabour, Miss. Hileni   
1305       3         0                            Zabour, Miss. Thamine   
1306       3         0                        Zakarian, Mr. Mapriededer   
1307       3         0                              Zakarian, Mr. Ortin   
1308       3         0                               Zimmerman, Mr. Leo   

         sex     age  sibsp  parch  ticket      fare    cabin embarked boat  \
0     female      29      0      0   24160  211.3375       B5        S    2   
1       male  0.9167      1      2  113781    151.55  C22 C26        S   11   
2     female       2      1      2  113781    151.55  C22 C26        S    ?   
3       male      30      1      2  113781    151.55  C22 C26        S    ?   
4     female      25      1      2  113781    151.55  C22 C26        S    ?   
...      ...     ...    ...    ...     ...       ...      ...      ...  ...   
1304  female    14.5      1      0    2665   14.4542        ?        C    ?   
1305  female       ?      1      0    2665   14.4542        ?        C    ?   
1306    male    26.5      0      0    2656     7.225        ?        C    ?   
1307    male      27      0      0    2670     7.225        ?        C    ?   
1308    male      29      0      0  315082     7.875        ?        S    ?   

     body                        home.dest  
0       ?                     St Louis, MO  
1       ?  Montreal, PQ / Chesterville, ON  
2       ?  Montreal, PQ / Chesterville, ON  
3     135  Montreal, PQ / Chesterville, ON  
4       ?  Montreal, PQ / Chesterville, ON  
...   ...                              ...  
1304  328                                ?  
1305    ?                                ?  
1306  304                                ?  
1307    ?                                ?  
1308    ?                                ?  

Wow, tons of stuff going on here, and really this is cool data from an awful disaster. Ok, let's have dabl exercise some muscle here and ask it to clean things up a bit:

titanic_clean = dabl.clean(titanic, verbose=1)
types = dabl.detect_types(titanic_clean)
print(types)

i set verbose=1 in this case, and dabl.detect_types() shows the feature types it detected, which i found helpful:

Detected feature types:
continuous      0
dirty_float     3
low_card_int    2
categorical     5
date            0
free_string     4
useless         0
dtype: int64

However, look what dabl did for us:

                      continuous  dirty_float  low_card_int  categorical  \
pclass                     False        False         False         True   
survived                   False        False         False         True   
name                       False        False         False        False   
sex                        False        False         False         True   
sibsp                      False        False          True        False   
parch                      False        False          True        False   
ticket                     False        False         False        False   
cabin                      False        False         False        False   
embarked                   False        False         False         True   
boat                       False        False         False         True   
home.dest                  False        False         False        False   
age_?                      False        False         False         True   
age_dabl_continuous         True        False         False        False   
fare_?                     False        False         False        False   
fare_dabl_continuous        True        False         False        False   
body_?                     False        False         False         True   
body_dabl_continuous        True        False         False        False   

                       date  free_string  useless  
pclass                False        False    False  
survived              False        False    False  
name                  False         True    False  
sex                   False        False    False  
sibsp                 False        False    False  
parch                 False        False    False  
ticket                False         True    False  
cabin                 False         True    False  
embarked              False        False    False  
boat                  False        False    False  
home.dest             False         True    False  
age_?                 False        False    False  
age_dabl_continuous   False        False    False  
fare_?                False        False     True  
fare_dabl_continuous  False        False    False  
body_?                False        False    False  
body_dabl_continuous  False        False    False 
Target looks like classification
Linear Discriminant Analysis training set score: 0.578
 

Ah sweet! So data science, machine learning, or data mining is 80% cleaning up the data. Take what you can get and go with it, folks. dabl even informs us that the target looks like a classification problem. As the name suggests, classification means classifying the data on some grounds; it is a type of supervised learning. In classification, the target column should be a categorical column. If the target has only two categories, like the one in this dataset (survived or not), it's called a binary classification problem. When there are more than two categories, it's a multi-class classification problem. The "target" column is also called the "class" in a classification problem.
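A quick sanity check on the cleaned frame confirms the binary target (the full Titanic dataset has 809 non-survivors and 500 survivors):

# two classes only, so this is binary classification
print(titanic_clean["survived"].value_counts())
# 0    809
# 1    500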

Now let's do some analysis. Yep, we are just getting to some statistics. There are univariate and bivariate analyses; bivariate is what we will see here.

Bivariate analysis is the simultaneous analysis of two variables. It explores the relationship between two variables: whether an association exists and how strong that association is, or whether there are differences between the two variables and how significant those differences are.

The main three types we will see here (a hand-rolled sketch follows the list) are:

  1. Categorical vs. Numerical
  2. Numerical vs. Numerical
  3. Categorical vs. Categorical
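
dabl generates these views for you, but for intuition, here is a minimal hand-rolled sketch of each with pandas and matplotlib (note the raw file uses ? as its missing-value marker, so we coerce to numeric first):

import pandas as pd
import matplotlib.pyplot as plt

num = titanic.copy()
# coerce the '?' placeholders into NaN so the columns are numeric
num["age"] = pd.to_numeric(num["age"], errors="coerce")
num["fare"] = pd.to_numeric(num["fare"], errors="coerce")

# 1. categorical vs. numerical: fare distribution by survival
num.boxplot(column="fare", by="survived")

# 2. numerical vs. numerical: age against fare
num.plot.scatter(x="age", y="fare")

# 3. categorical vs. categorical: survival counts by sex
pd.crosstab(num["sex"], num["survived"]).plot.bar(stacked=True)

plt.show()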

Also of note: Linear Discriminant Analysis, or LDA, is a dimensionality reduction technique used as a pre-processing step in machine learning. The goal of LDA is to project features from a higher-dimensional space onto a lower-dimensional space, in order to avoid the curse of dimensionality and to reduce resource and dimensional costs. The original technique was developed in 1936 by Ronald A. Fisher and was named Linear Discriminant or Fisher's Discriminant Analysis.

(NOTE: there is another LDA, Latent Dirichlet Allocation, used in semantic engineering, which is quite different.)
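The discriminant analysis dabl just ran is the same LDA you can call directly from scikit-learn. A minimal sketch on the two continuous columns dabl created above (the fillna(0) imputation is a crude placeholder, and dabl's reported 0.578 used more features, so the scores will differ):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# supervised projection: at most (n_classes - 1) components, so one for a binary target
lda = LinearDiscriminantAnalysis(n_components=1)
X_num = titanic_clean[["age_dabl_continuous", "fare_dabl_continuous"]].fillna(0)
y = titanic_clean["survived"]
X_proj = lda.fit_transform(X_num, y)
print(lda.score(X_num, y))  # training-set accuracy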

dabl.plot(titanic, 'survived')

The following plots happen auto-magically: continuous feature pair plots for discriminant analysis, discriminating PCA directions, and a categorical-versus-target view.

Continuous Feature PairPlots

In the plots you will also see PCA (Principal Component Analysis). PCA was invented in 1901 by Karl Pearson as an analog of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also called the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, and proper orthogonal decomposition (POD) in mechanical engineering. PCA is used extensively in many fields; my first usage of it was in 1993 for three-dimensional rendering of sound.

Discriminating PCA Directions

What is old is new again.

The main difference is that Linear Discriminant Analysis is a supervised dimensionality reduction technique that also achieves classification of the data simultaneously; LDA focuses on finding a feature subspace that maximizes the separability between the groups. Principal Component Analysis, by contrast, is an unsupervised dimensionality reduction technique that ignores the class label; PCA focuses on capturing the directions of maximum variation in the data set.

LDA

Both reduce the dimensionality of the dataset and make it more computationally tractable, and both form a new set of components.
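To make the contrast concrete, here is a small self-contained sketch on synthetic data; note PCA never sees the labels, while LDA requires them:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# PCA: unsupervised, keeps the directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, keeps the directions that best separate the classes
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (300, 2) (300, 1)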

The last plot is categorical versus target.

So now let's try, as dabl suggested, a SimpleClassifier and fit it to the data. (hey, some machine learning!)

fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y) 

This should produce the following outputs with accuracy metrics:

Running DummyClassifier(random_state=0)
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382
=== new best DummyClassifier(random_state=0) (using recall_macro):
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382

Running GaussianNB()
accuracy: 0.970 average_precision: 0.975 roc_auc: 0.984 recall_macro: 0.964 f1_macro: 0.968
=== new best GaussianNB() (using recall_macro):
accuracy: 0.970 average_precision: 0.975 roc_auc: 0.984 recall_macro: 0.964 f1_macro: 0.968

Running MultinomialNB()
accuracy: 0.964 average_precision: 0.988 roc_auc: 0.990 recall_macro: 0.956 f1_macro: 0.961
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
=== new best DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0) (using recall_macro):
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974

Running DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
accuracy: 0.969 average_precision: 0.965 roc_auc: 0.983 recall_macro: 0.965 f1_macro: 0.967
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01,
                       random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000,
                   random_state=0)
accuracy: 0.974 average_precision: 0.991 roc_auc: 0.993 recall_macro: 0.970 f1_macro: 0.972
Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.975 average_precision: 0.991 roc_auc: 0.994 recall_macro: 0.971 f1_macro: 0.973

Best model:
DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
Best Scores:
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974

This actually calls the sklearn routines in aggregate. Looks like a humble depth-1 decision tree takes best model on recall_macro, with our old friend logistic regression right behind it. keep it simple sam, it ain't gotta be complicated.
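Since SimpleClassifier follows the scikit-learn estimator API, the fitted object can be used like any other model, and dabl's explain helper gives a quick look at the winner (a sketch, assuming the fit above succeeded):

# the fitted SimpleClassifier behaves like any sklearn-compatible estimator
print(fc.predict(X.head()))

# dabl's explain helper summarizes and visualizes the best model it found
dabl.explain(fc)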

In conclusion, dabl is a highly recommended library for those looking to simplify their data analysis tasks. With its intuitive functions and visualizations, it provides a quick and easy way to perform data analysis, making it an ideal tool for both technical and non-technical users. Again, the real strength of dabl is in providing simple interfaces for data exploration. For more information:

dabl GitHub. <- click here

Until Then,

#iwishyouwater <- hold your breath on a dive with my comrade at arms @corepaddleboards. great video and the clarity was astounding.

Muzak To Blog By: "Ballads For Two", Chet Baker and Wolfgang Lackerschmid, trumpet meets vibraphone sparsity. The space between the notes is where all of the action lives.