It enables us to dabble in vicarious vice and to sit in smug judgment on the result.

~ Online Quote Generator
First, i hope everyone is safe. Second, i haven't written a Snake_Byte [ ] in quite some time, so here goes. This is a library i ran across late last night, and for what it achieves, even just for data exploration, it is well worth the pip install dabl cost of it all.
Data analysis is an essential task in the field of machine learning and artificial intelligence. However, it can be a challenging and time-consuming task, especially for those who are not familiar with programming. That's where the dabl library comes into play.
dabl, short for Data Analysis Baseline Library, is a high-level data analysis library in Python, designed to make data analysis as easy and effortless as possible. It is an open-source library, developed and maintained by scikit-learn core developer Andreas Müller.
The library provides a collection of simple and intuitive functions for exploring, cleaning, transforming, and visualizing data. With dabl, users can perform various data analysis tasks such as regression, classification, clustering, anomaly detection, and more, with just a few lines of code.
One of the main benefits of dabl is that it helps users get started quickly by providing a set of sensible defaults for each task. For example, to perform a regression analysis, users can simply instantiate a SimpleRegressor, pass in their data, and dabl will take care of the rest.
Another advantage of dabl is that it provides easy-to-understand visualizations, allowing users to quickly interpret the results of their analysis and make informed decisions based on the data. This is particularly useful for non-technical users who may not be familiar with complex mathematical models or graphs. dabl also integrates well with other popular data analysis libraries such as matplotlib, making it a convenient tool for those already familiar with the ecosystem.
So let us jump into the code, shall we?
This code uses the dabl library to analyze the Titanic dataset. The dataset is loaded using the pandas library, cleaned with dabl.clean, and passed to dabl's SimpleClassifier. The fit method fits the model to the data, and the dabl.plot function is used to visualize the dataset against the target.
import dabl
import pandas as pd
import matplotlib.pyplot as plt

# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))

# check shape, columns etc
titanic.shape
titanic.head()

# all that is good - tons of stuff going on here -
# but now let us ask dabl what's up:
titanic_clean = dabl.clean(titanic, verbose=1)

# a cool call to detect types
types = dabl.detect_types(titanic_clean)
print(types)

# let's do some eye candy and check the distributions
dabl.plot(titanic, 'survived')
plt.show()

# let us try a simple classifier - if it works it works
fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)
Ok, so let's break this down a little.
We load the data set: (make sure the target directory is the same)
# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
Of note, we loaded this into a pandas dataframe. Assuming we can use python and load a comma-separated values file, let's now do some exploration:
# check shape, columns etc
titanic.shape
titanic.head()
You should see the following:
Which is [1309 rows x 14 columns]
      pclass  survived                                             name  \
0          1         1                    Allen, Miss. Elisabeth Walton
1          1         1                   Allison, Master. Hudson Trevor
2          1         0                     Allison, Miss. Helen Loraine
3          1         0             Allison, Mr. Hudson Joshua Creighton
4          1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
...      ...       ...                                              ...
1304       3         0                             Zabour, Miss. Hileni
1305       3         0                            Zabour, Miss. Thamine
1306       3         0                        Zakarian, Mr. Mapriededer
1307       3         0                              Zakarian, Mr. Ortin
1308       3         0                               Zimmerman, Mr. Leo

         sex     age sibsp parch  ticket      fare    cabin embarked boat  \
0     female      29     0     0   24160  211.3375       B5        S    2
1       male  0.9167     1     2  113781    151.55  C22 C26        S   11
2     female       2     1     2  113781    151.55  C22 C26        S    ?
3       male      30     1     2  113781    151.55  C22 C26        S    ?
4     female      25     1     2  113781    151.55  C22 C26        S    ?
...      ...     ...   ...   ...     ...       ...      ...      ...  ...
1304  female    14.5     1     0    2665   14.4542        ?        C    ?
1305  female       ?     1     0    2665   14.4542        ?        C    ?
1306    male    26.5     0     0    2656     7.225        ?        C    ?
1307    male      27     0     0    2670     7.225        ?        C    ?
1308    male      29     0     0  315082     7.875        ?        S    ?

     body                        home.dest
0       ?                     St Louis, MO
1       ?  Montreal, PQ / Chesterville, ON
2       ?  Montreal, PQ / Chesterville, ON
3     135  Montreal, PQ / Chesterville, ON
4       ?  Montreal, PQ / Chesterville, ON
...   ...                              ...
1304  328                                ?
1305    ?                                ?
1306  304                                ?
1307    ?                                ?
1308    ?                                ?
Wow, tons of stuff going on here, and really this is cool data from an awful disaster. Ok, let's have dabl exercise some muscle here and ask it to clean things up a bit:
titanic_clean = dabl.clean(titanic, verbose=1)
types = dabl.detect_types(titanic_clean)
print(types)
verbose=1 in this case, and dabl.detect_types() shows the types detected, which i found helpful:
Detected feature types:
continuous      0
dirty_float     3
low_card_int    2
categorical     5
date            0
free_string     4
useless         0
dtype: int64
However, look what dabl did for us:
                      continuous  dirty_float  low_card_int  categorical
pclass                     False        False         False         True
survived                   False        False         False         True
name                       False        False         False        False
sex                        False        False         False         True
sibsp                      False        False          True        False
parch                      False        False          True        False
ticket                     False        False         False        False
cabin                      False        False         False        False
embarked                   False        False         False         True
boat                       False        False         False         True
home.dest                  False        False         False        False
age_?                      False        False         False         True
age_dabl_continuous         True        False         False        False
fare_?                     False        False         False        False
fare_dabl_continuous        True        False         False        False
body_?                     False        False         False         True
body_dabl_continuous        True        False         False        False

                       date  free_string  useless
pclass                False        False    False
survived              False        False    False
name                  False         True    False
sex                   False        False    False
sibsp                 False        False    False
parch                 False        False    False
ticket                False         True    False
cabin                 False         True    False
embarked              False        False    False
boat                  False        False    False
home.dest             False         True    False
age_?                 False        False    False
age_dabl_continuous   False        False    False
fare_?                False        False     True
fare_dabl_continuous  False        False    False
body_?                False        False    False
body_dabl_continuous  False        False    False

Target looks like classification
Linear Discriminant Analysis training set score: 0.578
Ah sweet! So data science, machine learning, or data mining is 80% cleaning up the data. Take what you can get and go with it, folks. dabl even informs us that the target looks like a classification problem. As the name suggests, classification means classifying the data on some grounds; it is a type of supervised learning, and the target column should be a categorical column. If the target has only two categories, like the one in the dataset above (survived / did not survive), it's called a binary classification problem. When there are more than two categories, it's a multi-class classification problem. The "target" column is also called a "class" in a classification problem.
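A quick sanity check along those lines - counting the distinct values in the target column - tells you which kind of problem you have. A tiny sketch with a made-up survived-style column:

```python
# Sketch: decide binary vs. multi-class by counting target categories.
import pandas as pd

# A toy stand-in for the 'survived' column
y = pd.Series([1, 1, 0, 0, 1, 0, 0])

n_classes = y.nunique()
problem = "binary" if n_classes == 2 else "multi-class"
print(n_classes, problem)  # 2 binary
```

dabl's detect_types does essentially this kind of inspection (plus much more) for every column at once.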
Now let's do some analysis. Yep, we are just getting to some statistics. There are univariate and bivariate analyses in this case. Bivariate analysis is the simultaneous analysis of two variables. It explores the relationship between two variables: whether an association exists and how strong it is, or whether there are differences between the two variables and how significant those differences are.
The main three types we will see here are:
- Categorical vs. Numerical
- Numerical vs. Numerical
- Categorical vs. Categorical
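To make the three pairings concrete, here is a hand-rolled sketch with pandas/matplotlib on a toy frame (column names echo the Titanic data, values are made up); dabl.plot automates exactly this kind of per-pairing chart selection:

```python
# Sketch: one plot per bivariate pairing, done manually with pandas/matplotlib.
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the Titanic frame (hypothetical values)
df = pd.DataFrame({
    "survived": [1, 1, 0, 0, 1, 0],                      # categorical target
    "fare":     [211.3, 151.6, 151.6, 7.2, 14.5, 7.9],   # numerical
    "age":      [29, 1, 2, 30, 25, 27],                  # numerical
    "sex":      ["f", "m", "f", "m", "f", "m"],          # categorical
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Categorical vs. Numerical: fare distribution per target class
df.boxplot(column="fare", by="survived", ax=axes[0])

# Numerical vs. Numerical: scatter of age against fare
axes[1].scatter(df["age"], df["fare"])

# Categorical vs. Categorical: contingency table plotted as grouped bars
pd.crosstab(df["sex"], df["survived"]).plot.bar(ax=axes[2])

plt.tight_layout()
```

Box plot, scatter plot, grouped bars: the chart type follows from the types of the two variables, which is why dabl's type detection step matters so much for its plotting.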
Also of note: Linear Discriminant Analysis, or LDA, is a dimensionality reduction technique used as a pre-processing step in machine learning. The goal of LDA is to project features from a higher-dimensional space onto a lower-dimensional space, in order to avoid the curse of dimensionality and to reduce computational cost. The original technique was developed in 1936 by Ronald A. Fisher and was named Linear Discriminant or Fisher's Discriminant Analysis. (NOTE: there is another LDA, Latent Dirichlet Allocation, used in semantic engineering, which is quite different.)
What auto-magically happens in the following plots is a set of continuous feature plots for discriminant analysis.
In the plots you will also see PCA (Principal Component Analysis). PCA was invented in 1901 by Karl Pearson as an analog of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also called the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, and proper orthogonal decomposition (POD) in mechanical engineering. PCA is used extensively in many fields, and my first usage of it was in 1993 for three-dimensional rendering of sound.
What is old is new again.
The main difference is that linear discriminant analysis is a supervised dimensionality reduction technique that also achieves classification of the data simultaneously: LDA focuses on finding a feature subspace that maximizes the separability between the groups. Principal component analysis, by contrast, is an unsupervised dimensionality reduction technique: it ignores the class labels and focuses on capturing the direction of maximum variation in the data set. Both reduce the dimensionality of the dataset and make it more computationally tractable, and both form a new set of components.
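The supervised/unsupervised split shows up directly in the scikit-learn APIs: PCA's fit never sees y, while LDA's requires it (and, with k classes, can produce at most k-1 components). A small sketch on the iris data:

```python
# Sketch: PCA (unsupervised) vs. LDA (supervised) on the same dataset.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# PCA: directions of maximum variance; the labels y are never used
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: directions that maximize class separability; y is required,
# and n_components is capped at n_classes - 1 = 2 here
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)
```

Same input, same output shape, very different geometry: the PCA axes chase variance, the LDA axes chase class separation.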
The last plot is categorical versus target.
So now let's try, as dabl suggested, a SimpleClassifier, then fit the data. (hey, some machine learning!)
fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)
This should produce the following outputs with accuracy metrics:
Running DummyClassifier(random_state=0)
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382
=== new best DummyClassifier(random_state=0) (using recall_macro):
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382

Running GaussianNB()
accuracy: 0.970 average_precision: 0.975 roc_auc: 0.984 recall_macro: 0.964 f1_macro: 0.968
=== new best GaussianNB() (using recall_macro):
accuracy: 0.970 average_precision: 0.975 roc_auc: 0.984 recall_macro: 0.964 f1_macro: 0.968

Running MultinomialNB()
accuracy: 0.964 average_precision: 0.988 roc_auc: 0.990 recall_macro: 0.956 f1_macro: 0.961

Running DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
=== new best DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0) (using recall_macro):
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974

Running DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
accuracy: 0.969 average_precision: 0.965 roc_auc: 0.983 recall_macro: 0.965 f1_macro: 0.967

Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01, random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974

Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.974 average_precision: 0.991 roc_auc: 0.993 recall_macro: 0.970 f1_macro: 0.972

Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.975 average_precision: 0.991 roc_auc: 0.994 recall_macro: 0.971 f1_macro: 0.973

Best model:
DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
Best Scores:
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
This actually calls the sklearn routines in aggregate. Looks like the winner is a humble depth-one decision tree (a decision stump), edging out logistic regression on recall_macro. keep it simple sam, it ain't gotta be complicated.
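For the curious, dabl's winning model amounts to this in plain scikit-learn. The frame below is toy data (titanic_clean isn't available outside the post); with the real data you would pass the X and y from above instead:

```python
# Sketch: the "best model" dabl picked, written directly in scikit-learn.
# Toy data: one feature fully determines the label, so a single split suffices.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # label depends only on feature 0

# A depth-one tree ("decision stump"): one threshold on one feature
stump = DecisionTreeClassifier(class_weight="balanced", max_depth=1,
                               random_state=0)
stump.fit(X, y)
print(stump.get_depth(), stump.score(X, y))  # 1 1.0
```

A stump scoring that high on Titanic is itself a clue worth chasing: it usually means one column (here, quite plausibly boat) nearly determines the target on its own.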
In conclusion, dabl is a highly recommended library for those looking to simplify their data analysis tasks. With its intuitive functions and visualizations, it provides a quick and easy way to perform data analysis, making it an ideal tool for both technical and non-technical users. Again, the real strength of dabl is in providing simple interfaces for data exploration. For more information, see the dabl GitHub repository.
Muzak To Blog By: "Ballads For Two", Chet Baker and Wolfgang Lackerschmid. Trumpet meets vibraphone sparsity; the space between the notes is where all of the action lives.