DISCLAIMER: This blog was written some time ago. Software breaks once in a while, and there was a ghost in my LazyWebTM machine. We are back to our regularly scheduled program. Read on, Dear Reader; your humble narrator apologizes.
The other day i was talking to someone about file manipulation and whatnot, and mentioned how awesome Pandas and the magic of df.DoWhatEverYaWant(my_data_object) via a dataframe was, and they weren't really familiar with Pandas. So, being that no one knows everything, i figured i would write a Snake_Byte[] about Pandas. i believe i met the author of pandas, Wes McKinney, at a PyData conference years ago at Facebook. Really nice human, and he has created one of the most-used libraries for data wrangling.
One of the most nagging issues with machine learning, in general, is access to high-integrity canonical training sets, or even just high-integrity data sets writ large.
By my estimate, having built various types of learning systems and performed algorithm development over the years, machine learning is 80% data preparation, 10% data piping, 5% training, and 5% banging your head against the keyboard. Caveat Emptor: variable rates apply, depending on the industry vertical.
It is well known that there are basically three main attributes to the integrity of data: it must be complete, atomic, and well-annotated.
Complete data sets mean analytical results for all required influent and effluent constituents as specified in the effluent standard for a specific site on a specific date.
Atomic data sets are data elements that represent the lowest level of detail. For example, in a daily sales report, the individual items that are sold are atomic data, whereas roll-ups such as invoices and summary totals from invoices are aggregate data.
Well-annotated data sets are the categorization and labeling of data for ML applications. Training data must be properly categorized and annotated for a specific use case. With high-quality, human-powered data annotation, companies can build and improve ML implementations. This is where we get into issues such as Gold Standard Sets and Provenance of Data.
Installing Pandas:
Note: Before you install Pandas, you must bear in mind that it supports only Python versions 3.7, 3.8, and 3.9.
I am also assuming you are using some type of virtual environment.
As per the usual installation packages for most Python libraries:
pip install pandas
You can also choose to use a package manager in which case it’s probably already included.
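For instance, if you happen to be using conda, pandas ships with the full Anaconda distribution already; otherwise a one-liner does it:
conda install pandas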
# import pandas; pd is the industry shorthand
import pandas as pd
# check the version
pd.__version__
[2]: '1.4.3'
Ok, we have it set up correctly. So what is pandas?
Glad you asked. i have always thought of pandas as enhancing numpy, as pandas is built on numpy. numpy is the fundamental library of python used to perform scientific computing. It provides high-performance multidimensional arrays and tools to deal with them. A numpy array is a grid of values (of the same type) indexed by a tuple of positive integers. numpy arrays are fast, easy to understand, and let users perform calculations across entire arrays. pandas, on the other hand, provides high-performance, easy-to-use data structures and data analysis tools for manipulating numeric data and, most importantly, time series manipulation.
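To make that concrete, here is a minimal numpy sketch (the variable names are mine) of exactly what that description means:
import numpy as np
# a grid of values of the same type, indexed by a tuple of positive integers
grid = np.array([[1.0, 2.0], [3.0, 4.0]])
print(grid[0, 1])   # 2.0 -- indexed by the tuple (0, 1)
print(grid * 10)    # a calculation applied across the whole array at once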
So let's start with the pandas Series object, which is a one-dimensional array of indexed data that can be created from a list or an array:
data = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5])
data
[5]: 0 0.1
1 0.2
2 0.3
3 0.4
4 0.5
dtype: float64
The cool thing about this output is that Series creates and wraps both a sequence and the related indices; ergo, we can access both the values and index attributes. To double-check this we can access the values:
[6]: data.values
[6]: array([0.1, 0.2, 0.3, 0.4, 0.5])
and the index:
[7]: data.index
[7]: RangeIndex(start=0, stop=5, step=1)
You can access the associated values via the [ ] square brackets just like numpy; however, pandas.Series is much more flexible than the numpy counterpart that it emulates. They say imitation is the highest form of flattery.
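A minimal sketch of that flexibility (the new variable name is mine): the index can be explicitly defined and non-integer, something a bare numpy array does not give you:
data2 = pd.Series([0.1, 0.2, 0.3], index=['a', 'b', 'c'])
print(data2['b'])   # 0.2 -- access by label
print(data2[1])     # 0.2 -- positional access still works in pandas 1.x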
If one really thinks about the aspects of pandas.Series, it is really a specialized version of a python dictionary. For those unfamiliar, a dictionary (dict) is a python structure that maps arbitrary keys to a set of arbitrary values. Super powerful for data manipulation and data wrangling. Taking this a step further, pandas.Series is a structure that maps typed keys to a set of typed values. The typing is very important: the type-specific compiled code behind numpy arrays makes them much more efficient than a python list. In the same vein, pandas.Series is much more efficient than python dictionaries. pandas.Series has an insane amount of commands.
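Here is a quick sketch of that dictionary analogy; the keys become the index, and the population figures are made up purely for illustration:
# construct a Series directly from a dict; keys become the index
population = pd.Series({'texas': 29_000_000, 'ohio': 11_700_000, 'utah': 3_300_000})
print(population['ohio'])        # dictionary-style lookup by key
print(population.sort_values())  # plus typed, vectorized operations a dict lacks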
Next, we move to what i consider the most powerful aspect of pandas: the DataFrame. A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.
# Python code demonstrating how to create a
# DataFrame from a dict of lists.
# By default the row index is 0..n-1.
import pandas as pd

# initialise data of lists.
data = {'Name': ['Bob', 'Carol', 'Alice', ''],
        'Age': [18, 20, 22, 24]}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)
[8]:
    Name  Age
0    Bob   18
1  Carol   20
2  Alice   22
3          24
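And a quick taste of working with it, using nothing beyond the toy frame above: column selection hands back a Series, and boolean masks filter rows:
# select a single column; it comes back as a pandas Series
print(df['Name'])
# boolean filtering: only the rows where Age is greater than 19
print(df[df['Age'] > 19])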
Let's grab some data. nba.csv is a flat file of NBA statistics for players.
i don't watch or follow sports, so i don't know what is in this file; i just did a google search for csv statistics and this file came up.
# importing the pandas package
import pandas as pd

# making a data frame from the csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving rows by the loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
[9]:
Team Boston Celtics
Number 0.0
Position PG
Age 25.0
Height 6-2
Weight 180.0
College Texas
Salary 7730337.0
Name: Avery Bradley, dtype: object
Team Boston Celtics
Number 28.0
Position SG
Age 22.0
Height 6-5
Weight 185.0
College Georgia State
Salary 1148640.0
Name: R.J. Hunter, dtype: object
How nice is this? Easy Peasy. It seems almost too easy.
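One aside worth knowing: loc looks rows up by label, while its sibling iloc does the same by integer position, which is handy when you don't know the labels yet. A minimal sketch, assuming Avery Bradley is the first row in the file:
# the same row, retrieved by position instead of label
print(data.iloc[0])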
For reference, here is the pandas.DataFrame reference documentation.
Just to show how far-reaching pandas is now in the data science world: for all of you who think you may need to use Spark, there is a package called PySpark. In PySpark, a DataFrame is equivalent to a relational table in Spark SQL and can be created using various functions. Once created, it can be manipulated using the various domain-specific-language (DSL) functions, much like your beloved SQL.
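A minimal sketch, assuming pyspark is installed and a throwaway local session is acceptable; the app name is mine:
from pyspark.sql import SparkSession
# spin up a local Spark session
spark = SparkSession.builder.appName("snake_byte_demo").getOrCreate()
# create a Spark DataFrame much like the pandas example above
sdf = spark.createDataFrame([("Bob", 18), ("Carol", 20), ("Alice", 22)], ["Name", "Age"])
# DSL functions instead of raw SQL
sdf.filter(sdf.Age > 19).show()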
Which might be another Snake_Byte in the future.
i also found pandas being used in ye ole #HealthIT #FHIR world for, as we started this off with, csv manipulation. Think of this Snake_Byte as an Ouroboros.
This github repo converts csv2fhir (can haz interoperability?):
with pd.read_csv(file_path, **csv_reader_params) as buffer:
    for chunk in buffer:
        chunk: DataFrame = execute(chunk_tasks, chunk)
        # increment the source row number for the next chunk/buffer processed
        # add_row_num is the first task in the list
        starting_row_num = chunk["rowNum"].max() + 1
        chunk_tasks[0] = Task(name="add_row_num", params={"starting_index": starting_row_num})
        chunk: Series = chunk.apply(_convert_row_to_fhir, axis=1)
        for processing_exception, group_by_key, fhir_resources in chunk:
            yield processing_exception, group_by_key, fhir_resources
So this brings us to the end of this Snake_Byte. Hope this gave you a little taste of a great python library that is used throughout the industry.
Muzak To Blog By:
Mike Patton & The Metropole Orchestra – Mondo Cane – June 12th 2008 (Full Show) <- A true genius at work!
One other mention on the Muzak To Blog By must go to the fantastic Greek composer Evángelos Odysséas Papathanassíou (aka Vangelis), who recently passed away. We must not let the music be lost like tears in the rain; Vangelis' music will live forever. Rest In Power, Maestro Vangelis. i have spent countless hours listening to your muzak, and now the sheep are truly dreaming. Listen here -> Memories Of Green.