Groupings Of Pandas In A Frame

DISCLAIMER: This blog was written some time ago. Software breaks once in a while and there was a ghost in my LazyWebTM machine. We are back to our regularly scheduled program. Read on Dear Reader, and your humble narrator apologizes.

The other day i was talking to someone about file manipulations and whatnot, and mentioned how awesome Pandas is and the magic of df.DoWhatEverYaWant(my_data_object) via a dataframe, and they weren’t really familiar with Pandas. So, since no one knows everything, i figured i would write a Snake_Byte[] about Pandas. i believe i met the author of pandas – Wes McKinney – at a PyData conference years ago at Facebook. Really nice human, and he has created one of the most used libraries for data wrangling.

One of the most nagging issues with machine learning, in general, is access to high-integrity canonical training sets, or even just high-integrity data sets writ large.

By my estimate, having built various types of learning systems and done algorithm development over the years, machine learning is 80% data preparation, 10% data piping, 5% training, and 5% banging your head against the keyboard. Caveat Emptor – variable rates apply, depending on the industry vertical.

It is well known that there are basically three main attributes of data integrity: complete, atomic, and well-annotated.

Complete data sets contain analytical results for everything the specification requires – to borrow an example from environmental monitoring, all influent and effluent constituents specified in the effluent standard for a specific site on a specific date.

Atomic data sets are data elements that represent the lowest level of detail. For example, in a daily sales report, the individual items that are sold are atomic data, whereas roll-ups such as invoices and summary totals from invoices are aggregate data.

Well-annotated data sets have been categorized and labeled for ML applications. Training data must be properly categorized and annotated for a specific use case. With high-quality, human-powered data annotation, companies can build and improve ML implementations. This is where we get into issues such as Gold Standard Sets and Provenance of Data.

Installing Pandas:

Note: Before you install Pandas, bear in mind that, at the time of writing, it supports only Python versions 3.7, 3.8, and 3.9.

I am also assuming you are using some type of virtual environment.

As per the usual installation packages for most Python libraries:

pip install pandas

You can also choose to use a package manager, in which case it is probably already included.

# import pandas; pd is the industry-standard shorthand
import pandas as pd
# check the version
pd.__version__
[2]: '1.4.3'

Ok we have it set up correctly.

So what is pandas?

Glad you asked. i have always thought of pandas as enhancing numpy, as pandas is built on top of numpy. numpy is the fundamental library of Python for scientific computing. It provides high-performance multidimensional arrays and the tools to deal with them. A numpy array is a grid of values (of the same type) indexed by a tuple of non-negative integers. numpy arrays are fast, easy to understand, and let users perform vectorized calculations across entire arrays. pandas, on the other hand, provides high-performance, easy-to-use data structures and data analysis tools for manipulating numeric data and, most importantly, time series.
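
To make that concrete, here is a minimal numpy sketch (the values are made up for illustration):

import numpy as np

# a 2x3 grid of values of the same type (float64)
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

a[1, 2]   # 6.0 -- indexed by a tuple of integers
a * 10    # vectorized calculation across the whole grid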

So let's start with the pandas Series object, which is a one-dimensional array of indexed data that can be created from a list or an array:

data = pd.Series([0.1,0.2,0.3,0.4, 0.5])
data
[5]: 0    0.1
     1    0.2
     2    0.3
     3    0.4
     4    0.5
     dtype: float64

The cool thing about this output is that Series creates and wraps both a sequence of values and the related indices; ergo, we can access both the values and index attributes. To double-check this we can access the values:

[6]: data.values
[6]: array([0.1, 0.2, 0.3, 0.4, 0.5])

and the index:

[7]: data.index
[7]: RangeIndex(start=0, stop=5, step=1)

You can access the associated values via [ ] square brackets just like numpy; however, pandas.Series is much more flexible than the numpy counterpart it emulates. They say imitation is the highest form of flattery.
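
For instance, a quick sketch (the labels here are made up) showing that the index need not be integers at all:

marks = pd.Series([0.1, 0.2, 0.3], index=['a', 'b', 'c'])
marks['b']       # 0.2 -- access by label
marks['a':'b']   # label-based slicing includes both endpoints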

We will go grab some data from the LazyWebTM in a minute, but first a bit more on what a Series really is:

If one really thinks about the aspects of pandas.Series, it is really a specialized version of a Python dictionary. For those unfamiliar, a dictionary (dict) is a Python structure that maps arbitrary keys to a set of arbitrary values. Super powerful for data manipulation and data wrangling. Taking this a step further, pandas.Series is a structure that maps typed keys to a set of typed values. The typing is very important: just as the type-specific compiled code behind numpy arrays makes them much more efficient than a Python list, it makes pandas.Series much more efficient than a Python dictionary for certain operations. pandas.Series has an insane amount of commands:
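
A quick sketch of the dict analogy (the numbers are made up for illustration):

population = {'California': 39_500_000,
              'Texas': 29_500_000,
              'New York': 20_200_000}
pop = pd.Series(population)   # dict keys become the typed index
pop['Texas']                  # 29500000 -- dict-style access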

Find Series Reference Here.

Next, we move to what i consider the most powerful aspect of pandas: the DataFrame. A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

# Python code to demonstrate creating a
# DataFrame from a dict of lists.
# By default, an integer RangeIndex is assigned.
 
import pandas as pd
 
# initialise the data as a dict of lists
data = {'Name':['Bob', 'Carol', 'Alice', ''],
        'Age':[18, 20, 22, 24]}
 
# Create the DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)
 [8]:
    Name  Age
0    Bob   18
1  Carol   20
2  Alice   22
3          24       
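
Columns in a DataFrame behave like Series, which makes selecting and deriving data easy. A quick sketch on the df we just built (the Over21 column is my own illustrative addition):

# select a single column (returns a Series)
df["Age"]

# add a derived column via a vectorized comparison
df["Over21"] = df["Age"] > 21
print(df)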

Let's grab some data. nba.csv is a flat file of NBA player statistics:

Get the NBA data file here.

i don’t watch or follow sports, so i don’t know what is in this file. i just did a Google search for csv statistics and this file came up.

# importing pandas package
import pandas as pd
 
# make a DataFrame from the csv file, indexed by player name
data = pd.read_csv("nba.csv", index_col="Name")
 
# retrieve rows by label with the .loc accessor
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
 
print(first, "\n\n\n", second)
[9]:
Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object 


Team        Boston Celtics
Number                28.0
Position                SG
Age                   22.0
Height                 6-5
Weight               185.0
College      Georgia State
Salary           1148640.0
Name: R.J. Hunter, dtype: object

How nice is this? Easy Peasy. It seems almost too easy.
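
And since the title promised groupings, here is a quick sketch of groupby on the same DataFrame (assuming the column names shown in the output above):

# group the NBA data by team and take the mean salary per team
team_salary = data.groupby("Team")["Salary"].mean()
print(team_salary.head())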

For reference, here is the pandas.DataFrame reference documentation.

Just to show how far-reaching pandas is in the data science world: for all of you who think you may need to use Spark, there is a package called PySpark. In PySpark, a DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions. Once created, it can be manipulated using the various domain-specific-language (DSL) functions, much like your beloved SQL.
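
A minimal sketch of the same toy data in PySpark (assuming pyspark is installed and a local Spark session; the names and values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snake_byte").getOrCreate()

# create a Spark DataFrame from rows plus column names
sdf = spark.createDataFrame([("Bob", 18), ("Carol", 20), ("Alice", 22)],
                            ["Name", "Age"])

# DSL functions, much like SQL
sdf.filter(sdf.Age > 18).show()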

Which might be another Snake_Byte in the future.

i also found pandas being used in ye ole #HealthIT #FHIR world for – as we started this off with – csv manipulation. Think of this Snake_Byte as an Ouroboros.

This GitHub repo, csv2fhir, converts csv to FHIR (can haz interoperability?):

# excerpted from the repo; execute, Task, chunk_tasks, file_path,
# csv_reader_params, and _convert_row_to_fhir are defined elsewhere in the project
with pd.read_csv(file_path, **csv_reader_params) as buffer:
    for chunk in buffer:

        # run the configured pipeline tasks against the current chunk
        chunk: DataFrame = execute(chunk_tasks, chunk)

        # increment the source row number for the next chunk/buffer processed
        # add_row_num is the first task in the list
        starting_row_num = chunk["rowNum"].max() + 1
        chunk_tasks[0] = Task(name="add_row_num", params={"starting_index": starting_row_num})

        # convert each csv row to FHIR resources
        chunk: Series = chunk.apply(_convert_row_to_fhir, axis=1)

        for processing_exception, group_by_key, fhir_resources in chunk:
            yield processing_exception, group_by_key, fhir_resources
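
The pattern underneath that excerpt is pandas' chunked csv reading, which is worth knowing on its own. A minimal sketch, reusing the nba.csv file from earlier (the chunksize of 100 is arbitrary):

# chunksize makes read_csv return an iterator of DataFrames,
# so a large file never has to fit in memory all at once
with pd.read_csv("nba.csv", chunksize=100) as reader:
    for chunk in reader:
        print(chunk.shape)   # each chunk is a regular DataFrame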

So this brings us to the end of this Snake_Byte. Hope this gave you a little taste of a great python library that is used throughout the industry.

Muzak To Blog By:

Mike Patton & The Metropole Orchestra – Mondo Cane – June 12th 2008 (Full Show) <- A true genius at work!

One other mention on the Muzak To Blog By must go to the fantastic Greek Composer, Evángelos Odysséas Papathanassíou (aka Vangelis) who recently passed away. We must not let the music be lost like tears in the rain, Vangelis’ music will live forever. Rest In Power, Maestro Vangelis. i have spent many countless hours listening to your muzak and now the sheep are truly dreaming. Listen here -> Memories Of Green.
