Snake_Byte[7]: Pandas (Not The Animal)

Groupings Of Pandas In A Frame

DISCLAIMER: This blog was written some time ago. Software breaks once in a while and there was a ghost in my LazyWebTM machine. We are back to our regularly scheduled program. Read on Dear Reader, and your humble narrator apologizes.

The other day i was talking to someone about file manipulations and whatnot, and i mentioned how awesome Pandas is and the magic of df.DoWhatEverYaWant(my_data_object) via a DataFrame. They weren’t really familiar with Pandas. Since no one knows everything, i figured i would write a Snake_Byte[] about Pandas. i believe i met the author of pandas, Wes McKinney, at a PyData conference years ago at Facebook. He is a really nice human and has created one of the most used libraries for data wrangling.

One of the most nagging issues with machine learning, in general, is access to high integrity canonical training sets or even just high integrity data sets writ large.

By my estimate, after years of building various types of learning systems and developing algorithms, machine learning is 80% data preparation, 10% data piping, 5% training, and 5% banging your head against the keyboard. Caveat Emptor – variable rates apply, depending on the industry vertical.

It is well known that there are basically three main attributes to the integrity of the data: complete, atomic, and well-annotated.

Complete data sets mean analytical results for all required influent and effluent constituents as specified in the effluent standard for a specific site on a specific date.

Atomic data sets are data elements that represent the lowest level of detail. For example, in a daily sales report, the individual items that are sold are atomic data, whereas roll-ups such as invoices and summary totals from invoices are aggregate data.

Well-annotated data sets are the categorization and labeling of data for ML applications. Training data must be properly categorized and annotated for a specific use case. With high-quality, human-powered data annotation, companies can build and improve ML implementations. This is where we get into issues such as Gold Standard Sets and Provenance of Data.

Installing Pandas:

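Note: Before you install Pandas, bear in mind that each release supports only a specific range of Python versions. The 1.4.x series used below requires Python 3.8 or newer, so check the pandas installation documentation for the release you plan to use.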

I am also assuming you are using some type of virtual environment.

As per the usual installation packages for most Python libraries:

pip install pandas

You can also choose to use a package manager in which case it’s probably already included.

# import pandas; pd is the industry shorthand
import pandas as pd
#check the version
pd.__version__
[2]: '1.4.3'

Ok we have it set up correctly.

So what is pandas?

Glad you asked. i have always thought of pandas as enhancing numpy, as pandas is built on numpy. numpy is the fundamental Python library for scientific computing. It provides high-performance multidimensional arrays and tools to deal with them. A numpy array is a grid of values (of the same type) indexed by a tuple of nonnegative integers. numpy arrays are fast, easy to understand, and let users perform calculations across whole arrays. pandas, on the other hand, provides high-performance, easy-to-use data structures and data analysis tools for manipulating tabular and numeric data and, most importantly, time series.

So let’s start with the pandas Series object, which is a one-dimensional array of indexed data that can be created from a list or an array:

data = pd.Series([0.1,0.2,0.3,0.4, 0.5])
data
[5]: 0    0.1
     1    0.2
     2    0.3
     3    0.4
     4    0.5
     dtype: float64

The cool thing about this output is that Series creates and wraps both a sequence and the related indices; ergo we can access both the values and index attributes. To double check this we can access values:

[6]: data.values
[6]: array([0.1, 0.2, 0.3, 0.4, 0.5])

and the index:

[7]: data.index
[7]: RangeIndex(start=0, stop=5, step=1)

You can access the associated values via the [ ] square brackets just like numpy however pandas.Series is much more flexible than the numpy counterpart that it emulates. They say imitation is the highest form of flattery.

Lets go grab some data from the LazyWebTM:

If one really thinks about the aspects of pandas.Series, it is really a specialized version of a python dictionary. For those unfamiliar, a dictionary (dict) is a python structure that maps arbitrary keys to a set of arbitrary values. Super powerful for data manipulation and data wrangling. Taking this a step further, pandas.Series is a structure that maps typed keys to a set of typed values. The typing is very important: just as the type-specific compiled code within numpy arrays makes them much more efficient than a python list, pandas.Series is in the same vein much more efficient than python dictionaries for certain operations. pandas.Series has an insane amount of commands:
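To make the dictionary analogy concrete, here is a minimal sketch. The board names and lengths are made-up illustration data, not anything from a real data set. It builds a Series from a dict and shows one thing a plain dict cannot do, namely slicing by label.

```python
import pandas as pd

# Hypothetical surfboard lengths in feet, keyed by board type.
board_lengths = {"longboard": 9.0, "shortboard": 6.0, "boogieboard": 3.5}

# The dict keys become the index and the dict values become the data.
boards = pd.Series(board_lengths)

# Dictionary-style lookup by key.
print(boards["longboard"])

# Label-based slicing (both endpoints included), which a dict does not offer.
print(boards["longboard":"shortboard"])

# All values share a single type, here float64.
print(boards.dtype)
```

The typed values are what let pandas hand the heavy lifting to numpy under the hood.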

Find Series Reference Here.

Next, we move to what i consider the most powerful aspect of pandas the DataFrame. A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

# Python code demonstrating how to create a
# DataFrame from a dict of lists.
# By default the row index is 0, 1, 2, ...
 
import pandas as pd
 
# initialise data as a dict of lists.
data = {'Name':['Bob', 'Carol', 'Alice', ''],
        'Age':[18, 20, 22, 24]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)
 [8]:
    Name  Age
0    Bob   18
1  Carol   20
2  Alice   22
3          24       

Let’s grab some data. nba.csv is a flat file of NBA player statistics:

Get the NBA data file here.

i don’t watch or follow sports so i don’t know what is in this file. Just did a google search for csv statistics and this file came up.

# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
 
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
 
 
print(first, "\n\n\n", second)
[9]:
Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object 


Team        Boston Celtics
Number                28.0
Position                SG
Age                   22.0
Height                 6-5
Weight               185.0
College      Georgia State
Salary           1148640.0
Name: R.J. Hunter, dtype: object

How nice is this? Easy Peasy. It seems almost too easy.

For reference here is the pandas.Dataframe reference documentation.

Just to show how far reaching pandas is now in the data science world: for all of you who think you may need to use Spark, there is a package called PySpark. In PySpark a DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions. Once created, it can be manipulated using the various domain-specific-language (DSL) functions, much like your beloved SQL.

Which might be another Snake_Byte in the future.

i also found pandas being used in ye ole #HealthIT #FHIR for, as we started this off, csv manipulation. Think of this Snake_Byte as an Ouroboros.

This github repo converts csv2fhir ( can haz interoperability? ):

with pd.read_csv(file_path, **csv_reader_params) as buffer:
        for chunk in buffer:

            chunk: DataFrame = execute(chunk_tasks, chunk)

            # increment the source row number for the next chunk/buffer processed
            # add_row_num is the first task in the list
            starting_row_num = chunk["rowNum"].max() + 1
            chunk_tasks[0] = Task(name="add_row_num", params={"starting_index": starting_row_num})

            chunk: Series = chunk.apply(_convert_row_to_fhir, axis=1)

            for processing_exception, group_by_key, fhir_resources in chunk:
                yield processing_exception, group_by_key, fhir_resources
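The repo snippet above leans on reading a csv in chunks. Here is a small self-contained sketch of just that piece, assuming pandas 1.2 or newer for the context-manager form. The in-memory csv text and the names in it are hypothetical stand-ins for a big file on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "name,age\nBob,18\nCarol,20\nAlice,22\nDave,24\nEve,26\n"

rows_seen = 0

# Passing chunksize makes read_csv return an iterator of DataFrames
# instead of loading everything into one frame.
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    for chunk in reader:
        rows_seen += len(chunk)
        # Each chunk is an ordinary DataFrame.
        print(chunk["age"].max())

print(rows_seen)
```

Each pass through the loop only holds chunksize rows in memory, which is the whole point when the file is larger than your RAM.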

So this brings us to the end of this Snake_Byte. Hope this gave you a little taste of a great python library that is used throughout the industry.

Muzak To Blog By:

Mike Patton & The Metropole Orchestra – Mondo Cane – June 12th 2008 (Full Show) <- A true genius at work!

One other mention on the Muzak To Blog By must go to the fantastic Greek Composer, Evángelos Odysséas Papathanassíou (aka Vangelis) who recently passed away. We must not let the music be lost like tears in the rain, Vangelis’ music will live forever. Rest In Power, Maestro Vangelis. i have spent many countless hours listening to your muzak and now the sheep are truly dreaming. Listen here -> Memories Of Green.

Snake_Byte[6] Algorithm Complexity

Your software design?

Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius and a lot of courage to move in the opposite direction.

E.F. Schumacher

First, i hope everyone is safe.

Second, i had meant this for reading over Thanksgiving but, transparently, i was having technical difficulties with LaTeX rendering, and it appears that both MathJax and native LaTeX are not working on my site. For those interested, i even injected the MathJax code into my .php header. Hence i had to rewrite a bunch of stuff, alas with no equations. Although for some reason unbeknownst to me my table worked.

Third, Hey its time for a Snake_Byte [] !

In this installment, i will be discussing Algorithm Complexity and will be using a Python method that i previously wrote about in Snake_Byte[5]: Range.

So what is algorithm complexity?  Well, you may remember in your mathematics or computer science classes “Big Oh” notation.  For those that don’t know this involves both space and time complexity not to be confused with Space-Time Continuums.  

Let’s hit the LazyWeb and particularly Wikipedia:

“Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. It is a member of a family of notations invented by Paul Bachmann, Edmund Landau, and others collectively called Bachmann–Landau notation or asymptotic notation.”

— Wikipedia’s definition of Big O notation

Hmmm.   Let’s try to parse that a little better shall we?

So you want to figure out how slow or hopefully how fast your code is using fancy algebraic terms and terminology.  So you want to measure the algorithmic behavior as a function of two variables: time complexity and space complexity.  Time is both the throughput as well as how fast the algorithm runs from t(0) to t(n-1).  Then we have space complexity, which is literally how much memory (either in memory or persistent memory) the algorithm requires as a function of the input.  As an added bonus you can throw around the word asymptotic:

From Dictionary.com

/ (ˌæsɪmˈtɒtɪk) / adjective. of or referring to an asymptote. (of a function, series, formula, etc) approaching a given value or condition, as a variable or an expression containing a variable approaches a limit, usually infinity.

Ergo asymptotic analysis means how the algorithm responds “to” or “with” values that approach ∞.

So “Hey what’s the asymptotic response of the algorithm?”

Hence we need a language that will allow us to say that the computing time, as a function of (n), grows ‘on the order of n^3,’ or ‘at most as fast as n^3,’ or ‘at least as fast as n log n,’ etc.

There are five symbols used in the language of comparing the rates of growth of functions: ‘o’ (read ‘is little oh of’), ‘O’ (read ‘is big oh of’), ‘Θ’ (read ‘is theta of’), ‘∼’ (read ‘is asymptotically equal to’ or, irreverently, as ‘twiddles’), and ‘Ω’ (read ‘is omega of’). It is interesting to note there are discrepancies amongst the ranks of computer science and mathematics as to the accuracy and validity of each. We will just keep it simple and say Big-Oh.

So let f(x) and g(x) be two functions of x. Each of the five symbols above is intended to compare the rapidity of growth of f and g. If we say that f(x) = o(g(x)), then informally we are saying that f grows more slowly than g does when x is very large.

Let’s address the time complexity piece. i don’t want to get philosophical on What is Time? So for now, and for this blog, i will treat it just like an arrow bounded from t(0) to t(n-1).

That said, the analysis of the algorithm is for an order of magnitude, not the actual running time. Python has a time module (and a timeit module) that we can use to measure the actual running time.  Remember, the point of complexity analysis is to save you time upfront by giving you an understanding of the time complexity before and while you are designing said algorithm.
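For the measuring side, here is a minimal sketch using time.perf_counter from the standard library. The function sum_to is a hypothetical example of mine, and the elapsed time will differ on every machine and every run.

```python
import time

def sum_to(n):
    """Add up 0 .. n-1 with a plain loop (linear in n)."""
    total = 0
    for i in range(n):
        total += i
    return total

# perf_counter is a high-resolution clock meant for measuring short durations.
start = time.perf_counter()
result = sum_to(100_000)
elapsed = time.perf_counter() - start

print(result)
print(f"{elapsed:.6f} seconds")
```

The number you get is one measurement on one machine. The order-of-magnitude analysis tells you how that number will grow as n grows.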

Most arithmetic operations are constant time; multiplication usually takes longer than addition and subtraction, and division takes even longer, but these run times don’t depend on the magnitude of the operands. Very large integers are an exception; in that case, the run time increases with the number of digits.

Indexing operations, whether reading or writing elements in a sequence or dictionary, are also constant time, regardless of the size of the data structure.

A for loop that traverses a sequence or dictionary is usually linear, as long as all of the operations in the body of the loop are constant time.

The built-in function sum is also linear because it does the same thing, but it tends to be faster because it is a more efficient implementation; in the language of algorithmic analysis, it has a smaller leading coefficient.

If you use the same loop to “add” a list of strings, the run time is quadratic because string concatenation is linear.

The string method join is usually faster because it is linear in the total length of the strings.
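Here is a small sketch of the two approaches side by side, using a made-up word list. Both produce the same string. The difference is in how much copying happens along the way as the list gets long.

```python
words = ["i", "wish", "you", "water"]

# Repeated concatenation: each += may copy everything built so far.
looped = ""
for w in words:
    looped += w + " "
looped = looped.rstrip()

# join does one pass over the pieces.
joined = " ".join(words)

print(joined)
print(looped == joined)
```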

So let’s look at an example using the aforementioned range built-in function.

The lowest complexity is O(1), constant time. A single loop over n items is O(n), much like the linear examples above. Now consider a loop nested inside another loop:


k = 0
for i in range(n):
    for j in range(m):
        print(i)
        k=k+1

In this case, for nested loops, we multiply the time complexities, thus O(n*m). It works the same way when a loop with time complexity O(n) calls a function with time complexity O(m). When calculating complexity we omit constant factors, regardless of whether a step executes 5 or 100 times.

When you are performing an analysis look for worst-case boundary conditions or examples.

Linear O(n):

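def has_no_zero(t, n):
    # wrapped in a function so the return statements are valid
    for i in range(n):
        if t[i] == 0:
            return 0
    # worst case: no zero found, so all n elements were visited
    return 1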

Quadratic O(n**2):

def count_steps(n):
    res = 0
    for i in range(n):
        # the inner loop also runs n times, hence n * n steps
        for j in range(n):
            res += 1
    return res

There are other types of time complexity like exponential time and factorial time. Exponential Time is O(2**n) and Factorial Time is O(n!).
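To get a feel for how quickly these blow up, here is a tiny sketch that prints the step counts side by side. The helper name growth is just something i made up for illustration.

```python
import math

def growth(n):
    """Return (n**2, 2**n, n!) for a given input size n."""
    return n ** 2, 2 ** n, math.factorial(n)

for n in (5, 10, 20):
    quadratic, exponential, factorial = growth(n)
    print(n, quadratic, exponential, factorial)
```

By n = 20 the quadratic column is still 400 while the factorial column has nineteen digits.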

For space complexity, memory has a limit, as you know if you have ever chased down a heap allocation or garbage collection bug. There is no free lunch: you either trade space for time or time for space. Data-driven architectures respond to the input size of the data. Thus the dimensionality of the input space needs to be addressed. If you have a constant number of variables: O(1). If you need to declare an array, using numpy for instance, with (n) elements then you have linear space complexity O(n). Remember that O(1) space is independent of the size of the problem, while O(n) space grows with it.

For a great book on Algorithm Design and Analysis i highly recommend:

The Algorithm Design Manual by Steven S. Skiena (click it takes you to amazon)

It goes in-depth into growth rates and dominance relations etc. as they relate to graph algorithms, searching and sorting, as well as cryptographic functions.

There are also Algorithms Unlocked by Cormen and the Algorithms Illuminated series by Roughgarden, which are great and less mathematically rigorous if that is not your forte.

Well, i hope this gave you a taste. i had meant this to be a much longer and more in-depth blog however i need to fix this latex issue so i can properly address the matters at hand.

Until then,

#iwishyouwater <- Alexey Molchanov new world freedive record. He is a really awesome human.

Muzak To Blog By: Maddalena (Original Motion Picture Soundtrack) by the Maestro Ennio Morricone – Rest in Power Maestro i have spent many hours listening to your works.

Snake_Byte[5]: Range

Now… We are going in a loop.

~ Ramakrishna, Springs of Indian Wisdom
Loops All The Way Down

First, i trust everyone is safe.

Second, i will be moving the frequency of Snake_Bytes [] to every other Wednesday. This is to provide higher quality information and also to allow me space and time to write other blogs. i trust dear reader y’all do not mind.

Third, i noticed i was remiss in explaining a function i used in a previous Snake_Byte [ ] that of the Python built-in function called range.

Range is a very useful function for, well, creating iterations on variables and loops.

# let's see how this works:
# in Python 3 range() is lazy, so wrap it in list() to see the values
list(range(4))
[0, 1, 2, 3]

How easy can that be?

Four items were returned. Now we can create a range or a for loop over that list – very meta huh?

Please note in the above example the list starts off with 0. So what if you want your range function to start with 1 base index instead of 0? You can specify that in the range function:

# Start with 1 for initial index
list(range(1, 4))
[1, 2, 3]

Note the last number: the stop value itself is not included, so to be inclusive of 4 you would write range(1, 5).
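A quick sketch of the three arguments range accepts (start, stop, and step), wrapped in list() so the values are visible. The variable names are mine.

```python
# stop is exclusive, so this runs 1 through 4.
one_to_four = list(range(1, 5))

# A third argument sets the step size.
evens = list(range(0, 10, 2))

# A negative step counts down; stop is still exclusive.
countdown = list(range(5, 0, -1))

print(one_to_four)
print(evens)
print(countdown)
```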

Let’s try something a little more advanced with some eye candy:

%matplotlib inline
import matplotlib.pyplot as plt

x_cords = range(-50,50)
y_cords = [x*x for x in x_cords]

plt.plot(x_cords, y_cords)
plt.show()

X^2 Function aka Parabola

We passed a computation into the loop to compute over the indices of range x in this case.

In one of the previous Snake_Bytes[] i utilized a for loop and range which is extremely powerful to iterate over sequences:

for i in range (3):
    print(i,"Pythons")
0 Pythons
1 Pythons 
2 Pythons

For those that really need power when it comes to indexing, sequencing and iteration you can change the list for instance, as we move across it. For example:

L = [1,2,3,4,5,6]
# now add one to each element,
# i.e. L[i] = L[i] + 1; used all
# the time in matrix operations
for i in range(len(L)): 
    L[i] += 1
print (L)
[2,3,4,5,6,7]

Note there is a more “slick” way to do this with list comprehension without changing the original list in place. However, that’s outside the scope if you will of this Snake_Byte[] . Maybe i should do that for the next one?
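For the curious, here is a minimal sketch of that list comprehension version. The name L_plus_one is mine. It builds a new list and leaves the original untouched.

```python
L = [1, 2, 3, 4, 5, 6]

# Build a new list with one added to each element; L itself is not modified.
L_plus_one = [x + 1 for x in L]

print(L)
print(L_plus_one)
```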

Well, i hope you have a slight idea of the power of range.

Also, i think this was more “byte-able” and not tl;dr. Let me know!

Until Then,

#iwishyouwater <- another good one here click!

@tctjr

Muzak To Blog By: Roger Eno & Brian Eno – Mixing Colors (this album is spectacular)

Snake_Byte[4]: Random and PseudoRandom Numbers

Expose yourself to as much randomness as possible.


~ Ben Casnocha
A Visualization Of Randomness

First i trust everyone is safe.

Second it is WEDNESDAY and that must mean a Snake_Byte or you are working in a startup because every day is WEDNESDAY in a startup!

i almost didn’t get this one done because well life happens but i want to remain true to the goals herewith to the best of my ability.

So in today’s Snake_Byte we are going to cover Random and PseudoRandom Numbers.  i really liked this one because it was more in line with scientific computing and numerical optimization.

The random module in Python generates what are called pseudorandom numbers.  It is in the vernacular a pseudorandom number generator (PRNG).  This generation includes different types of distributions for said numbers. 

So what is a pseudorandom number:

“A pseudorandom number generator (PRNG), also known as a deterministic random bit generator, is an algorithm for generating a sequence of numbers whose properties approximate the properties of sequences of random numbers.” ~ Wikipedia

The important aspect here is:  the properties approximate sequences of random numbers.  So this means that it is statistically random even though it was generated by a deterministic response.

While i have used the random module and have even generated various random number algorithms i learned something new in this blog.  The pseudorandom number generator in Python uses an algorithm called the Mersenne Twister algorithm.  The period of said algorithm is length 2**19937-1 for the 32 bit version and there is also a 64-bit version.  The underlying implementation in C is both fast and thread-safe. The Mersenne Twister is one of the most extensively tested random number generators in existence. One issue though is that due to the deterministic nature of the algorithm it is not suitable for cryptographic methods.
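The deterministic part is easy to see for yourself. In this sketch the seed value 2718 is arbitrary. Seeding the generator twice with the same value replays the same sequence. For anything security related the standard library has the secrets module, shown here only as a pointer.

```python
import random
import secrets

# Same seed, same sequence: the generator is deterministic.
random.seed(2718)
first = [random.random() for _ in range(3)]

random.seed(2718)
second = [random.random() for _ in range(3)]

print(first == second)

# For tokens, passwords and the like, use secrets instead of random.
token = secrets.token_hex(16)  # 16 random bytes as 32 hex characters
print(len(token))
```

Reproducibility is great for simulations and tests, and it is exactly why you do not want this generator making your keys.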

Let us delve down into some code into the various random module offerings, shall we?

i like using %system in Jupyter Lab to create an interactive session. First we import random. Let’s look at random.random(), which returns a float drawn uniformly from [0.0, 1.0); multiplying it by an integer scales it to the range from 0 up to that integer:

%system
import random
for i in range (5):
    x = random.random() * 100
    print (x)
63.281889167063035
0.13679757425121286
47.697874648329
96.66882808709684
76.63300711554905

Next let us check out random.choice(seq) which returns a random element from the non-empty sequence seq. If seq is empty, raises IndexError:

mySurfBoardlist = ["longboard", "shortboard", "boogieboard"]
for z in range(5):
    print(random.choice(mySurfBoardlist))
longboard
boogieboard
boogieboard
longboard
shortboard

Next let us look at random.randrange(start, stop[, step]) which returns a randomly selected element from range(start, stop, step). This is equivalent to choice(range(start, stop, step)) but doesn’t actually build a range object.

Parameter | Description
start | Optional. An integer specifying at which position to start. Default 0
stop | Required. An integer specifying at which position to end.
step | Optional. An integer specifying the incrementation. Default 1
random.randrange parameters
for i in range (5): 
      print(random.randrange(10, 100,1))
84
21
94
91
87

Now let us move on to some calls that you would use in signal processing, statistics or machine learning. The first one is gauss(). gauss() returns a random float drawn from a Gaussian distribution with the following probability density:

\[\Large f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)\]

Gaussian distribution (also known as normal distribution) is a bell-shaped curve (aka the bell curve), and it is assumed that during any measurement values will follow a normal distribution with an equal number of measurements above and below the mean value.

Parameter | Description
mu | the mean
sigma | the standard deviation
Returns a random floating-point number drawn from a Gaussian distribution.
gauss() parameters
# import the required libraries 
import random 
import matplotlib.pyplot as plt 
#set the inline magic
%matplotlib inline   
# store the random numbers in a list 
nums = [] 
mu = 100
sigma = 50
    
for i in range(100000): 
    temp = random.gauss(mu, sigma) 
    nums.append(temp) 
        
# plot the distribution 
plt.hist(nums, bins = 500, ec="red") 
plt.show()
Gaussian Distribution in Red
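If you would rather check the numbers than squint at a histogram, here is a small sketch using the statistics module from the standard library. The seed is arbitrary and only there so the run is repeatable. The sample mean and standard deviation should land close to mu and sigma, though not exactly on them.

```python
import random
import statistics

random.seed(1234)  # arbitrary seed so the run is repeatable

mu, sigma = 100, 50
nums = [random.gauss(mu, sigma) for _ in range(100_000)]

# With this many samples both estimates should be near the true parameters.
sample_mean = statistics.fmean(nums)
sample_std = statistics.stdev(nums)

print(round(sample_mean, 2))
print(round(sample_std, 2))
```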

There are several more functions in the random module: setter functions, seed functions and very complex statistical functions. Hit stack overflow and give it a try! Also it doesn’t hurt if you dust off that probability and statistics textbook!

As a last thought, which came first: the framework of entropy or the framework of randomness? And is everything truly random? i would love to hear your thoughts in the comments!

Until then,

#iwishyouwater <- click here on this one!

tctjr

References:

Python In A Nutshell by Alex Martelli

M. Matsumoto and T. Nishimura, “Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator”, ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, pp. 3–30, January 1998

Muzak To Blog By:  Black Sabbath  – The End: Live In Birmingham

Snake_Byte[3]: Getting Strung Out

There is geometry in the humming of the strings, there is music in the spacing of the spheres.

Pythagoras
We are not talking Guitar Strings

First, i  trust everyone is safe.

Second, this is the SB[3].  We are going to be covering some basics in Python of what constitutes a string, modifying a string, and explaining several string manipulation methods.

I also realized in the last Snake_Byte that i didn’t reference the book that i randomly open and choose the subject for the Snake_Byte.  I will be adding that as a reference at the end of the blog.

Strings can be used to represent just about anything.

They can be binary values of bytes, internet addresses, names, and Unicode for international localization.

They are part of a larger class of objects called sequences.  In fact, python strings are immutable sequences.  Immutability means you cannot change the sequence or the sequence does not change over time.

The most simplistic string is an empty string:

a = ""  # with either single or double quotes

There are numerous expression operations, modules, and methods for string manipulations.  

Python also supports much more advanced operations. For those familiar with regular expressions (regex), it supports them via the re module. Even more advanced operations are available, such as XML parsing and the like.
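As a small taste of the re module, here is a sketch that pulls the numbers out of a made-up sentence. The text and pattern are mine, purely for illustration.

```python
import re

text = "Snake_Byte[3] follows Snake_Byte[2]"

# \[ and \] match literal brackets; (\d+) captures the digits between them.
numbers = re.findall(r"Snake_Byte\[(\d+)\]", text)

print(numbers)
```

Note that findall hands back the captured pieces as strings, so convert with int() if you need to do arithmetic on them.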

Python is really into strings.

So let us get literal, shall we?

For String Literals there are countless ways to create and manipulate strings in your code:

Single Quotes:

a = 'i w"ish you water'

Double Quotes:

A = "i w'ish you water"

Even triple quotes (made me think of the “tres commas” episode from Silicon Valley)

A = """... i wish you water """

Single and double quotes are by far the most used.  I prefer double quotes probably due to the other languages i learned before Python.

Python also supports the liberal use of backslashes aka escape sequences.  I’m sure everyone is familiar with said character `\`.  

Escape sequences let us embed byte codes into strings that are otherwise difficult to type.

So let’s see here:

s = 't\nc\nt\njr'
print (s)
t
c
t
jr

So here i used ‘\n’ to represent the byte containing the binary value for newline character which is ASCII code 10.  There are several accessible representations:

'\a' # bell

'\b' # backspace

'\f' # formfeed for all the dot matrix printers we use

'\r' # carriage return

You can even do different Unicode hex values:

'\Uhhhhhhhh' # 32-bit hex value, count the number of h’s

With respect to binary file representations, note that in Python 3.0 and later binary file content is represented by an actual byte string (the bytes type), with operations similar to normal strings.

One big difference between Python and a language like C is that the zero (null) byte doesn’t terminate a string; in fact, there are no character string terminators in Python.  Instead, both the string’s length and its text are stored in memory.

s = 'a\0b\0c'
print (s)
len (s)
abc
5
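The null bytes are still in there even though they do not show up when printed. A quick sketch with repr makes them visible.

```python
s = 'a\0b\0c'

# repr shows the escape sequences instead of the invisible characters.
visible = repr(s)
print(visible)

# Five characters: a, null, b, null, c.
print(len(s))

# The null byte is an ordinary character, so we can split on it.
pieces = s.split('\0')
print(pieces)
```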

So what can we do with strings in Python?

Well, we can concatenate:

a = "i wish"
print(len (a))
b = " you water"
print (len(b))
c = a + b
print (len(c))
print (c)
6
10
16
i wish you water

So adding two strings creates a new string object and a new address in memory.  It is also a form of operator overloading in place.  The ‘ + ‘ sign does the job for strings and can add numerics.  You also don’t have to “pre-declare” and allocate memory which is one of the advantages of Python.  In Python, computational processes are described directly by the programmer. A declarative language abstracts away procedural details however Python isn’t purely declarative which is outside the scope of the blog.  

So what else?  Well, there is indexing and slicing:

Strings are ordered collections of characters ergo we can access the characters by the positions within the ordering.

You access a component by providing a numerical offset via square brackets. This is indexing.

S = "i wish you water"
print (S[0], S[4], S[-1])
i s r

Since we can index we can slice:

S = "i wish you water"
print (S[1:3], S[2:10], S[9:10])
w wish you u

Slicing is a particular form of indexing more akin to parsing where you analyze the structure.

Python once again creates a new object containing the contiguous section identified by the offset pair.  It is important to note the left offset is taken to be the inclusive lower bound and the right is the non-inclusive upper bound. For contrast, “the interval from 1 to 2, inclusive” means the closed interval written [1, 2], which includes both endpoints.  A Python slice is instead half-open, [lower, upper): Python fetches all items from the lower bound up to but not including the upper bound.
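A short sketch of why the half-open convention is handy, using the same string as above. The variable names are mine.

```python
S = "i wish you water"

# Offsets 2, 3, 4, 5 are fetched; offset 6 is not.
word = S[2:6]
print(word)

# The length of a slice is simply upper minus lower.
print(len(word) == 6 - 2)

# Two slices that share a boundary rebuild the original with no overlap.
print(S[:6] + S[6:] == S)
```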

What about changing a string?

Let’s try it:

S = "i wish you water"
S[0] = "x"
---------------------------------------------------------------------------
TypeError Traceback (most recent call last) <ipython-input-67-a6fd56571822> 
in <module> 1 S = "i wish you water" ----> 2 S[0] = "x"
TypeError: 'str' object does not support item assignment

Ok, what just happened?  Well, remember the word immutable? You cannot change it in place.

To change a string you need to create a new one through various methods.  In the current case we will use a combination of concatenation, indexing, and slicing to bring it all together:

S = "i wish you water"
S = 'x ' + S[2]  +  S[3:17]
print (S)
x wish you water

This brings us to methods.

Strings in Python provide a set of methods that implement much more complex text processing.  Just like in other languages a method or function takes parameters and returns a value. A “method” is a specific type of function: it must be part of a “class”, so has access to the class’ member variables. A function is usually discrete and all variables must be passed into the function.

Given the previous example there is a replace method:

S = "i wish you water"
S = S.replace ('i wish you water', 'x wish you water')
print (S)
x wish you water

Let’s try some other methods:

# capitalize the first letter in a string:
S = "i wish you water"
S.capitalize()
'I wish you water'

# capitalize all the letters in a string:
S = "i wish you water"
S.upper()
'I WISH YOU WATER'

# check if the string is a digit:
S = "i wish you water"
S.isdigit()
False

# check it again:
S = "999"
S.isdigit()
True

# strip trailing spaces in a string:
S = "i wish you water     "
x = S.rstrip()
print("of all fruits", x, "is my favorite") 
of all fruits i wish you water is my favorite

The list is seemingly endless.  

One more caveat emptor: you should use string methods, not the old string module functions, which were deprecated and then removed in Python 3.0.

We could in fact write multiple chapters on strings by themselves.  However, this is supposed to be a little nibble of what the Snake language can offer.  We have added the reference that we used to make this blog at the end.  I believe it is one of the best books out there for learning Python.

Until Then,

#iwishyouwater

Tctjr

References:

Learning Python by Mark Lutz

Muzak To Blog By:  Mr. Robot Vol1 Original Television Soundtrack

Snake_Byte[2]: Comparisons and Equality

Contrariwise, continued Tweedledee, if it was so, it might be, and if it were so, it would be; but as it isn’t, it ain’t. That’s logic!

TweedleDee
It’s all rational isn’t it?

First, i trust everyone is safe.

Second, i am going to be pushing a blog out every Wednesday called Snake_Bytes.  This is the second one hot off the press.  Snake as in Python and Bytes as in well you get it. Yes, it is a bad pun but hey most are bad. 

i will pick one of the myriad Python-based books i have in my library and randomly open it to a page.  No matter how basic or advanced, i will start from there and create a short, concise blog on said subject.  For some, possibly many, the content will be rather pedantic; for others, i hope you gain a little insight.  As a former professor told me, “to know a subject in many ways is to know it well.”  Just like martial arts or music, performing the basics hopefully makes everything else effortless at some point.

Ok so in today’s installment we have Comparison and Equality.

I suppose more philosophically what is the Truth?

All Python objects at some level respond to some form of comparison, such as a test for equality, a magnitude comparison, or even a plain True or False.

For all comparisons in Python, the language traverses all parts of compound objects until a result can be ascertained and this includes nested objects and data structures.  The traversal for data structures is applied recursively from left to right.  

So let us jump into some simple snippets, starting with list objects.

List objects compare all of their components automatically.

%system #command line majik in Jupyterlab
# same value with unique objects
A1 = [2, ('b', 3)]
A2 = [2, ('b', 3)]

#Are they equivalent?  Same Objects?
A1 == A2, A1 is A2
(True, False)

 So what happened here?  A1 and A2 are assigned lists which in fact are equivalent but distinct objects.  

So for comparisons how does that work?

  •  The ==  tests value equivalence

Python recursively tests nested comparisons until a result is ascertained.

  • The is operator tests object identity

Python tests whether the two are really the same object and live at the same address in memory.
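A minimal sketch of the difference between the two operators. The extra name A3 is my own addition for illustration. It is bound to the same list as A1, so a change made through one name shows up through the other.

```python
A1 = [2, ('b', 3)]
A2 = [2, ('b', 3)]   # same value, separate object
A3 = A1              # second name for the very same object

print(A1 == A2, A1 is A2)    # True False
print(A1 == A3, A1 is A3)    # True True

# id() exposes the identity that "is" compares.
print(id(A1) == id(A3))      # True

# Mutating through A3 is visible through A1, but A2 is unaffected.
A3.append(4)
print(A1)                    # [2, ('b', 3), 4]
print(A2)                    # [2, ('b', 3)]
```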

So let’s compare some strings, shall we?

StringThing1 = "water"
StringThing2 = "water"
StringThing1 == StringThing2, StringThing1 is StringThing2
(True, True)

Ok, what just happened?  We need to be very careful here, and i have seen this cause some really ugly bugs when performing long-chained regex stuff with health data.  Python internally caches and reuses some strings as an optimization technique.  Here there is really just a single string ‘water’ in memory shared by StringThing1 and StringThing2, thus the identity operator evaluates to True.

To see the normal behavior, use a longer string (the result can vary with the interpreter and with how the code is run):

StringThing1 = "i wish you water"
StringThing2 = "i wish you water"
StringThing1 == StringThing2, StringThing1 is StringThing2
(True, False)
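Since the caching is an implementation detail, the safe habit is to compare string values with == and never with is. Here is a hedged sketch, assuming CPython: the strings are built at runtime so the compiler cannot share them, and sys.intern is the standard-library call that forces sharing when you really want it.

```python
import sys

# Build the same text twice at runtime; on CPython these are two objects.
a = "".join(["i wish", " you water"])
b = "".join(["i wish", " you water"])

print(a == b)    # True: same value, always
print(a is b)    # False on CPython: distinct objects

# sys.intern returns one canonical shared copy for equal strings.
c = sys.intern(a)
d = sys.intern(b)
print(c is d)    # True
```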

Given the logic of this, let’s see how we have conditional logic comparisons.

I believe Python 2.5 introduced ternary operators.  Once again interesting word:

Ternary means composed of three parts, or using three as a base.

The operators are the fabled if/else you see in almost all programming languages.

whentrue if condition else whenfalse

The condition is evaluated first.  If condition is true the result is whentrue; otherwise the result is whenfalse.  Only one of the two subexpressions whentrue and whenfalse evaluates depending on the truth value of condition.

Stylistically you want to place parentheses around the whole expression.

Here is an example of the operator.  This was taken directly out of the Python docs with a slight change, as i thought it was funny:
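The "only one of the two subexpressions evaluates" rule can be checked with a small sketch. The helper functions and the calls list are hypothetical names of mine, there only to record which branch actually ran.

```python
calls = []  # records which branch functions were executed

def when_true():
    calls.append("when_true")
    return "nice"

def when_false():
    calls.append("when_false")
    return "not nice"

condition = True

# Parentheses around the whole expression, per the style note.
state = (when_true() if condition else when_false())

print(state)   # nice
print(calls)   # ['when_true']  (when_false was never called)
```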

is_nice = True
state = "nice" if is_nice else "ain’t nice"
print(state)

Which also shows how Python treats True and False.

In most programming languages an integer 0 is false and an integer 1 is true.

Python follows that convention too, and it also looks at an empty data structure as False.  True and False, as illustrated above, are inherent properties of every object in Python.
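A quick way to see the truth value of any object is the built-in bool(). The sample values below are my own picks for illustration.

```python
samples = [0, 1, "", "water", [], [None], {}, None]

# bool() reports how each object behaves in an if/else test.
truth_values = [bool(value) for value in samples]

for value, truth in zip(samples, truth_values):
    print(repr(value), "->", truth)

# Zero, empty containers, the empty string, and None are all False;
# note that [None] is True because the list itself is not empty.
```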

So in general Python compares types as follows:

  • Numbers are compared by the relative magnitude
  • Non-numeric mixed-type comparisons such as (3 < ‘water’) don’t fly in Python 3.0.  However, they are allowed in Python 2.6, where they use a fixed arbitrary rule.  The same goes for sorts: non-numeric mixed-type collections cannot be sorted in Python 3.0.
  • Strings are compared lexicographically (ok cool word what does it mean?). In mathematics, the lexicographic or lexicographical order is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a totally ordered set. In other words, like a dictionary: character by character, where (“abc” < “ac”).
  • Lists and tuples are compared component by component left to right
  • Dictionaries are compared as equal if their sorted (key, value) lists are equal.  However, relative magnitude comparisons are not supported in Python 3.0
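The rules in the list above can be exercised one by one in Python 3. This is only a sketch, and the variable names are mine.

```python
# Numbers: relative magnitude, even across int and float.
print(3 < 10.5)                                   # True

# Strings: character by character, like a dictionary.
print("abc" < "ac")                               # True

# Lists and tuples: component by component, left to right.
print([1, ('a', 3)] < [1, ('a', 4)])              # True

# Dictionaries: equality ignores the order of the keys.
print({'a': 1, 'b': 2} == {'b': 2, 'a': 1})       # True

# Mixed non-numeric types: comparing and sorting both raise TypeError.
mixed_compare_failed = False
try:
    result = 3 < "water"
except TypeError as err:
    mixed_compare_failed = True
    print("TypeError:", err)

mixed_sort_failed = False
try:
    sorted([3, "water", 1])
except TypeError as err:
    mixed_sort_failed = True
    print("TypeError:", err)
```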

With structured objects, as one would think, the comparison happens as though you had written the objects as literals and compared all the components one at a time, left to right.

Further, you can chain the comparisons such as:

a < b <= c < d

Which functionally is the same thing as:

a < b and b <= c and c < d

The chain form is more compact and more readable and evaluates each subexpression once at the most.
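The "once at the most" part is easiest to see when the middle term is a function call. In this sketch the function b and the calls list are hypothetical, there only to count evaluations.

```python
calls = []  # one entry per evaluation of b()

def b():
    calls.append("b")
    return 5

a, c, d = 1, 5, 9

# Chained form: b() is evaluated a single time.
chained = a < b() <= c < d
print(chained, calls)        # True ['b']

# Spelled-out form: b() appears twice, so it runs twice.
calls.clear()
spelled_out = a < b() and b() <= c and c < d
print(spelled_out, calls)    # True ['b', 'b']
```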

Being that most reading this should be using Python 3.0, a couple of words on dictionaries per the last commentary.  In Python 2.6, dictionaries supported magnitude comparisons as though you were comparing sorted (key, value) lists.

In Python 3.0, magnitude comparisons for dictionaries were removed because they incur too much overhead when all you want is equality.  Python 3.0, from what i can gather, uses an optimized scheme for equality comparisons.  So for magnitude comparisons you write loops or compare the sorted (key, value) lists manually.  Once again, no free lunch. The documentation can be found here: Ordering Comparisons in Python 3.0.
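A minimal sketch of the manual route in Python 3, assuming the keys and values are themselves comparable. D1 and D2 are made-up sample dictionaries.

```python
D1 = {'a': 1, 'b': 2}
D2 = {'a': 1, 'b': 3}

# Equality still works directly.
print(D1 == D2)                                   # False

# Magnitude comparison on the dictionaries themselves raises TypeError.
direct_compare_failed = False
try:
    D1 < D2
except TypeError as err:
    direct_compare_failed = True
    print("TypeError:", err)

# Manual version: compare the sorted (key, value) lists instead.
manual_less_than = sorted(D1.items()) < sorted(D2.items())
print(manual_less_than)                           # True
```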

One last thing.  There is a special object called None.  It’s a special data type in Python, in fact i think the only special data type.  None is roughly analogous to a NULL pointer in C.

This comes in handy if your list size is not known:

MyList = [None] * 50
print(MyList)
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

The output makes me think of a Monty Python skit. See what I did there? While the comparison to a NULL pointer is apt, [None] * 50 does not limit the size of the list; it presets an initial size so that future index assignments work. In this way, it kind of reminds me of malloc in C.  Purists, please don’t shoot the messenger.
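Here is a small sketch of why the preset size matters, using a shorter list than the one above. The names Preset and Empty are mine.

```python
Preset = [None] * 5
Empty = []

# Index assignment works because the slot already exists.
Preset[3] = "water"
print(Preset)                 # [None, None, None, 'water', None]

# The same assignment on an empty list fails.
empty_assign_failed = False
try:
    Empty[3] = "water"
except IndexError as err:
    empty_assign_failed = True
    print("IndexError:", err)

# None is a single shared object, so test for it with "is".
print(Preset[0] is None)      # True

# The preset size is only a starting point; the list can still grow.
Preset.append("more water")
print(len(Preset))            # 6
```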

Well, i got a little long in the tooth as they say.  See what i did again?  Teeth, Snakes and Python.

See y’all next week.

Until Then,

#iwishyouwater

@tctjr

Muzak To Blog By: various tunes by Pink Martini, Pixies, Steve Miller.

Snake_Byte[1]_PyForest

The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death.

Guido van Rossum

Hi all. First, as always, i trust everyone is doing well and safe.

Second, i had started writing another blog on some first principles design issues in machine learning, but this morning while i was just browsing i came across a Python library called Pyforest. Pyforest claims to have 99% of your import library woes solved.

At one of my previous companies, we created a flow where, from the time you walked in, got your rig, and sat down, you had access to a superpack.tar.gz with all of the necessary Python dependencies, and even any bash scripts you might need, so you could start doing PRs the same day you started work.  This was pre-anaconda days, so it worked well; then most moved over to dependency management via anaconda.  However, this didn’t solve one of the main issues.  What, when, and how do you import?

i am sure that, if you are like me, you keep the proverbial “untitled.ipynb” sitting around just as a notepad of sorts for the main imports (just don’t click and press X accidentally!).

Which is where pyforest comes into the reptilian purview.

The github is funny. It says:

“pyforest – feel the bliss of automated imports.”

Then it goes on to say:

“Writing the same imports over and over again is below your capacity. Let pyforest do the job for you.”

Being that this isn’t supposed to be a tl;dr blog (only a little nibble from a reptile), let’s get started.

Installation:

You need to have python version 3.6 or above.  The github is once again funny (we like f-strings).  

So first make sure you are in your venv. 

 python3 --version
 pip install --upgrade pyforest
 python -m pyforest install_extensions 

Lo and behold:

Downloading pyforest-1.0.3.tar.gz

Needless to say i was skeptical.  

Questions – autocomplete? Stomping on variables?  Grinding to a halt because maybe import *?

Nope. Here is proof:

Ok nice parlor trick.

Let’s try plotting something because i always space out and just type plt:

Ok now you have my attention.

i then tried a simple linear regression with sklearn

So this library definitely saves you time and the folks over at bamboolib have a great sense of humor which i really appreciate.

You can see a list of the imported libraries with dir(pyforest), kinda like a micro pip freeze.
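The pyforest README describes the trick as lazy imports: a library is only really imported when you first use it. The sketch below is NOT pyforest's code. It is a toy illustration of the lazy-import idea using the standard library's importlib, and the class name LazyModule and the alias js are hypothetical.

```python
import importlib

class LazyModule:
    """Hypothetical placeholder that imports its module on first use."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None   # nothing imported yet

    def __getattr__(self, attr):
        # Reached only for names not found on the placeholder itself,
        # that is, anything the real module is supposed to provide.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, attr)

# Bind a short alias up front, like "import json as js" would.
js = LazyModule("json")
not_loaded_yet = js._module is None
print(not_loaded_yet)             # True: placeholder only

# First real use triggers the import behind the scenes.
print(js.dumps({"water": 1}))     # {"water": 1}
print(js._module is None)         # False
```

This is why unused aliases cost next to nothing: until you touch one, only the tiny placeholder exists.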

Here is the github: pyforest github.

+1 for my recommendation.

until then,

#iwishyouwater

@tctjr

Muzak To Blog To: Cold Fact by Rodriguez 

Fwiw if you get a chance watch “Searching for Sugar Man” which is a documentary about Rodriguez.  Astounding.