Part 1: PANDAS SERIES AND DATAFRAME OBJECT

Learn how to create and work with Pandas Series, DataFrame and Index objects.

In this first part, we will introduce two primary components of Pandas — Series and DataFrame objects.

1. PANDAS SERIES OBJECT

Pandas Series is a one-dimensional array of indexed data

  • A Pandas Series is essentially a single column of data
  • Values inside a Numpy array have an implicitly defined integer index, whereas a Pandas Series has an explicitly defined index, which can be of integer or any other data type
  • A Pandas Series object can be created from a list, an array or a dictionary

The Pandas Series constructor has the following common parameters:

pd.Series(data=, index=, dtype=)
  • data= keyword argument: data= is the first parameter of the pd.Series() constructor, so we don’t need to name it explicitly if we provide the data as the first positional argument
  • index= keyword argument: The default index is the integers from 0 to n-1, where n is the number of elements in the Series. However, we can specify a custom index using the index= keyword argument. These integers or other values are collectively called the index of the Series, and each individual index element is called a label
  • dtype= keyword argument: used to explicitly set the data type of the Series object
  • Additional parameters include name and copy
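
As a minimal sketch of the parameters listed above, the following builds one Series with a custom index, an explicit dtype and a name. The variable name scores and its values are made up for illustration:

```python
import pandas as pd

# data is passed positionally; index, dtype and name are keyword arguments
scores = pd.Series([1, 2, 3],
                   index=['a', 'b', 'c'],
                   dtype='float64',
                   name='scores')

print(scores)
print(scores.dtype)   # the integers were stored as float64
print(scores.name)    # the name attached to the Series
```

Here dtype= overrides the int64 type that pandas would otherwise infer from the list of integers.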

1.1. Creating Pandas Series object

# importing pandas and numpy
import pandas as pd
import numpy as np
# creating panda series object
pd_series = pd.Series([0.25,0.50,0.75,1.0])

# printing panda series object
pd_series
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The output shows the sequence of index labels [0,1,2,3] alongside the corresponding sequence of values [0.25,0.50,0.75,1.0]

We can use the built-in attributes of a Pandas object to fetch these indices and values

a. .values attribute

We use the .values attribute to get the values of a Series object

# fetch values of given Series
pd_series.values
array([0.25, 0.5 , 0.75, 1.  ])

b. .index attribute

We use the .index attribute to get the indices of a Series object

# fetch indices of given Series
pd_series.index
RangeIndex(start=0, stop=4, step=1)

1.2. Pandas Series as Generalized Numpy Array

We will first create a Pandas Series object by providing the data as an explicit list. The index is automatically set to the integers from 0 to n-1:

# creating Series object from list
pd.Series([1,2,3,4])
0    1
1    2
2    3
3    4
dtype: int64

We can also create the Series object by providing a previously defined 1D Numpy array

# defining numpy array
arr = np.array([1,2,3,4])

# creating Series from Numpy Array
pd.Series(arr)
0    1
1    2
2    3
3    4
dtype: int64

Numpy Array vs Pandas Series:

Unlike a Numpy array, which has an implicit integer index, the index of a Pandas object can be of any data type (int, float, str or a combination of them). Let’s explicitly set a string-based index:

# string as index
data_index_string = pd.Series([0.25,0.50,0.75,1.0],
                             index=['w','x','y','z'])
data_index_string
w    0.25
x    0.50
y    0.75
z    1.00
dtype: float64

1.3. Pandas Series as Specialized Dictionary

A Pandas Series object can also be created from a dictionary. To understand the conceptual parallel, remember this:

  • A dictionary is a structure that maps arbitrary keys to a set of arbitrary values
  • A Series is a structure that maps typed keys to a set of typed values
# defining dictionary, key-value pairs
population_dict = {'California': 38332521, 
                   'Texas': 26448193, 
                   'New York': 19651127, 
                   'Florida': 19552860, 
                   'Illinois': 12882135}

# creating Series from dictionary
population_series = pd.Series(population_dict)
population_series
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

Indexing: We can use index label to fetch the corresponding value

population_series['New York']
19651127

The same value can also be fetched by its integer position. As New York is at position 2, we can use .iloc for this (plain population_series[2] on a label-indexed Series is deprecated in recent pandas versions):

population_series.iloc[2]
19651127
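
To make the dictionary parallel concrete, here is a small sketch using the same population data. It shows dictionary-style operations that a Series supports, plus label slicing, which a plain dictionary does not offer. The variable name subset is only illustrative:

```python
import pandas as pd

population_series = pd.Series({'California': 38332521,
                               'Texas': 26448193,
                               'New York': 19651127,
                               'Florida': 19552860,
                               'Illinois': 12882135})

# dictionary-style membership test on the index labels
print('Texas' in population_series)

# dictionary-style access to the keys (the index labels)
print(list(population_series.keys()))

# unlike a dictionary, a Series can be sliced by label;
# the end label is included in the result
subset = population_series['California':'New York']
print(subset)
```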

1.4. Other ways to create Pandas Series

In the following examples, we will see how we can use the index= keyword argument to construct the Series object from the subset of data provided

→ Using a scalar with an explicit index. The length of the index defines how many times the scalar is repeated in the Series object. Look at the example below:

# using scalar
# number of instances of '10' is
# defined by index=

pd.Series(10,index=[1,2,3,4,5])
1    10
2    10
3    10
4    10
5    10
dtype: int64

→ Using a subset of a dictionary, by providing the index labels of the required values

# pd.Series() takes-in all dict values
# but Series will be made from only those values
# whose keys are explicitly mentioned in index=

pd.Series({'a':1, 'b':2, 'c':3}, index=['a','b'])
a    1
b    2
dtype: int64
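
A related case worth sketching is what happens when index= names a label that is not a key of the dictionary. In the example below, the label 'd' and the variable name partial are assumptions added for illustration:

```python
import pandas as pd

# 'd' is not a key of the dictionary, so its value is missing (NaN)
partial = pd.Series({'a': 1, 'b': 2, 'c': 3}, index=['a', 'b', 'd'])
print(partial)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 rather than int64.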

2. PANDAS DATAFRAME OBJECT

If a Series is analogous to a one-dimensional array with flexible indices, a DataFrame is analogous to a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a 2D array as an ordered sequence of aligned (sharing the same index) 1D columns, you can think of a DataFrame as a sequence of aligned (sharing the same index) Series objects

The Pandas DataFrame constructor has the following common parameters:

pd.DataFrame(data=, index=, columns=, dtype=)
  • The DataFrame constructor has essentially the same keyword arguments as the Pandas Series constructor
  • However, a DataFrame can’t be constructed from a scalar (single value) alone; a scalar is only accepted when both index= and columns= are also provided
  • Besides, it also takes an additional columns= keyword argument, which sets the column labels. The default column labels are the integers from 0 to n-1, where n is the number of columns
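
The behaviour with scalars can be checked with a short sketch. A scalar on its own is rejected by the constructor, while a scalar passed together with index= and columns= is repeated in every cell. The labels r1, r2, c1, c2 and the name df_scalar are made up for illustration:

```python
import pandas as pd

# a bare scalar is rejected: the constructor cannot infer a shape
try:
    pd.DataFrame(10)
except ValueError as err:
    print(f"ValueError: {err}")

# with index= and columns= the scalar fills every cell
df_scalar = pd.DataFrame(10, index=['r1', 'r2'], columns=['c1', 'c2'])
print(df_scalar)
```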

2.1 DataFrame from List or 2D Array

a. From a list

This looks similar to the Series object we created earlier, but in a DataFrame object we can also set the column label, by using the columns= kwarg. In the absence of this kwarg, the label of the first column defaults to 0, as can be seen in the example below:

df_list = pd.DataFrame([1,2,3,4])
df_list
   0
0  1
1  2
2  3
3  4

We can also explicitly set the label of the column, as you can see in the example below:

df_list = pd.DataFrame([1,2,3,4], columns=['col1'])
df_list
   col1
0     1
1     2
2     3
3     4

b. From a 2D Array

We can use a 2D array to construct a DataFrame with more than one column. If we don’t provide the columns= kwarg, the column labels default to the integers from 0 to n-1. See the example below:

df_2darray = pd.DataFrame([[1,2],[3,4]])
df_2darray
   0  1
0  1  2
1  3  4

However, we can also set custom (explicit) index and column names

df_2darray_custom = pd.DataFrame([[1,2],[3,4]],
                                index=['row1','row2'],
                                columns=['col1','col2'])
df_2darray_custom
      col1  col2
row1     1     2
row2     3     4
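
The examples in this subsection pass nested lists. As a sketch of the same idea with an actual 2D Numpy array, the following assumes an arbitrary 3x2 array built with np.arange (the names arr_2d and df_from_array are illustrative):

```python
import numpy as np
import pandas as pd

# a 3x2 array: [[0, 1], [2, 3], [4, 5]]
arr_2d = np.arange(6).reshape(3, 2)

# each row of the array becomes a row of the DataFrame
df_from_array = pd.DataFrame(arr_2d, columns=['col1', 'col2'])
print(df_from_array)
```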

2.2. DataFrame from Series Object

We can also create DataFrame object from previously defined Series object

# defining Series object
pd_sr = pd.Series([100,200,300,400])

# constructing DataFrame from Series object
pd_df = pd.DataFrame(pd_sr)
pd_df
     0
0  100
1  200
2  300
3  400

Let’s explicitly set the index and column labels

# defining Series object with index labels
pd_sr = pd.Series([100,200,300,400],
                 index=['a','b','c','d'])

# constructing DataFrame from Series object
# with custom columns labels
pd_df_custom = pd.DataFrame(pd_sr,
                            columns=['hundreds'])

pd_df_custom
   hundreds
a       100
b       200
c       300
d       400
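
Since a DataFrame can be thought of as a sequence of aligned Series objects, several Series can also be combined into one DataFrame. The sketch below uses two made-up Series, price and stock, whose indexes only partly overlap, to show how the alignment works:

```python
import pandas as pd

price = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
stock = pd.Series([5, 7, 9], index=['b', 'c', 'd'])

# the row index is the union of both indexes;
# entries missing from either Series become NaN
combined = pd.DataFrame({'price': price, 'stock': stock})
print(combined)
```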

2.3. DataFrame from Dictionaries

In a dictionary’s key-value pairs, a value can itself be another dictionary. We will use this to construct our DataFrame object. Pay particular attention to how the keys are used to assign the index and column labels of the DataFrame

# reproducing population dictionary
population_dict = {'California': 38332521, 
                   'Texas': 26448193, 
                   'New York': 19651127, 
                   'Florida': 19552860, 
                   'Illinois': 12882135}

# making the area dictionary 
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297, 
             'Florida': 170312, 
             'Illinois': 149995}

# constructing DataFrame using dictionaries
states = pd.DataFrame({'population': population_dict,
                      'area': area_dict})
states
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995

  • The keys of the outer dictionary passed to pd.DataFrame() are used as column labels
  • The keys of the inner dictionaries are used as index labels

2.4. Creating DataFrame from ‘list of dictionaries’

In the following example, dictionaries are nested inside a list, and we pass this list as data= to the DataFrame constructor. Pay special attention to how the keys of the dictionaries are used as column labels

# first, we will use simple for loop 
# to construct the 'list of dictionaries'

list_of_dict = [{'a': i, 'b': 2*i, 'c': 3*i}
               for i in range(5)]

list_of_dict
[{'a': 0, 'b': 0, 'c': 0},
 {'a': 1, 'b': 2, 'c': 3},
 {'a': 2, 'b': 4, 'c': 6},
 {'a': 3, 'b': 6, 'c': 9},
 {'a': 4, 'b': 8, 'c': 12}]
# creating DataFrame from the above 'list of dictionaries'
pd.DataFrame(list_of_dict)
   a  b   c
0  0  0   0
1  1  2   3
2  2  4   6
3  3  6   9
4  4  8  12
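
The dictionaries in the list do not need to share the same keys. As a small sketch with made-up records, a key that is missing from one of the dictionaries is filled with NaN in the corresponding row:

```python
import pandas as pd

# the two dictionaries share only the key 'b'
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]

df_records = pd.DataFrame(records)
print(df_records)
```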

2.5. Other Concepts

a. Fetching Attributes of DataFrame:

We will fetch the commonly used attributes of a DataFrame:

print(f"Index: {states.index}")
print(f"Columns Names: {states.columns}")
print(f"Shape: {states.shape}")
print(f"Size: {states.size}")
print(f"Values: {states.values}")
Index: Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
Columns Names: Index(['population', 'area'], dtype='object')
Shape: (5, 2)
Size: 10
Values: [[38332521   423967]
 [26448193   695662]
 [19651127   141297]
 [19552860   170312]
 [12882135   149995]]

b. Indexing 101 on DataFrame

Indexing a DataFrame object with [] selects by column label

# fetch all values where column label = population
states['population']
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
# fetch all values where column label = area
states['area']
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
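
Because [] works on column labels, rows are usually selected with the .loc (label-based) and .iloc (position-based) indexers instead. The sketch below rebuilds a two-row version of states so that it runs on its own. The variable names texas_by_label and texas_by_position are illustrative:

```python
import pandas as pd

states = pd.DataFrame({'population': {'California': 38332521, 'Texas': 26448193},
                       'area': {'California': 423967, 'Texas': 695662}})

# [] selects a column by its label
area = states['area']

# .loc selects a row by index label, .iloc by integer position
texas_by_label = states.loc['Texas']
texas_by_position = states.iloc[1]

print(texas_by_label)
print(texas_by_label.equals(texas_by_position))
```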

3. PANDAS INDEX OBJECT

  • Both Pandas Series and DataFrame objects contain an explicit index that lets us reference and modify their data. In some of the above examples, we explicitly provided the index= keyword argument to pd.Series and pd.DataFrame. However, the index object can also be predefined using the pd.Index() constructor
  • This Index object can be considered either as an immutable array or as an ordered set
  • The outputs below show an integer index as Int64Index, which is how pandas versions before 2.0 display it; pandas 2.0 and later display it as Index([...], dtype='int64')
# creating Pandas Index object
index_obj = pd.Index([1,2,3,4,5])
index_obj
Int64Index([1, 2, 3, 4, 5], dtype='int64')
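
As a sketch of how a predefined Index object can be reused, the example below passes the same Index to both constructors. The names labels, sr and df, and the data values, are assumptions made for illustration:

```python
import pandas as pd

labels = pd.Index(['w', 'x', 'y', 'z'])

# the same Index object can be passed to several constructors
sr = pd.Series([0.25, 0.50, 0.75, 1.0], index=labels)
df = pd.DataFrame({'col1': [1, 2, 3, 4]}, index=labels)

# both objects now carry identical row labels
print(sr.index.equals(df.index))
```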

3.1. Index as Immutable Array

Index object works in many ways like an array, for example, we can use standard indexing techniques:

# fetch first index
index_obj[0]
1
# fetch every other index, starting from first
index_obj[::2]
Int64Index([1, 3, 5], dtype='int64')

However, an Index object is an immutable array, i.e., its values can’t be changed. If we try to change them, the result is TypeError: Index does not support mutable operations
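
The immutability can be demonstrated with a short sketch that attempts an item assignment and catches the resulting error:

```python
import pandas as pd

index_obj = pd.Index([1, 2, 3, 4, 5])

try:
    index_obj[0] = 10          # attempt an in-place modification
except TypeError as err:
    print(f"TypeError: {err}")

# the Index is unchanged
print(list(index_obj))
```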

3.2. Index as Ordered Set

The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way. The set methods are used below, because in pandas 2.0 and later the operators &, | and ^ perform element-wise logical operations on Index objects rather than set operations:

# creating index objects
indx1 = pd.Index([1,3,5,7,9,10,12])
indx2 = pd.Index([0,2,4,6,7,8,9,10])
# intersection of sets
indx1.intersection(indx2)
Int64Index([7, 9, 10], dtype='int64')
# union of sets
indx1.union(indx2)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12], dtype='int64')
# symmetric difference of sets
indx1.symmetric_difference(indx2)
Int64Index([0, 1, 2, 3, 4, 5, 6, 8, 12], dtype='int64')
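
One more combination, the one-sided difference, can be sketched with the .difference() method on the same two Index objects. The variable names only_in_1 and only_in_2 are illustrative:

```python
import pandas as pd

indx1 = pd.Index([1, 3, 5, 7, 9, 10, 12])
indx2 = pd.Index([0, 2, 4, 6, 7, 8, 9, 10])

# labels present in one Index but not in the other
only_in_1 = indx1.difference(indx2)
only_in_2 = indx2.difference(indx1)

print(list(only_in_1))   # [1, 3, 5, 12]
print(list(only_in_2))   # [0, 2, 4, 6, 8]
```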
