# Part 11: VECTORIZED STRING OPERATIONS

In this part, we learn how to apply Python’s familiar string operations to Pandas Series and to the columns of DataFrame objects.

## 1. INTRODUCING PANDAS STRING OPERATIONS

``````import numpy as np
import pandas as pd
``````

Vectorization is the process of applying an operation to multiple items (the elements of an array, for example) in one go.

``````x = np.array([1, 2, 3, 4, 5])

# multiply every element by 10 in one vectorized operation
x * 10
``````
``````array([10, 20, 30, 40, 50])
``````

However, vectorizing operations on an array of strings is not as straightforward. Pandas addresses this need with the vectorized string methods available under the `str` attribute of a Series.

``````names_series  = pd.Series(['tom','JOhn','MARIA'])
names_series
``````
``````0      tom
1     JOhn
2    MARIA
dtype: object
``````
``````names_series.str.capitalize()
``````
``````0      Tom
1     John
2    Maria
dtype: object
``````
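Beyond brevity, the `str` accessor also tolerates missing values. The sketch below compares a plain list comprehension with `str.capitalize()`. The `raw_names` list and its `None` entry are made up for illustration and are not part of the example above.

```python
import pandas as pd

# hypothetical data: the names from above plus one missing entry
raw_names = ['tom', 'JOhn', None, 'MARIA']

# a plain Python loop breaks on the missing value
try:
    looped = [name.capitalize() for name in raw_names]
except AttributeError as err:
    looped = None
    print('list comprehension failed:', err)

# the vectorized str method leaves the missing entry as missing instead of raising
capitalized = pd.Series(raw_names).str.capitalize()
print(capitalized)
```

Calling `capitalized.isna()` afterwards shows which positions were missing in the input.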

## 2. STRING OPERATIONS

Let’s first define a Pandas Series to work with:

``````# Pandas Series used in this section
names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White', 'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])
names
``````
``````0        Walter White
1       Jesse Pinkman
2        Skyler White
3        Hank Shrader
4    Mike Ehrmantraut
5           Gus Fring
dtype: object
``````

### 2.1. Methods Similar to Python String Methods

Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string method. The complete list is given in the Pandas documentation for the `Series.str` accessor.

``````# let's apply some of these string methods to the Pandas Series
# convert to upper case
names.str.upper()
``````
``````0        WALTER WHITE
1       JESSE PINKMAN
2        SKYLER WHITE
3        HANK SHRADER
4    MIKE EHRMANTRAUT
5           GUS FRING
dtype: object
``````
``````# to check if it is digit
names.str.isdigit()
``````
``````0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
``````
``````# to get length of each item in the array
names.str.len()
``````
``````0    12
1    13
2    12
3    12
4    16
5     9
dtype: int64
``````
``````# get a boolean Series that is True where the string starts with 'W'
names.str.startswith('W')
``````
``````0     True
1    False
2    False
3    False
4    False
5    False
dtype: bool
``````
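A boolean result like the one above is mostly useful as a mask. The following minimal sketch, which redefines the same `names` Series so that it runs on its own, filters the Series with `str.startswith()` and then chains two `str` methods. The variable names `starts_with_w` and `whites` are chosen here for illustration.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# keep only the names that start with 'W'
starts_with_w = names[names.str.startswith('W')]
print(starts_with_w)

# chain two str methods: lower-case first, then test the ending
whites = names[names.str.lower().str.endswith('white')]
print(whites)
```

Each `str` method returns a new Series, so the accessor has to be repeated at every step of a chain.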

### 2.2. String Methods using Regular Expressions

A regular expression is a special syntax for matching a string or a set of strings. The topic is broad and can be dry, so we only sample a plain-vanilla flavor of it here.

Several `str` methods, such as `str.extract()`, accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in `re` module.

``````# let's apply the str.extract() method with a regular expression to extract the first names
names.str.extract('([A-Za-z]+)')
``````
``````        0
0  Walter
1   Jesse
2  Skyler
3    Hank
4    Mike
5     Gus
``````
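As a further illustration of the regular-expression methods, the sketch below uses two named capture groups so that `str.extract()` returns one labelled column per group, and then uses `str.contains()` with an anchored pattern. The group names `first` and `last` and the variable names are assumptions made for this example.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# two named groups: one column per group, labelled with the group name
parts = names.str.extract(r'(?P<first>[A-Za-z]+)\s+(?P<last>[A-Za-z]+)')
print(parts)

# str.contains() interprets its pattern as a regular expression by default;
# the trailing '$' anchors the match at the end of each string
is_white = names.str.contains(r'White$')
print(names[is_white])
```

Writing the patterns as raw strings (`r'...'`) keeps backslashes such as `\s` from being interpreted by Python before they reach the regular-expression engine.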

Python’s official Regular Expression HOWTO has good introductory examples of regular expression usage.

### 2.3. Vectorized indexing and slicing

``````# getting first letter of each element in the array
# using standard indexing method
names.str[0]
``````
``````0    W
1    J
2    S
3    H
4    M
5    G
dtype: object
``````
``````# getting first letter of each element in the array
# using the str.get() method
names.str.get(0)
``````
``````0    W
1    J
2    S
3    H
4    M
5    G
dtype: object
``````
``````# str.slice()
names.str.slice(0,2)
``````
``````0    Wa
1    Je
2    Sk
3    Ha
4    Mi
5    Gu
dtype: object
``````
``````# str.split()
names.str.split()
``````
``````0        [Walter, White]
1       [Jesse, Pinkman]
2        [Skyler, White]
3        [Hank, Shrader]
4    [Mike, Ehrmantraut]
5           [Gus, Fring]
dtype: object
``````
``````# str.split() with str.get(0) to get first name
names.str.split().str.get(0)
``````
``````0    Walter
1     Jesse
2    Skyler
3      Hank
4      Mike
5       Gus
dtype: object
``````
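Two variations on the operations above may be handy. Bracket slicing on the `str` accessor is equivalent to `str.slice()`, and `str.split(expand=True)` spreads the pieces across DataFrame columns instead of returning lists. This is a minimal sketch; the column labels `first` and `last` are chosen here for illustration.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# bracket slicing is shorthand for str.slice()
first_two = names.str[:2]
last_three = names.str[-3:]
print(first_two.equals(names.str.slice(0, 2)))

# expand=True returns a DataFrame with one column per split piece
split_names = names.str.split(expand=True)
split_names.columns = ['first', 'last']  # labels chosen for illustration
print(split_names)
```

The expanded form avoids the second `str.get()` call when more than one piece of the split is needed.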

### 2.4. `get_dummies`

The `str.get_dummies()` method lets you quickly split delimited codes out into a DataFrame of indicator variables, with one column per distinct code.

``````dummy = pd.DataFrame({'info': ['A|B|C', 'A', 'A|C'],
                      'name': ['tom', 'dick', 'harry']})
print(dummy)
``````
``````    info   name
0  A|B|C    tom
1      A   dick
2    A|C  harry
``````
``````# using get_dummies
print(dummy['info'].str.get_dummies('|'))
``````
``````   A  B  C
0  1  1  1
1  1  0  0
2  1  0  1
``````
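The indicator columns are usually more useful next to the data they came from. One possible way to do that, sketched below, is to join them back onto the original DataFrame and filter on one of them. The names `indicators`, `combined` and `has_c` are introduced here for illustration.

```python
import pandas as pd

dummy = pd.DataFrame({'info': ['A|B|C', 'A', 'A|C'],
                      'name': ['tom', 'dick', 'harry']})

# build the indicator columns and attach them by index
indicators = dummy['info'].str.get_dummies('|')
combined = dummy.join(indicators)
print(combined)

# select the names whose 'info' field contains the code 'C'
has_c = combined.loc[combined['C'] == 1, 'name']
print(has_c.tolist())
```

Because `join()` aligns on the index, this works as long as the indicator DataFrame keeps the index of `dummy`, which `str.get_dummies()` does.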