Categories
Pandas

Part 8: FILTERING AND SORTING IN PANDAS

In this article, we will learn how to apply various filtering and sorting techniques on Pandas Series and DataFrame

This article covers a very basic overview of filtering and sorting techniques in Pandas

import numpy as np
import pandas as pd 

1. FILTERING

1.1. Conditional Operators

In Pandas (just like we covered in Numpy), we can create ‘filter conditions’ for DataFrame. For example, we can use conditional operators on a DataFrame column features, which return the boolean Series representing DataFrame that passes the filter condition

# reading csv and saving its content as 'df'
df = pd.read_csv('data/pedata.csv')

df.head()
Industry NameNumber of firmsCurrent PETrailing PEForward PE
0Advertising4720.0123.7713.84
1Aerospace/Defense7735.1144.2622.91
2Air Transport1814.8710.5510.16
3Apparel5125.7654.5721.97
4Auto & Truck1314.7716.7620.52

Let’s apply a filter to column Current PE values:

df['Current PE'] > 50

Result, will be a boolean array with True returned for all cells that passes the conditional operator and False otherwise:

0     False
1     False
2     False
3     False
4     False
      ...  
91    False
92    False
93    False
94     True
95     True
Name: Current PE, Length: 96, dtype: bool

Applying filter to column, Number of firms

df['Number of firms'] > 50
0     False
1      True
2     False
3      True
4     False
      ...  
91    False
92    False
93    False
94     True
95     True
Name: Number of firms, Length: 96, dtype: bool

1.2. Filter using Functions

For columns with string values: we can use str.startswith, str.endswith and str.contains functions Using ~ before the operation, negates the filter conditions

df['Industry  Name'].str.startswith('A')
0      True
1      True
2      True
3      True
4      True
      ...  
91    False
92    False
93    False
94    False
95    False
Name: Industry  Name, Length: 96, dtype: bool
df['Industry  Name'].str.contains('Air')
0     False
1     False
2      True
3     False
4     False
      ...  
91    False
92    False
93    False
94    False
95    False
Name: Industry  Name, Length: 96, dtype: bool
# .isin function checks the value in the provided list

df['Current PE'].isin([10,20])
0     False
1     False
2     False
3     False
4     False
      ...  
91    False
92    False
93    False
94    False
95    False
Name: Current PE, Length: 96, dtype: bool

1.3. Masking (Retrieving Rows that passes the Filter condition)

  • In above two sections, we studied how to create the filter that returns a list of boolean (True or False) along rows that passes or fails the test
  • We can apply the same filter under the square brackets [] of DataFrame variable to actually retrieve the rows, passing the filter condition.
df[df['Current PE'] > 150]
Industry NameNumber of firmsCurrent PETrailing PEForward PE
14Cable TV14156.5680.5726.37
30Entertainment107150.3947.6840.25
38Healthcare Products242159.8584.4371.38
43Hotel/Gaming65170.91134.2022.75
45Information Services69283.3746.2328.00
46Insurance (General)19693.0567.5724.42
55Oil/Gas Distribution24605.7269.4116.78
67Reinsurance2153.2157.4015.90
70Retail (Building Supply)17201.74238.8018.33
74Retail (Online)70319.22243.8286.28
df[df['Industry  Name'].str.startswith('A')]
Industry NameNumber of firmsCurrent PETrailing PEForward PE
0Advertising4720.0123.7713.84
1Aerospace/Defense7735.1144.2622.91
2Air Transport1814.8710.5510.16
3Apparel5125.7654.5721.97
4Auto & Truck1314.7716.7620.52
5Auto Parts4616.4217.5815.63

Using multiple conditions. Let suppose we need to know the industries with Current PE greater than 100 and Forward PE greater than 120

df[(df['Current PE'] > 100) & (df['Forward PE'] > 120)] 
Industry NameNumber of firmsCurrent PETrailing PEForward PE
77Semiconductor72109.3697.09248.11
87Telecom. Services67115.24742.09121.34

As another example, let suppose we are interested to know which industries has Forward PE greater than its Current PE value

df[df['Forward PE'] > df['Current PE']] 
Industry NameNumber of firmsCurrent PETrailing PEForward PE
4Auto & Truck1314.7716.7620.52
6Bank (Money Center)710.5610.2312.17
9Beverage (Soft)3434.4939.87143.56
11Brokerage & Investment Banking3914.0818.0516.34
15Chemical (Basic)4314.4016.1122.42
16Chemical (Diversified)69.6310.4810.13
20Computers/Peripherals4824.1328.9230.93
21Construction Supplies4422.3339.5826.20
24Drugs (Pharmaceutical)26722.4858.1835.43
25Education3521.3022.2026.03
27Electronics (Consumer & Office)2018.4064.2418.80
47Insurance (Life)2415.2721.0566.72
53Oil/Gas (Integrated)412.7322.6731.99
54Oil/Gas (Production and Exploration)26919.208.6634.96
65Real Estate (Operations & Services)5723.2032.4633.99
77Semiconductor72109.3697.09248.11
78Semiconductor Equip3925.5739.7328.46
81Software (Entertainment)8660.2533.9882.81
82Software (Internet)3090.1466.75100.71
84Steel3210.6114.3424.60
85Telecom (Wireless)1827.2125.6629.17
87Telecom. Services67115.24742.09121.34
91Trucking3317.5618.3623.54

2. SORTING

2.1. Sort by Feature

  • We can use sort_values function to sort DataFrame by one or more of its columns
  • we can either provide a single column label or list of column labels to sort by
  • Keyword arguments, ascending=True tells to sort in ascending order, ascending=False will sort in descending order
df.sort_values('Current PE')
Industry NameNumber of firmsCurrent PETrailing PEForward PE
18Coal & Related Energy227.0610.307.04
16Chemical (Diversified)69.6310.4810.13
6Bank (Money Center)710.5610.2312.17
84Steel3210.6114.3424.60
53Oil/Gas (Integrated)412.7322.6731.99
70Retail (Building Supply)17201.74238.8018.33
45Information Services69283.3746.2328.00
74Retail (Online)70319.22243.8286.28
55Oil/Gas Distribution24605.7269.4116.78
46Insurance (General)19693.0567.5724.42

96 rows × 5 columns

df.sort_values('Number of firms', ascending=False)
Industry NameNumber of firmsCurrent PETrailing PEForward PE
94Total Market705360.5270.8535.79
95Total Market (without financials)587862.4976.8339.72
7Banks (Regional)61116.9915.4113.70
23Drugs (Biotechnology)50377.3077.5630.21
83Software (System & Application)363144.40110.9076.82
6Bank (Money Center)710.5610.2312.17
16Chemical (Diversified)69.6310.4810.13
53Oil/Gas (Integrated)412.7322.6731.99
76Rubber& Tires415.2721.558.95
67Reinsurance2153.2157.4015.90

96 rows × 5 columns

# when using a list of column labels, any label used after first 
# acts as a tiebreaker for its preceding label
df.sort_values(['Current PE', 'Number of firms'])
Industry NameNumber of firmsCurrent PETrailing PEForward PE
18Coal & Related Energy227.0610.307.04
16Chemical (Diversified)69.6310.4810.13
6Bank (Money Center)710.5610.2312.17
84Steel3210.6114.3424.60
53Oil/Gas (Integrated)412.7322.6731.99
70Retail (Building Supply)17201.74238.8018.33
45Information Services69283.3746.2328.00
74Retail (Online)70319.22243.8286.28
55Oil/Gas Distribution24605.7269.4116.78
46Insurance (General)19693.0567.5724.42

96 rows × 5 columns

3. MORE EXAMPLES

In this section, we will load IMDB dataset. It is not a complete dataset of all IMDB, but a subset of 1,000 popular movies on IMDB from 2006 to 2016

imdb = pd.read_csv('data/imdb.csv')

imdb.head()
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore
01Guardians of the GalaxyAction,Adventure,Sci-FiA group of intergalactic criminals are forced …James GunnChris Pratt, Vin Diesel, Bradley Cooper, Zoe S…20141218.1757074333.1376.0
12PrometheusAdventure,Mystery,Sci-FiFollowing clues to the origin of mankind, a te…Ridley ScottNoomi Rapace, Logan Marshall-Green, Michael Fa…20121247.0485820126.4665.0
23SplitHorror,ThrillerThree girls are kidnapped by a man with a diag…M. Night ShyamalanJames McAvoy, Anya Taylor-Joy, Haley Lu Richar…20161177.3157606138.1262.0
34SingAnimation,Comedy,FamilyIn a city of humanoid animals, a hustling thea…Christophe LourdeletMatthew McConaughey,Reese Witherspoon, Seth Ma…20161087.260545270.3259.0
45Suicide SquadAction,Adventure,FantasyA secret government agency recruits some of th…David AyerWill Smith, Jared Leto, Margot Robbie, Viola D…20161236.2393727325.0240.0
print(imdb.shape)
(1000, 12)

🤔 Show us all movies from 2016 with rating greater than 8.5

imdb[(imdb['Year'] == 2016) & (imdb['Rating'] >= 8.5)].head()
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore
9697Kimi no na waAnimation,Drama,FantasyTwo strangers find themselves linked in a biza…Makoto ShinkaiRyûnosuke Kamiki, Mone Kamishiraishi, Ryô Nari…20161068.6341104.6879.0
117118DangalAction,Biography,DramaFormer wrestler Mahavir Singh Phogat and his t…Nitesh TiwariAamir Khan, Sakshi Tanwar, Fatima Sana Shaikh,…20161618.84896911.15NaN

🤔 Which movies generated revenues greater than 500 million?

imdb[imdb['Revenue (Millions)'] >= 500].head()
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore
1213Rogue OneAction,Adventure,Sci-FiThe Rebel Alliance makes a risky move to steal…Gareth EdwardsFelicity Jones, Diego Luna, Alan Tudyk, Donnie…20161337.9323118532.1765.0
5051Star Wars: Episode VII – The Force AwakensAction,Adventure,FantasyThree decades after the defeat of the Galactic…J.J. AbramsDaisy Ridley, John Boyega, Oscar Isaac, Domhna…20151368.1661608936.6381.0
5455The Dark KnightAction,Crime,DramaWhen the menace known as the Joker wreaks havo…Christopher NolanChristian Bale, Heath Ledger, Aaron Eckhart,Mi…20081529.01791916533.3282.0
7677The AvengersAction,Sci-FiEarth’s mightiest heroes must come together an…Joss WhedonRobert Downey Jr., Chris Evans, Scarlett Johan…20121438.11045588623.2869.0
8586Jurassic WorldAction,Adventure,Sci-FiA new theme park, built on the original site o…Colin TrevorrowChris Pratt, Bryce Dallas Howard, Ty Simpkins,…20151247.0455169652.1859.0

🤔 Sorting the DataFrame by rating

imdb.sort_values('Rating').head()
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore
829830Disaster MovieComedyOver the course of one evening, an unsuspectin…Jason FriedbergCarmen Electra, Vanessa Lachey,Nicole Parker, …2008871.97720714.1715.0
4243Don’t Fuck in the WoodsHorrorA group of friends are going on a camping trip…Shawn BurkettBrittany Blanton, Ayse Howard, Roman Jossart,N…2016732.7496NaNNaN
871872Dragonball EvolutionAction,Adventure,FantasyThe young warrior Son Goku sets out on a quest…James WongJustin Chatwin, James Marsters, Yun-Fat Chow, …2009852.7595129.3545.0
647648Tall MenFantasy,Horror,ThrillerA challenged man is stalked by tall phantoms i…Jonathan HolbrookDan Crisafulli, Kay Whitney, Richard Garcia, P…20161333.2173NaN57.0
968969WreckerAction,Horror,ThrillerBest friends Emily and Lesley go on a road tri…Micheal BafaroAnna Hutchison, Andrea Whitburn, Jennifer Koen…2015833.51210NaN37.0

🤔 Apply Filtering and Sorting at once: Which movies score greater than one million votes on IMDB, sort the result by rating

imdb[imdb['Votes'] > 1000000].sort_values('Rating', ascending=False).head()
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore
5455The Dark KnightAction,Crime,DramaWhen the menace known as the Joker wreaks havo…Christopher NolanChristian Bale, Heath Ledger, Aaron Eckhart,Mi…20081529.01791916533.3282.0
8081InceptionAction,Adventure,Sci-FiA thief, who steals corporate secrets through …Christopher NolanLeonardo DiCaprio, Joseph Gordon-Levitt, Ellen…20101488.81583625292.5774.0
3637InterstellarAdventure,Drama,Sci-FiA team of explorers travel through a wormhole …Christopher NolanMatthew McConaughey, Anne Hathaway, Jessica Ch…20141698.61047747187.9974.0
124125The Dark Knight RisesAction,ThrillerEight years after the Joker’s reign of anarchy…Christopher NolanChristian Bale, Tom Hardy, Anne Hathaway,Gary …20121648.51222645448.1378.0
144145Django UnchainedDrama,WesternWith the help of a German bounty hunter , a fr…Quentin TarantinoJamie Foxx, Christoph Waltz, Leonardo DiCaprio…20121658.41039115162.8081.0

🤔 Which movies did business greater than 100 million but have IMDB rating of 6 or less. Then sort results by rating:

imdb[(imdb['Revenue (Millions)'] > 100) & (imdb['Rating'] <= 6.0)].sort_values('Rating', ascending=True).head()
RankTitleGenreDescriptionDirectorActorsYearRuntime (Minutes)RatingVotesRevenue (Millions)Metascore
6364Fifty Shades of GreyDrama,Romance,ThrillerLiterature student Anastasia Steele’s life cha…Sam Taylor-JohnsonDakota Johnson, Jamie Dornan, Jennifer Ehle,El…20151254.1244474166.1546.0
580581Kickboxer: VengeanceActionA kick boxer is out to avenge his brother.John StockwellDave Bautista, Alain Moussi, Gina Carano, Jean…2016904.96809131.5637.0
925926The Twilight Saga: Breaking Dawn – Part 1Adventure,Drama,FantasyThe Quileutes close in on expecting parents Ed…Bill CondonKristen Stewart, Robert Pattinson, Taylor Laut…20111174.9190244281.2845.0
941942The Twilight Saga: EclipseAdventure,Drama,FantasyAs a string of mysterious killings grips Seatt…David SladeKristen Stewart, Robert Pattinson, Taylor Laut…20101244.9192740300.5258.0
165166TwilightDrama,Fantasy,RomanceA teenage girl risks everything when she falls…Catherine HardwickeKristen Stewart, Robert Pattinson, Billy Burke…20081225.2361449191.4556.0

Fifty shades of grey did make money for producers but people didn’t like the movie that much, do they?

Leave a Reply