Categories
Pandas

Part 3: I/O FILE READ and WRITE

In this article, we will learn the basics of I/O: how to read/write files in csv, json and xls format

1. READING THE FILE

We will learn how to read the three most commonly used file format – csv, xls(xlsx) and json

1.1 Reading a CSV file

  • CSV is a comma separated key-value pairs, where key is the column name and value its corresponding column values
  • pd.read_csv() constructor is used to read the csv file
  • First (and only required) keyword argument is the location of the file. The returned output is the DataFrame with integer indexes and comma-separated column names as column labels for the DataFrame
  • However, when we specify the keyword argument index_col, we can specify which columns we want to use as row labels

In the example below, we are going to read the file which is available here. I dropped some columns in the file to make it presentable because our main purpose is to read/write files rather than working with actual content of the file. In addition, following short terms are used column labels:

  • Number of firms = No
  • Average Unlevered Beta = AU Beta
  • Average Levered Beta = AL Beta
import numpy as np
import pandas as pd 

# location of csv file we are going to read
csv_location = "data/totalbeta.csv"

csv_df = pd.read_csv(csv_location)

# .head() shows only the first five rows in output
csv_df.head() 
IndustryNo.AU BetaAL Beta
0Advertising470.931.44
1Aerospace/Defense771.081.23
2Air Transport180.841.44
3Apparel510.831.06
4Auto & Truck130.531.10

Let use the kwarg index_col to use specified column value as row label:

# Assigning column "Industry Name" as row labels
# Index positon of "industry Name" is 0 

pd.read_csv(csv_location, index_col=0).head()
No.AU BetaAL Beta
Industry
Advertising470.931.44
Aerospace/Defense771.081.23
Air Transport180.841.44
Apparel510.831.06
Auto & Truck130.531.10

1.2. Reading an Excel File

  • We can read all spreadsheets in an excel workbook or individual sheets inside the workbook
  • pd.read_excel constructor is used to read an excel file
  • We can provide keyword argument, sheet_name to read a specific sheet in workbook — it can be either integer index or exact sheet name as string. sheet_name=None returns all sheets
  • We can provide keyword argument, index_col to set the column name as row label (just like we do in pd.read_csv)
xls_location  = "data/totalbeta.xls"

xls_df = pd.read_excel(xls_location)
xls_df.head()
IndustryNo.AU BetaAL Beta
0Advertising470.9349531.439612
1Aerospace/Defense771.0785221.231584
2Air Transport180.8436731.435348
3Apparel510.8296411.055097
4Auto & Truck130.5257351.095074
  • Let’s fetch all sheet names inside the workbook using kwarg sheet_name=None and .keys method:
# reading by sheet_name
# first, let fetch all the sheet names in the workbook

pd.read_excel(xls_location, sheet_name=None).keys()
odict_keys(['US', 'Global', 'Europe', 'Emerging'])
  • Let’s store each sheet content inside the variable
# storing each sheet in its own dataframe

us_sheet = pd.read_excel(xls_location, sheet_name='US')
global_sheet = pd.read_excel(xls_location, sheet_name='Global')
europe_sheet = pd.read_excel(xls_location, sheet_name='Europe')
emerging_sheet = pd.read_excel(xls_location, sheet_name='Emerging')
# read the first sheet by integer indexing

pd.read_excel(xls_location, sheet_name=0).head()
IndustryNo.AU BetaAL Beta
0Advertising470.9349531.439612
1Aerospace/Defense771.0785221.231584
2Air Transport180.8436731.435348
3Apparel510.8296411.055097
4Auto & Truck130.5257351.095074
# read the first sheet by its sheet name

pd.read_excel(xls_location, sheet_name='US').head()
IndustryNo.AU BetaAL Beta
0Advertising470.9349531.439612
1Aerospace/Defense771.0785221.231584
2Air Transport180.8436731.435348
3Apparel510.8296411.055097
4Auto & Truck130.5257351.095074
# reading more than one sheets

excel_file = pd.read_excel(xls_location, sheet_name=[0,-1])
excel_file.keys()
odict_keys([0, -1])
excel_file[-1].head()
Industry NameNo.AU BetaAL Beta
0Advertising1031.4638801.495675
1Aerospace/Defense891.0995781.175447
2Air Transport860.5902831.111339
3Apparel9070.6350540.749096
4Auto & Truck831.1274851.379411

1.3. Reading a JSON file

  • Think of json file as a python dictionary with key-value pair
  • pd.read_json() constructor is used to read json file
  • When no keyword argument is used: outer keys are used as column labels and inner keys are used as row labels
  • When we use orient='index' keyword argument, outer keys are used as row labels and inner keys are used as column labels
json_location = "data/totalbeta.json"

json_df = pd.read_json(json_location)
json_df.head()
IndustryNo.AU BetaAL Beta
0Advertising470.931.44
1Aerospace/Defense771.081.23
2Air Transport180.841.44
3Apparel510.831.06
4Auto & Truck130.531.10

2. Writing to File

We will write in three most commonly used formats – csv, xls(xlsx) and json

2.1. Writing to a CSV file

  • .to_csv() constructor is used to write in csv format
  • The first keyword argument is the name of file
  • By default, to.csv() uses the index labels as first column in the csv file, which can be non-meaningful if just integer, we can provide index=False keyword argument to not write the row label into a csv file
  • for the purpose of demonstration, we will use the data we have stored above under variables csv_df, excel_df and json_df
# writing
csv_df.to_csv('data/tocsv_file.csv') 

# reading the written file
pd.read_csv("data/tocsv_file.csv").head()
Unnamed: 0IndustryNo.AU BetaAL Beta
00Advertising470.931.44
11Aerospace/Defense771.081.23
22Air Transport180.841.44
33Apparel510.831.06
44Auto & Truck130.531.10

We can see that the index label is used as first column, which is meaningless in our case, let use index=False to not use the index as first column

# setting index=False

# writing
csv_df.to_csv('data/tocsv_index_false.csv', index=False) 

# reading the written file
pd.read_csv("data/tocsv_index_false.csv").head()
IndustryNo.AU BetaAL Beta
0Advertising470.931.44
1Aerospace/Defense771.081.23
2Air Transport180.841.44
3Apparel510.831.06
4Auto & Truck130.531.10

2.2. Writing to Excel File

  • .to_excel() constructor is used to write single sheet to the excel workbook
  • To write multiple sheets in excel workbook, first we need to load the excel file using pd.ExcelWriter, then use it as first argument in the .to_excel()
  • default label for sheet is Sheet1, Sheet2 unless we provide it explicitly using sheet_name keyword argument
  • to avoid writing row label as first column, use index=False
# writing
xls_df.to_excel('data/toexcel_file.xls', index=False)

# reading the written file
pd.read_excel('data/toexcel_file.xls').head()
IndustryNo.AU BetaAL Beta
0Advertising470.9349531.439612
1Aerospace/Defense771.0785221.231584
2Air Transport180.8436731.435348
3Apparel510.8296411.055097
4Auto & Truck130.5257351.095074
# writing individual sheets to excel workbook

with pd.ExcelWriter('data/toexcel_file_multiple_sheets.xls') as writer:
    us_sheet.to_excel(writer, index=False, sheet_name='usa')
    europe_sheet.to_excel(writer, index=False, sheet_name='europe')
    global_sheet.to_excel(writer, index=False, sheet_name='global')
    emerging_sheet.to_excel(writer, index=False, sheet_name='emerging')
    
# reading the written file

xls_df_new = pd.read_excel('data/toexcel_file_multiple_sheets.xls', sheet_name=None)
xls_df_new.keys()
odict_keys(['usa', 'europe', 'global', 'emerging'])
xls_df_new['usa'].head()
IndustryNo.AU BetaAL Beta
0Advertising470.9349531.439612
1Aerospace/Defense771.0785221.231584
2Air Transport180.8436731.435348
3Apparel510.8296411.055097
4Auto & Truck130.5257351.095074

2.3. Write to JSON File

  • .to_json() constructor is used
  • Meaning and application of orient='index' in .to_json will remain the same as explained in pd.read_json()
# writing
json_df.to_json('data/tojson_file.json')

# reading the written file
pd.read_json('data/tojson_file.json').head()
IndustryNo.AU BetaAL Beta
0Advertising470.931.44
1Aerospace/Defense771.081.23
2Air Transport180.841.44
3Apparel510.831.06
4Auto & Truck130.531.10

Leave a Reply