Pandas

What is Pandas?

Pandas is an amazing Python library designed to be used when working with data. Use it to clean, transform, analyse, or visualise data. The main data structure in pandas is a DataFrame. It is basically like a table with rows and columns. The header is first row of the CSV, excel or text file that contains the column names. To start with import pandas with:

import pandas as pd

Loading Data

# From a CSV file:
df = pd.read_csv("filename.csv")
# From an Excel file:
df = pd.read_excel("filename.xlsx")
# From a tab-separated text file:
df = pd.read_csv("file.txt", delimiter="\t")

Exploring the Data

df.head()           # First 5 rows
df.head(2)          # First 2 rows
df.tail(3)          # Last 3 rows
df.describe()       # Summary stats: mean, std, min, max, quartiles
df.columns          # List of column names

To rename the columns:

df.columns = ['col1', 'col2', 'col3']

Accessing Data

df['column name']	# Accessing a Column
df.column_name        # Only works if name has no spaces
df['column name'][0:5]    # First 5 entries of a column
df[['col1', 'col2']]      # Multiple columns
df.iloc[1]                # Row at index 1
df.iloc[1:4]              # Rows 1 to 3
df.iloc[2, 1]             # Value at 3rd row, 2nd column

# Iterating over rows
for index, row in df.iterrows():
    print(index, row)
    print(row['column name'])

Filtering and Sorting

df.loc[df['column name'] == 'value'] # Filter rows by value
# Combining conditions
df.loc[(df['col1'] == 'value1') & (df['col2'] == 'value2')]
df.loc[(df['col1'] == 'value1') | (df['col2'] == 'value2')]
# Sorting values
df.sort_values('column name')
df.sort_values('column name', ascending=False)
df.sort_values(['col1', 'col2'])  # Sort by multiple columns
df.sort_values(['col1', 'col2'], ascending=[True, False])

Creating and Editing Columns

# Creating a column
df['new_col'] = df['col1'] + df['col2']
df['sum'] = df.iloc[:, 4:9].sum(axis=1)  # Sum across columns 4-8
# Drop a column:
df.drop(columns=['col_name'], inplace=True)
# Rearrange columns:
df = df[['col1', 'col2', 'col4', 'col3']]
cols = list(df.columns.values)
df = df[cols[0:4] + [cols[-1]] + cols[4:12]]

Saving Data

df.to_csv('modified.csv')                        # Includes index
df.to_csv('file.csv', index=False)               # Excludes index
df.to_excel('file.xlsx', index=False)            # Excel file
df.to_csv('file.txt', index=False, sep='\t')     # Tab-separated file

Resetting Index

In Pandas, resetting the index refers to moving the current row labels (index) back to the default integer index (0, 1, 2, …) and optionally keeping the old index as a column.

df = df.reset_index()                     # Resets and keeps old index
df = df.reset_index(drop=True)           # Resets and drops old index
df.reset_index(drop=True, inplace=True)  # Same as above, in-place

Filtering by Strings

df.loc[df['col1'].str.contains('string')]
df.loc[~df['col1'].str.contains('string')]
df.loc[df['col1'].str.contains('pattern', regex=True)]
import re
df.loc[df['col1'].str.contains('string', flags=re.IGNORECASE, regex=True)]

Updating Values

df.loc[df['col1'] == 'Old Value', 'col1'] = 'New Value'
df.loc[df['col1'] == 'Old Value', 'col2'] = 'New Value'
df.loc[df['col1'] > 100, ['col2', 'col3']] = ['Value2', 'Value3']

Grouping Data

df.groupby(['col1']).mean()
df.groupby(['col1']).sum()
df.groupby(['col1']).count()

Tips

  • Use print(df.info()) to get a quick overview of your DataFrame, including data types and non-null counts.