Python Data Science Cheatsheet

Pandas

import pandas as pd

Importing Data

pd.read_csv(filename) # From a CSV file
pd.read_table(filename) # From a delimited text file (like TSV)
pd.read_excel(filename) # From an Excel file
pd.read_sql(query, connection_object) # Read from a SQL table/database
pd.read_json(json_string) # Read from a JSON formatted string, URL or file
pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) # From a dict, keys for columns names, values for data as lists

CSV with Custom Separator

df = pd.read_csv("data/entryRequests.csv", sep=";")

CSV with Date Parsing

Combines Date and Time columns into a single Date_Time column:

df = pd.read_csv("data/entryRequests.csv", sep=";", parse_dates=[["Date", "Time"]])

Inspecting Data

df.head(n) # First n rows of the DataFrame
df.tail(n) # Last n rows of the DataFrame
df.shape # Number of rows and columns
df.info() # Index, Datatype and Memory information
df.describe() # Summary statistics for numerical columns
df.columns # Column names
s.value_counts(dropna=False) # View unique values and counts
df.apply(pd.Series.value_counts) # Unique values and counts for all columns

Data Operations

Select Columns

df_short = df[['color', 'food', 'score']]

Select Rows

df.iloc[i:n, :] # Rows from i-th to n-th
df.iloc[i, :] # i-th row

Delete Row

modDfObj = dfObj.drop(i)

Create Time Bins

Frequency options reference

minute_bins = pd.date_range(start='1/1/2020', end='1/08/2020', freq='min')

Group By

Use as_index=False to keep the grouped column as a regular column:

df.groupby("classroom", as_index=False).count()

Apply Function to Rows

df["new_column"] = df.apply(lambda row: some_function(row["existing_column"]), axis=1)

Note: axis=0 applies to columns instead.

Exporting Data

df.to_csv('file_name.csv', index=False) # Without row index column