Python Data Science Cheatsheet
Pandas
import pandas as pd
Importing Data
pd.read_csv(filename) # From a CSV file
pd.read_table(filename) # From a delimited text file (like TSV)
pd.read_excel(filename) # From an Excel file
pd.read_sql(query, connection_object) # Read from a SQL table/database
pd.read_json(json_string) # Read from a JSON formatted string, URL or file
pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) # From a dict, keys for columns names, values for data as lists
CSV with Custom Separator
df = pd.read_csv("data/entryRequests.csv", sep=";")
CSV with Date Parsing
Combines Date and Time columns into a single Date_Time column:
df = pd.read_csv("data/entryRequests.csv", sep=";", parse_dates=[["Date", "Time"]])
Inspecting Data
df.head(n) # First n rows of the DataFrame
df.tail(n) # Last n rows of the DataFrame
df.shape # Number of rows and columns
df.info() # Index, Datatype and Memory information
df.describe() # Summary statistics for numerical columns
df.columns # Column names
s.value_counts(dropna=False) # View unique values and counts
df.apply(pd.Series.value_counts) # Unique values and counts for all columns
Data Operations
Select Columns
df_short = df[['color', 'food', 'score']]
Select Rows
df.iloc[i:n, :] # Rows from i-th to n-th
df.iloc[i, :] # i-th row
Delete Row
modDfObj = dfObj.drop(i)
Create Time Bins
minute_bins = pd.date_range(start='1/1/2020', end='1/08/2020', freq='min')
Group By
Use as_index=False to keep the grouped column as a regular column:
df.groupby("classroom", as_index=False).count()
Apply Function to Rows
df["new_column"] = df.apply(lambda row: some_function(row["existing_column"]), axis=1)
Note: axis=0 applies to columns instead.
Exporting Data
df.to_csv('file_name.csv', index=False) # Without row index column