Python data science cheatsheet

Pandas

import pandas as pd

Load CSV with ‘;’ separator

df = pd.read_csv(“data/entryRequests.csv”, sep = “;”, parse_dates=[[“Date”, “Time”]])

print(df.columns)

Importing

pd.read_csv(filename) # From a CSV file
pd.read_table(filename) # From a delimited text file (like TSV)
pd.read_excel(filename) # From an Excel file
pd.read_sql(query, connection_object) # Read from a SQL table/database
pd.read_json(json_string) # Read from a JSON formatted string, URL or file.
pd.read_html(url) # Parses an html URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) # From a dict, keys for columns names, values for data as lists

From CSV

Parsing date and time columns. A new column Date_Time is created.

df = pd.read_csv("data/entryRequests.csv", sep = ";", parse_dates=[["Date", "Time"]]) 

Inspecting data

df.head(n) # First n rows of the DataFrame
df.tail(n) # Last n rows of the DataFrame
df.shape # Number of rows and columns
df.info() # Index, Datatype and Memory information
df.describe() # Summary statistics for numerical columns
s.value_counts(dropna=False) # View unique values and counts
df.apply(pd.Series.value_counts) # Unique values and counts for all columns

Other operations

Keep a subset of the columns

df_short = df[['color', 'food', 'score']]

Keep a subset of rows from i-th to n-th

df.iloc[ i:n, : ]

get the i-th row

df.iloc[ i, : ]

Delete i-th row

modDfObj = dfObj.drop(i)

Create time bins.

m/d/aaaa format (for more freq options check here

minute_bins = pd.date_range(start='1/1/2020', end='1/08/2020', freq='min')

Group by

If as_index=True (by default) then the column classroom would be converted into an index by the groupby, and wouldn’t be a column anymore (can’t imagine why this bahaviour, but…).

df.groupby("classroom", as_index=False).count()

Export as CSV

Withoud a column for indexes

df.to_csv('file_name.csv', index=False)

Apply an operation to all rows and save to a new column

df["new_column"] = df.apply(lambda row: some_function(row["existing_column"]), axis=1)

Note: axis=0 would apply it to all columns