20 Polars concepts with Before-and-After Examples
1. Creating DataFrames (pl.DataFrame) 🏗️
Boilerplate Code:
import polars as pl
Use Case: Create a DataFrame to hold your data, similar to pandas. 🏗️
Goal: Store and manipulate data in a high-performance DataFrame structure. 🎯
Sample Code:
# Create a DataFrame
df = pl.DataFrame({
"column1": [1, 2, 3],
"column2": ["A", "B", "C"]
})
# Now you have a DataFrame with two columns!
Before Example: has raw data but no structure to manipulate it easily. 🤔
Data: [1, 2, 3], ["A", "B", "C"]
After Example: With pl.DataFrame(), the data is structured and ready to work with! 🏗️
DataFrame:
shape: (3, 2)
┌─────────┬─────────┐
│ column1 │ column2 │
├─────────┼─────────┤
│ 1 │ A │
│ 2 │ B │
│ 3 │ C │
└─────────┴─────────┘
Challenge: 🌟 Try creating a DataFrame with more columns and different data types.
2. Selecting Columns (df.select) 🔎
Boilerplate Code:
df.select("column_name")
Use Case: Select specific columns from your DataFrame to work with or analyze. 🔎
Goal: Extract just the columns you need from the DataFrame. 🎯
Sample Code:
# Select a single column
df.select("column1")
# Select multiple columns
df.select(["column1", "column2"])
Before Example: has a large DataFrame but only needs specific columns for their analysis. 🤔
DataFrame: all columns.
After Example: With df.select(), only the needed columns are extracted! 🔎
Selected Columns: ["column1"]
Challenge: 🌟 Try selecting columns that match a specific pattern using wildcards like "column_*".
3. Filtering Rows (df.filter) 🔍
Boilerplate Code:
df.filter(pl.col("column_name") > value)
Use Case: Filter rows based on a condition to narrow down your data. 🔍
Goal: Extract only the rows that meet certain criteria. 🎯
Sample Code:
# Filter rows where column1 > 1
df.filter(pl.col("column1") > 1)
Before Example: has a DataFrame with all the rows but only needs a subset that matches specific criteria. 🤔
DataFrame: all rows.
After Example: With df.filter(), only the rows that meet the condition are included! 🔍
Filtered DataFrame: Rows where column1 > 1.
Challenge: 🌟 Try filtering using multiple conditions (e.g., column1 > 1 and column2 == "B").
4. Adding Columns (df.with_columns) ➕
Boilerplate Code:
df.with_columns([pl.col("existing_column") * 2])
Use Case: Add a new column based on existing columns in the DataFrame. ➕
Goal: Extend your DataFrame by creating new columns derived from existing data. 🎯
Sample Code:
# Add a new column that multiplies column1 by 2
df.with_columns([
(pl.col("column1") * 2).alias("new_column")
])
Before Example: Need to add a new column but doesn’t know how. 🤷♂️
DataFrame: columns1, column2
After Example: With df.with_columns(), the DataFrame now has an additional column! ➕
New Column: 'new_column' added.
Challenge: 🌟 Try adding a column that combines values from multiple existing columns.
5. Grouping Data (df.groupby) 📊
Boilerplate Code:
df.groupby("column_name").agg([pl.col("another_column").mean()])
Use Case: Group data by a column and apply aggregate functions like mean, sum, or count. 📊
Goal: Summarize data by grouping similar entries and applying calculations. 🎯
Sample Code:
# Group by column2 and calculate the mean of column1
df.groupby("column2").agg([
pl.col("column1").mean()
])
Before Example: has unsummarized data and wants to compute statistics for each group. 🤔
DataFrame: ungrouped data.
After Example: With df.groupby(), the data is grouped, and the mean is calculated for each group! 📊
Grouped DataFrame: mean of column1 by column2.
Challenge: 🌟 Try grouping by multiple columns and applying multiple aggregate functions.
6. Sorting Data (df.sort) 🔢
Boilerplate Code:
df.sort("column_name", reverse=True)
Use Case: Sort the DataFrame by one or more columns, either in ascending or descending order. 🔢
Goal: Rearrange your data for better visualization or analysis. 🎯
Sample Code:
# Sort by column1 in descending order
df.sort("column1", reverse=True)
Before Example: data is unsorted, making it harder to analyze. 🤔
DataFrame: unsorted.
After Example: With df.sort(), the data is sorted in the desired order! 🔢
Sorted DataFrame: column1 sorted in descending order.
Challenge: 🌟 Try sorting by multiple columns, with one in ascending and the other in descending order.
7. Joining DataFrames (df.join) 🔗
Boilerplate Code:
df1.join(df2, on="column_name", how="inner")
Use Case: Join two DataFrames on a common column to combine data. 🔗
Goal: Merge data from two sources based on shared columns. 🎯
Sample Code:
# Perform an inner join on column1
df1.join(df2, on="column1", how="inner")
Before Example: has two separate DataFrames but needs to merge them into one. 🤷♂️
Two separate DataFrames.
After Example: With df.join(), the DataFrames are now combined into one! 🔗
Joined DataFrame: merged on column1.
Challenge: 🌟 Try different join types like how="left" or how="outer" to see how the output changes.
8. Pivoting Data (df.pivot) 🔄
Boilerplate Code:
df.pivot(values="value_column", index="index_column", columns="pivot_column")
Use Case: Pivot your data to reshape it, turning unique values into columns. 🔄
Goal: Rearrange your DataFrame from long format to wide format. 🎯
Sample Code:
# Pivot the DataFrame
df.pivot(values="value_column", index="index_column", columns="pivot_column")
Before Example: has data in long format but wants to transform it into a more readable structure. 🤔
Long format DataFrame: stacked rows.
After Example: With df.pivot(), the data is reshaped into a more readable format! 🔄
Pivoted DataFrame: values turned into columns.
Challenge: 🌟 Try applying different aggregation functions during pivoting (e.g., sum or mean).
9. Lazy Evaluation (df.lazy) 💤
Boilerplate Code:
df.lazy().select(...)
Use Case: Use lazy evaluation to defer execution of operations until explicitly needed, improving performance. 💤
Goal: Chain multiple operations without executing them immediately. 🎯
Sample Code:
# Use lazy evaluation
df.lazy().filter(pl.col("column1") > 1).select("column2").collect()
Before Example: Run every operation immediately, slowing down performance with large datasets. 🐢
Eager
evaluation: every step executed immediately.
After Example: With lazy evaluation, operations are deferred and executed in one go! 💤
Lazy evaluation: operations executed only when collected.
Challenge: 🌟 Try chaining multiple operations together and see the performance improvement.
10. Exploratory Data Analysis (df.describe) 🔍
Boilerplate Code:
df.describe()
Use Case: Get a quick summary of your DataFrame for exploratory data analysis. 🔍
Goal: View statistics like count, mean, and standard deviation for numeric columns. 🎯
Sample Code:
# Get summary statistics
df.describe()
Before Example: Has a DataFrame but no quick overview of its statistics. 🤔
DataFrame: raw data, no summary.
After Example: With df.describe(), the intern gets a useful summary of the data! 🔍
DataFrame Summary: count, mean, std, min, max.
Challenge: 🌟 Try getting descriptive statistics for specific columns only.
11. Renaming Columns (df.rename) 🔤
Boilerplate Code:
df.rename({"old_column": "new_column"})
Use Case: Rename columns in your DataFrame for clarity or consistency. 🔤
Goal: Change column names to something more descriptive or standardized. 🎯
Sample Code:
# Rename a column
df.rename({"column1": "renamed_column1"})
Before Example: has unclear or inconsistent column names. 🤔
Columns: ["column1", "column2"]
After Example: With df.rename(), the columns have more descriptive names! 🔤
Renamed Columns: ["renamed_column1", "column2"]
Challenge: 🌟 Try renaming multiple columns at once.
12. Handling Null Values (df.fill_null) 🚫
Boilerplate Code:
df.fill_null("default_value")
Use Case: Handle missing values by filling them with a default value. 🚫
Goal: Replace null or missing values with something meaningful to avoid issues in analysis. 🎯
Sample Code:
# Fill null values with a default value
df.fill_null(0)
Before Example: DataFrame has null values that can cause errors in calculations. 😬
DataFrame: some rows with null values.
After Example: With df.fill_null(), the missing values are filled in with a default! 🚫
Filled DataFrame: null values replaced with 0.
Challenge: 🌟 Try using different fill strategies like filling with the column mean or forward filling (ffill).
13. Concatenating DataFrames (pl.concat) 🛠️
Boilerplate Code:
pl.concat([df1, df2], how="vertical")
Use Case: Concatenate two or more DataFrames either vertically or horizontally. 🛠️
Goal: Combine multiple DataFrames into one for easier analysis. 🎯
Sample Code:
# Concatenate two DataFrames vertically
pl.concat([df1, df2], how="vertical")
Before Example: has separate DataFrames but wants to combine them. 🤷♂️
Two separate DataFrames: df1, df2
After Example: With pl.concat(), the DataFrames are now combined! 🛠️
Concatenated DataFrame: df1 and df2 merged.
Challenge: 🌟 Try horizontal concatenation by setting how="horizontal".
14. Window Functions (df.select with .over) 🔄
Boilerplate Code:
df.select([pl.col("column").sum().over("group_column")])
Use Case: Apply window functions like moving averages or rolling sums within groups. 🔄
Goal: Perform calculations over a subset of rows based on a window. 🎯
Sample Code:
# Apply a rolling sum over groups
df.select([pl.col("column1").sum().over("column2")])
Before Example: needs to calculate a sum or average for rows within each group. 🤔
Grouped DataFrame: no summary within groups.
After Example: With window functions, the intern can compute rolling calculations within groups! 🔄
Windowed DataFrame: sum of column1 over groups in column2.
Challenge: 🌟 Try using other window functions like .mean() or .max().
15. Date and Time Manipulation (pl.date_range, pl.col.dt) 📅
Boilerplate Code:
pl.date_range(start="2023-01-01", end="2023-01-10", interval="1d")
Use Case: Manipulate date and time data, such as creating date ranges or extracting parts of a date. 📅
Goal: Work with date data to generate time-series data or extract specific parts like year or month. 🎯
Sample Code:
# Create a date range
pl.date_range(start="2023-01-01", end="2023-01-10", interval="1d")
# Extract year from a date column
df.with_columns(pl.col("date_column").dt.year())
Before Example: has date data but struggles with extracting specific date components. 🤔
Dates: full date format (YYYY-MM-DD).
After Example: With date manipulation, the intern can now extract or manipulate specific date components! 📅
Extracted Year: [2023, 2023, ...]
Challenge: 🌟 Try extracting other parts of the date like month, day, or week using .dt.
16. Cumulative Operations (df.select with .cumsum) ➕
Boilerplate Code:
df.select([pl.col("column").cumsum()])
Use Case: Perform cumulative operations like cumulative sum or cumulative product. ➕
Goal: Calculate cumulative values across rows to see how a value builds up over time. 🎯
Sample Code:
# Calculate cumulative sum for a column
df.select([pl.col("column1").cumsum()])
Before Example: Need to track the running total but only has individual values. 🤔
Data: individual values [10, 20, 30]
After Example: With cumulative operations, the intern gets a running total! ➕
Cumulative Sum: [10, 30, 60]
Challenge: 🌟 Try applying cumulative product (.cumprod()) for a different calculation.
17. Melt (Wide to Long Format) 🔄
Boilerplate Code:
df.melt(id_vars="id_column", value_vars=["col1", "col2"])
Use Case: Melt DataFrames from wide to long format, useful for pivoting data. 🔄
Goal: Transform your DataFrame by melting multiple columns into rows. 🎯
Sample Code:
# Melt the DataFrame
df.melt(id_vars="id_column", value_vars=["col1", "col2"])
Before Example: Thas data in wide format but needs to convert it to long format. 🤔
Wide Format: col1, col2 as columns.
After Example: With melt, the DataFrame is transformed into long format! 🔄
Long Format: col1, col2 as rows.
Challenge: 🌟 Try melting a DataFrame with more columns and using different id_vars.
18. Pivot (Long to Wide Format) 🔄
Boilerplate Code:
df.pivot(values="value_column", index="index_column", columns="pivot_column")
Use Case: Pivot DataFrames from long to wide format, turning unique values into columns. 🔄
Goal: Reshape your DataFrame by pivoting data for easier analysis. 🎯
Sample Code:
# Pivot the DataFrame
df.pivot(values="value_column", index="index_column", columns="pivot_column")
Before Example: has data in long format but needs to reshape it into wide format. 🤔
Long Format: rows with duplicated entries.
After Example: With pivot, the DataFrame is transformed into wide format! 🔄
Wide Format: values turned into columns.
Challenge: 🌟 Try applying different aggregate functions during the pivot process, like sum or mean.
19. Exploding Columns (df.explode) 💥
Boilerplate Code:
df.explode("list_column")
Use Case: Explode a list or array column into multiple rows, flattening nested data. 💥
Goal: Convert a column of lists into individual rows for further analysis. 🎯
Sample Code:
# Explode a list column into separate rows
df.explode("list_column")
Before Example: has a column with lists but can’t easily analyze the data inside. 🤔
Data: [1, [2, 3], 4]
After Example: With explode(), each element in the list becomes its own row! 💥
Exploded DataFrame: individual elements [2, 3] in separate rows.
Challenge: 🌟 Try exploding multiple list columns simultaneously.
20. Reversing Columns (df.select with .reverse) 🔄
Boilerplate Code:
df.select([pl.col("column_name").reverse()])
Use Case: Reverse the order of values in a column or DataFrame. 🔄
Goal: Flip the order of rows for better analysis or reporting. 🎯
Sample Code:
# Reverse the order of a column
df.select([pl.col("column1").reverse()])
Before Example: The intern’s DataFrame is sorted, but they want to reverse the order. 🤔
DataFrame: ascending order.
After Example: With reverse(), the DataFrame is now in descending order! 🔄
Reversed DataFrame: rows flipped.
Challenge: 🌟 Try reversing specific columns while leaving others unchanged.