Skip to main content

Command Palette

Search for a command to run...

20 Polars concepts with Before-and-After Examples

Updated
10 min read
A
AI data engineer wiring agents, infra, and unapologetic build logs

1. Creating DataFrames (pl.DataFrame) 🏗️

Boilerplate Code:

import polars as pl

Use Case: Create a DataFrame to hold your data, similar to pandas. 🏗️

Goal: Store and manipulate data in a high-performance DataFrame structure. 🎯

Sample Code:

# Create a DataFrame
df = pl.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["A", "B", "C"]
})

# Now you have a DataFrame with two columns!

Before Example: has raw data but no structure to manipulate it easily. 🤔

Data: [1, 2, 3], ["A", "B", "C"]

After Example: With pl.DataFrame(), the data is structured and ready to work with! 🏗️

DataFrame: 
shape: (3, 2)
┌─────────┬─────────┐
│ column1 │ column2 │
├─────────┼─────────┤
│       1 │ A       │
│       2 │ B       │
│       3 │ C       │
└─────────┴─────────┘

Challenge: 🌟 Try creating a DataFrame with more columns and different data types.


2. Selecting Columns (df.select) 🔎

Boilerplate Code:

df.select("column_name")

Use Case: Select specific columns from your DataFrame to work with or analyze. 🔎

Goal: Extract just the columns you need from the DataFrame. 🎯

Sample Code:

# Select a single column
df.select("column1")

# Select multiple columns
df.select(["column1", "column2"])

Before Example: has a large DataFrame but only needs specific columns for their analysis. 🤔

DataFrame: all columns.

After Example: With df.select(), only the needed columns are extracted! 🔎

Selected Columns: ["column1"]

Challenge: 🌟 Try selecting columns that match a specific pattern using wildcards like "column_*".


3. Filtering Rows (df.filter) 🔍

Boilerplate Code:

df.filter(pl.col("column_name") > value)

Use Case: Filter rows based on a condition to narrow down your data. 🔍

Goal: Extract only the rows that meet certain criteria. 🎯

Sample Code:

# Filter rows where column1 > 1
df.filter(pl.col("column1") > 1)

Before Example: has a DataFrame with all the rows but only needs a subset that matches specific criteria. 🤔

DataFrame: all rows.

After Example: With df.filter(), only the rows that meet the condition are included! 🔍

Filtered DataFrame: Rows where column1 > 1.

Challenge: 🌟 Try filtering using multiple conditions (e.g., column1 > 1 and column2 == "B").


4. Adding Columns (df.with_columns)

Boilerplate Code:

df.with_columns([pl.col("existing_column") * 2])

Use Case: Add a new column based on existing columns in the DataFrame. ➕

Goal: Extend your DataFrame by creating new columns derived from existing data. 🎯

Sample Code:

# Add a new column that multiplies column1 by 2
df.with_columns([
    (pl.col("column1") * 2).alias("new_column")
])

Before Example: Need to add a new column but doesn’t know how. 🤷‍♂️

DataFrame: columns1, column2

After Example: With df.with_columns(), the DataFrame now has an additional column! ➕

New Column: 'new_column' added.

Challenge: 🌟 Try adding a column that combines values from multiple existing columns.


5. Grouping Data (df.groupby) 📊

Boilerplate Code:

df.groupby("column_name").agg([pl.col("another_column").mean()])

Use Case: Group data by a column and apply aggregate functions like mean, sum, or count. 📊

Goal: Summarize data by grouping similar entries and applying calculations. 🎯

Sample Code:

# Group by column2 and calculate the mean of column1
df.groupby("column2").agg([
    pl.col("column1").mean()
])

Before Example: has unsummarized data and wants to compute statistics for each group. 🤔

DataFrame: ungrouped data.

After Example: With df.groupby(), the data is grouped, and the mean is calculated for each group! 📊

Grouped DataFrame: mean of column1 by column2.

Challenge: 🌟 Try grouping by multiple columns and applying multiple aggregate functions.


6. Sorting Data (df.sort) 🔢

Boilerplate Code:

df.sort("column_name", reverse=True)

Use Case: Sort the DataFrame by one or more columns, either in ascending or descending order. 🔢

Goal: Rearrange your data for better visualization or analysis. 🎯

Sample Code:

# Sort by column1 in descending order
df.sort("column1", reverse=True)

Before Example: data is unsorted, making it harder to analyze. 🤔

DataFrame: unsorted.

After Example: With df.sort(), the data is sorted in the desired order! 🔢

Sorted DataFrame: column1 sorted in descending order.

Challenge: 🌟 Try sorting by multiple columns, with one in ascending and the other in descending order.


7. Joining DataFrames (df.join) 🔗

Boilerplate Code:

df1.join(df2, on="column_name", how="inner")

Use Case: Join two DataFrames on a common column to combine data. 🔗

Goal: Merge data from two sources based on shared columns. 🎯

Sample Code:

# Perform an inner join on column1
df1.join(df2, on="column1", how="inner")

Before Example: has two separate DataFrames but needs to merge them into one. 🤷‍♂️

Two separate DataFrames.

After Example: With df.join(), the DataFrames are now combined into one! 🔗

Joined DataFrame: merged on column1.

Challenge: 🌟 Try different join types like how="left" or how="outer" to see how the output changes.


8. Pivoting Data (df.pivot) 🔄

Boilerplate Code:

df.pivot(values="value_column", index="index_column", columns="pivot_column")

Use Case: Pivot your data to reshape it, turning unique values into columns. 🔄

Goal: Rearrange your DataFrame from long format to wide format. 🎯

Sample Code:

# Pivot the DataFrame
df.pivot(values="value_column", index="index_column", columns="pivot_column")

Before Example: has data in long format but wants to transform it into a more readable structure. 🤔

Long format DataFrame: stacked rows.

After Example: With df.pivot(), the data is reshaped into a more readable format! 🔄

Pivoted DataFrame: values turned into columns.

Challenge: 🌟 Try applying different aggregation functions during pivoting (e.g., sum or mean).


9. Lazy Evaluation (df.lazy) 💤

Boilerplate Code:

df.lazy().select(...)

Use Case: Use lazy evaluation to defer execution of operations until explicitly needed, improving performance. 💤

Goal: Chain multiple operations without executing them immediately. 🎯

Sample Code:

# Use lazy evaluation
df.lazy().filter(pl.col("column1") > 1).select("column2").collect()

Before Example: Run every operation immediately, slowing down performance with large datasets. 🐢

Eager

 evaluation: every step executed immediately.

After Example: With lazy evaluation, operations are deferred and executed in one go! 💤

Lazy evaluation: operations executed only when collected.

Challenge: 🌟 Try chaining multiple operations together and see the performance improvement.


10. Exploratory Data Analysis (df.describe) 🔍

Boilerplate Code:

df.describe()

Use Case: Get a quick summary of your DataFrame for exploratory data analysis. 🔍

Goal: View statistics like count, mean, and standard deviation for numeric columns. 🎯

Sample Code:

# Get summary statistics
df.describe()

Before Example: Has a DataFrame but no quick overview of its statistics. 🤔

DataFrame: raw data, no summary.

After Example: With df.describe(), the intern gets a useful summary of the data! 🔍

DataFrame Summary: count, mean, std, min, max.

Challenge: 🌟 Try getting descriptive statistics for specific columns only.


11. Renaming Columns (df.rename) 🔤

Boilerplate Code:

df.rename({"old_column": "new_column"})

Use Case: Rename columns in your DataFrame for clarity or consistency. 🔤

Goal: Change column names to something more descriptive or standardized. 🎯

Sample Code:

# Rename a column
df.rename({"column1": "renamed_column1"})

Before Example: has unclear or inconsistent column names. 🤔

Columns: ["column1", "column2"]

After Example: With df.rename(), the columns have more descriptive names! 🔤

Renamed Columns: ["renamed_column1", "column2"]

Challenge: 🌟 Try renaming multiple columns at once.


12. Handling Null Values (df.fill_null) 🚫

Boilerplate Code:

df.fill_null("default_value")

Use Case: Handle missing values by filling them with a default value. 🚫

Goal: Replace null or missing values with something meaningful to avoid issues in analysis. 🎯

Sample Code:

# Fill null values with a default value
df.fill_null(0)

Before Example: DataFrame has null values that can cause errors in calculations. 😬

DataFrame: some rows with null values.

After Example: With df.fill_null(), the missing values are filled in with a default! 🚫

Filled DataFrame: null values replaced with 0.

Challenge: 🌟 Try using different fill strategies like filling with the column mean or forward filling (ffill).


13. Concatenating DataFrames (pl.concat) 🛠️

Boilerplate Code:

pl.concat([df1, df2], how="vertical")

Use Case: Concatenate two or more DataFrames either vertically or horizontally. 🛠️

Goal: Combine multiple DataFrames into one for easier analysis. 🎯

Sample Code:

# Concatenate two DataFrames vertically
pl.concat([df1, df2], how="vertical")

Before Example: has separate DataFrames but wants to combine them. 🤷‍♂️

Two separate DataFrames: df1, df2

After Example: With pl.concat(), the DataFrames are now combined! 🛠️

Concatenated DataFrame: df1 and df2 merged.

Challenge: 🌟 Try horizontal concatenation by setting how="horizontal".


14. Window Functions (df.select with .over) 🔄

Boilerplate Code:

df.select([pl.col("column").sum().over("group_column")])

Use Case: Apply window functions like moving averages or rolling sums within groups. 🔄

Goal: Perform calculations over a subset of rows based on a window. 🎯

Sample Code:

# Apply a rolling sum over groups
df.select([pl.col("column1").sum().over("column2")])

Before Example: needs to calculate a sum or average for rows within each group. 🤔

Grouped DataFrame: no summary within groups.

After Example: With window functions, the intern can compute rolling calculations within groups! 🔄

Windowed DataFrame: sum of column1 over groups in column2.

Challenge: 🌟 Try using other window functions like .mean() or .max().


15. Date and Time Manipulation (pl.date_range, pl.col.dt) 📅

Boilerplate Code:

pl.date_range(start="2023-01-01", end="2023-01-10", interval="1d")

Use Case: Manipulate date and time data, such as creating date ranges or extracting parts of a date. 📅

Goal: Work with date data to generate time-series data or extract specific parts like year or month. 🎯

Sample Code:

# Create a date range
pl.date_range(start="2023-01-01", end="2023-01-10", interval="1d")

# Extract year from a date column
df.with_columns(pl.col("date_column").dt.year())

Before Example: has date data but struggles with extracting specific date components. 🤔

Dates: full date format (YYYY-MM-DD).

After Example: With date manipulation, the intern can now extract or manipulate specific date components! 📅

Extracted Year: [2023, 2023, ...]

Challenge: 🌟 Try extracting other parts of the date like month, day, or week using .dt.


16. Cumulative Operations (df.select with .cumsum)

Boilerplate Code:

df.select([pl.col("column").cumsum()])

Use Case: Perform cumulative operations like cumulative sum or cumulative product. ➕

Goal: Calculate cumulative values across rows to see how a value builds up over time. 🎯

Sample Code:

# Calculate cumulative sum for a column
df.select([pl.col("column1").cumsum()])

Before Example: Need to track the running total but only has individual values. 🤔

Data: individual values [10, 20, 30]

After Example: With cumulative operations, the intern gets a running total! ➕

Cumulative Sum: [10, 30, 60]

Challenge: 🌟 Try applying cumulative product (.cumprod()) for a different calculation.


17. Melt (Wide to Long Format) 🔄

Boilerplate Code:

df.melt(id_vars="id_column", value_vars=["col1", "col2"])

Use Case: Melt DataFrames from wide to long format, useful for pivoting data. 🔄

Goal: Transform your DataFrame by melting multiple columns into rows. 🎯

Sample Code:

# Melt the DataFrame
df.melt(id_vars="id_column", value_vars=["col1", "col2"])

Before Example: Thas data in wide format but needs to convert it to long format. 🤔

Wide Format: col1, col2 as columns.

After Example: With melt, the DataFrame is transformed into long format! 🔄

Long Format: col1, col2 as rows.

Challenge: 🌟 Try melting a DataFrame with more columns and using different id_vars.


18. Pivot (Long to Wide Format) 🔄

Boilerplate Code:

df.pivot(values="value_column", index="index_column", columns="pivot_column")

Use Case: Pivot DataFrames from long to wide format, turning unique values into columns. 🔄

Goal: Reshape your DataFrame by pivoting data for easier analysis. 🎯

Sample Code:

# Pivot the DataFrame
df.pivot(values="value_column", index="index_column", columns="pivot_column")

Before Example: has data in long format but needs to reshape it into wide format. 🤔

Long Format: rows with duplicated entries.

After Example: With pivot, the DataFrame is transformed into wide format! 🔄

Wide Format: values turned into columns.

Challenge: 🌟 Try applying different aggregate functions during the pivot process, like sum or mean.


19. Exploding Columns (df.explode) 💥

Boilerplate Code:

df.explode("list_column")

Use Case: Explode a list or array column into multiple rows, flattening nested data. 💥

Goal: Convert a column of lists into individual rows for further analysis. 🎯

Sample Code:

# Explode a list column into separate rows
df.explode("list_column")

Before Example: has a column with lists but can’t easily analyze the data inside. 🤔

Data: [1, [2, 3], 4]

After Example: With explode(), each element in the list becomes its own row! 💥

Exploded DataFrame: individual elements [2, 3] in separate rows.

Challenge: 🌟 Try exploding multiple list columns simultaneously.


20. Reversing Columns (df.select with .reverse) 🔄

Boilerplate Code:

df.select([pl.col("column_name").reverse()])

Use Case: Reverse the order of values in a column or DataFrame. 🔄

Goal: Flip the order of rows for better analysis or reporting. 🎯

Sample Code:

# Reverse the order of a column
df.select([pl.col("column1").reverse()])

Before Example: The intern’s DataFrame is sorted, but they want to reverse the order. 🤔

DataFrame: ascending order.

After Example: With reverse(), the DataFrame is now in descending order! 🔄

Reversed DataFrame: rows flipped.

Challenge: 🌟 Try reversing specific columns while leaving others unchanged.


More from this blog

Anix Lynch – Technical Notes & Engineering Playbooks

176 posts

Deploying mode 🚀 one csv at a time.