Python Performance & Memory Optimization Cheat Sheet
Python is beloved for its readability, but its flexibility can lead to significant performance bottlenecks if you aren't careful. Whether you are battling high RAM usage or agonizingly slow loops, optimizing Python requires moving beyond basic syntax. This guide provides a battle-tested roadmap—from "quick win" data structure swaps to advanced JIT compilation—to help you write Python code that is both elegant and lightning-fast.
0. The Golden Rules
- Don't Optimize Prematurely: "Premature optimization is the root of all evil" (Knuth). Readable code is easier to debug and maintain. Only optimize when you have identified a clear problem.
- Measure, Don't Guess: Human intuition about performance is often wrong. Use a profiler to find the actual bottleneck. Optimizing a function that only runs for 1% of the execution time yields negligible gains.
- Algorithmic Complexity Wins: A better algorithm ($O(\log n)$ or $O(1)$) will always beat a micro-optimized bad algorithm ($O(n^2)$). No amount of C-level optimization can fix a bubble sort on a million items.
- I/O is Often the Real Bottleneck: Before optimizing CPU cycles, check if your program is waiting on Disk or Network. If so, CPU optimizations won't help; you need concurrency (see Section 9).
1. Data Structures (Speed & Lookup)
Choosing the right container is the single most impactful decision for performance.
Use set for Lookups
Checking if an item exists in a list requires scanning the whole list ($O(n)$). Sets use hashing ($O(1)$), making lookups instant regardless of size.
Bad:
valid_users = ["alice", "bob", "charlie", "dave", "eve"] * 1000
# Python must scan up to 5000 items to find "zara"
if "zara" in valid_users:
pass
Good:
valid_users = {"alice", "bob", "charlie", "dave", "eve"}
# Python hashes "zara" and jumps directly to that memory slot.
if "zara" in valid_users:
pass
Use collections.deque for Queues
Python lists are implemented as dynamic arrays. Appending to the end is fast ($O(1)$), but inserting or popping from the start requires shifting all other elements in memory ($O(n)$).
Bad:
queue = [1, 2, 3, 4]
queue.pop(0) # Triggers a memory shift of all remaining items
queue.insert(0, 5) # Triggers another memory shift
Good:
from collections import deque
queue = deque([1, 2, 3, 4])
queue.popleft() # O(1) - Instant
queue.appendleft(5) # O(1) - Instant
Use dict for Mappings
If you need to associate keys with values, dictionaries are highly optimized hash maps. Avoid iterating over lists of tuples to find a key.
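A minimal sketch of the difference; the user data here is illustrative:
Bad:
user_ages = [("alice", 31), ("bob", 27), ("charlie", 35)]
# Linear scan over every tuple to find one key: O(n)
age = next(age for name, age in user_ages if name == "bob")
Good:
user_ages = {"alice": 31, "bob": 27, "charlie": 35}
age = user_ages["bob"]  # Hashed lookup: O(1)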
Prefer Tuples over Lists for Fixed Data
Tuples are immutable. They have a smaller memory footprint, and the Python runtime can cache them more effectively than mutable lists.
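A quick way to see the footprint difference is sys.getsizeof; the byte counts in the comments are illustrative and vary by Python version and platform:
import sys
point_list = [3, 4]
point_tuple = (3, 4)
print(sys.getsizeof(point_list))   # e.g. 72 bytes on 64-bit CPython
print(sys.getsizeof(point_tuple))  # e.g. 56 bytes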
2. Looping & Iteration (Speed)
Loops are often where the CPU spends most of its time. Moving logic from Python-space to C-space (the runtime) is key.
List Comprehensions vs. For Loops
List comprehensions are optimized at the C-level. They avoid the overhead of the append attribute lookup and function call inside the loop.
Bad:
squares = []
for i in range(1000):
    # .append is a method lookup every single time
    squares.append(i * i)
Good:
# The loop and construction happen in C
squares = [i * i for i in range(1000)]
Avoid Dot Notation Inside Loops
In Python, accessing obj.method triggers a dictionary lookup every single time. If you do this inside a loop running millions of times, it adds up.
Bad:
result = []
for item in huge_list:
    result.append(item.upper())  # method lookup happens every iteration
Good:
# Cache the method reference locally
func = str.upper
result = [func(item) for item in huge_list]
Use Built-in Functions & itertools
Built-ins like sum(), max(), min(), map(), and filter() are implemented in C. The itertools module provides C-optimized iterators for efficient looping.
Bad:
nested = [[1, 2], [3, 4], [5, 6]]
flat = []
for sublist in nested:
    for item in sublist:
        flat.append(item)
Good:
import itertools
nested = [[1, 2], [3, 4], [5, 6]]
# itertools.chain creates an iterator that loops over the sublists in C
flat = list(itertools.chain.from_iterable(nested))
3. String Manipulation (Speed & Memory)
String Concatenation
Strings are immutable. Every time you do str + str, Python creates a brand-new string object, copying the old contents. Doing this repeatedly in a loop adds up to quadratic time complexity $O(n^2)$.
Bad:
s = ""
for word in words:
    s += word + " "  # Creates a new string object every iteration
Good:
# Calculates total size once, allocates memory once.
s = " ".join(words)
F-Strings
F-strings (Python 3.6+) are not only more readable but also faster than % formatting or .format(). They are evaluated at runtime but optimized to minimize overhead.
Good:
name = "World"
msg = f"Hello {name}"
4. Memory Optimization (RAM)
Reducing memory usage often leads to speed improvements because it reduces CPU cache misses and Garbage Collection overhead.
Generators vs. Lists
This is the single biggest memory tip. Lists store all of their elements in RAM at once. Generators produce one item at a time (lazy evaluation), whether written as generator functions with yield or as generator expressions like the one below.
Bad (Memory Hog):
# Creates a list of 1,000,000 integers in RAM immediately (~40MB)
squares = [i**2 for i in range(1000000)]
for s in squares:
    if s > 100: break  # We wasted memory calculating the rest
Good (Memory Efficient):
# Creates a generator object with a tiny, constant memory footprint regardless of size.
squares = (i**2 for i in range(1000000))
for s in squares:
    if s > 100: break  # No wasted computation or memory
__slots__ in Classes
By default, Python objects store attributes in a dictionary (__dict__). This is flexible but consumes significant RAM. If you have a class where you will create millions of instances, use __slots__ to store attributes in a fixed-size C-array.
Code:
class Point:
    __slots__ = ['x', 'y']  # Define attributes explicitly

    def __init__(self, x, y):
        self.x = x
        self.y = y
Impact: Reduces memory usage per object by ~40-50%. Note: You cannot add new attributes to these objects dynamically after creation.
Use the array module
If you are storing millions of simple numbers and don't need NumPy's math features, use the standard library array module. It stores data as C-types (compact) rather than Python Objects (fat).
import array
# 'd' is for double-precision float. Uses much less RAM than a list of floats.
numbers = array.array('d', [1.0, 2.0, 3.14])
5. Caching & Memoization
The fastest code is the code you don't run.
functools.lru_cache
If you have a function that is computationally expensive and is often called with the same arguments, use the Least-Recently-Used cache decorator.
Code:
from functools import lru_cache
@lru_cache(maxsize=128)
def fib(n):
    if n < 2: return n
    return fib(n-1) + fib(n-2)
Impact: Turns exponential time complexity $O(2^n)$ into linear time $O(n)$ for recursion by storing previous results.
6. Numerical & Scientific Computing (The Pro Level)
Use NumPy over Python Lists
Python lists contain pointers to objects. NumPy arrays contain raw C-data packed contiguously in memory.
- Speed: 10x to 100x faster (uses SIMD CPU instructions).
- Memory: Much smaller footprint.
Bad:
a = [1, 2, 3]
b = [4, 5, 6]
c = [x + y for x, y in zip(a, b)]
Good:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = a + b # Vectorized operation (happens in C, instantly)
Pandas
For tabular data, prefer Pandas over lists of dictionaries; it uses NumPy under the hood. Avoid iterating over rows with iterrows(); use vectorized column operations, and fall back to .apply() only when no vectorized equivalent exists.
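As a minimal sketch of row iteration versus vectorization (the column names and values are illustrative):
Bad:
import pandas as pd
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})
total = 0.0
for _, row in df.iterrows():  # Builds a Series object for every row
    total += row["price"] * row["qty"]
Good:
total = (df["price"] * df["qty"]).sum()  # Vectorized: the loop runs in C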
7. Compiler & JIT Optimizations (Breaking the Speed Limit)
If Python is still too slow, you can compile it.
- Numba: Just add a decorator. It compiles Python functions to machine code (LLVM) at runtime. Excellent for math-heavy loops.

  from numba import jit

  @jit(nopython=True)
  def calculate_pi(n):
      # ... massive loop ...
      return result

- PyPy: A drop-in replacement for the standard interpreter (CPython) that uses a JIT (Just-In-Time) compiler. It can make pure Python code 5x-10x faster with no code changes.
- Cython: Allows you to add static type declarations to Python code and compile it into a C extension (see the sketch below).
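A rough sketch of the Cython style, not from the original guide; the module name is hypothetical, and the file must be compiled (e.g. with cythonize) before it can be imported:
# fast_math.pyx (hypothetical module)
def sum_squares(int n):
    # Static C types let the loop compile down to plain C arithmetic
    cdef long long total = 0
    cdef int i
    for i in range(n):
        total += <long long>i * i
    return total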
8. Profiling Tools (How to find the slow parts)
Don't guess! Use these tools.
- timeit: For micro-benchmarking small snippets.

  import timeit
  timeit.timeit('"-".join(str(n) for n in range(100))', number=10000)

- cProfile: The built-in standard profiler. It shows exactly how many times every function was called and how long each one took.

  python -m cProfile -s time my_script.py

- scalene: A modern, high-precision CPU and memory profiler. It separates Python time from C time (system time) and even measures GPU time (usage shown below).
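Typical scalene usage is a single command; this assumes it was installed with pip install scalene:
scalene my_script.py  # Line-level CPU and memory profile of my_script.py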
9. Concurrency & Parallelism
Python has a Global Interpreter Lock (GIL) which prevents multiple native threads from executing Python bytecodes at once. This means multithreading does not make CPU-bound tasks faster.
CPU-Bound Tasks (Heavy Math/Image Processing)
Use multiprocessing. It spawns separate Python processes, each with its own memory space and GIL, utilizing all CPU cores.
Good:
from multiprocessing import Pool
def heavy_computation(x):
    return x * x

if __name__ == '__main__':
    with Pool() as p:
        # Runs on all available CPU cores
        print(p.map(heavy_computation, range(1000000)))
I/O-Bound Tasks (Network Requests, Disk Writes)
Use threading or asyncio. During blocking I/O calls the GIL is released, so other threads (or coroutines) can make progress while one waits.
Good:
import threading
def fetch_url(url):
    # Network request...
    pass

urls = ["https://example.com"] * 5  # example list of targets

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for t in threads: t.start()
for t in threads: t.join()
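asyncio achieves the same overlap in a single thread with an event loop. A minimal sketch, using asyncio.sleep as a stand-in for a real network call:
import asyncio

async def fetch_url(url):
    await asyncio.sleep(0.1)  # Stands in for a real request (e.g., via aiohttp)
    return url

async def main():
    urls = ["https://example.com"] * 5  # example list of targets
    # Schedule all coroutines concurrently and wait for every result
    results = await asyncio.gather(*(fetch_url(u) for u in urls))
    print(results)

asyncio.run(main())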
Summary Table
| Technique | Area | Complexity Improvement | Difficulty |
|---|---|---|---|
| Sets instead of Lists | Speed | $O(n) \to O(1)$ | Easy |
| `deque` for Queues | Speed | $O(n) \to O(1)$ (pop(0)) | Easy |
| Generators | Memory | $O(n) \to O(1)$ | Easy |
| List Comprehension | Speed | Constant factor | Easy |
| `lru_cache` | Speed | Varies | Easy |
| `__slots__` | Memory | Constant factor | Medium |
| NumPy/Pandas | Speed/Mem | Vectorization | Medium |
| Multiprocessing | Speed | Parallelism (CPU) | Hard |
| Asyncio/Threading | Speed | Concurrency (I/O) | Hard |
| Cython/Numba | Speed | Machine code | Hard |
Related reads
- Check our crash course on FastAPI to learn how to develop fast and performant APIs in Python.