| single |
# url-normalize
[tests]
[Coveralls]
[PyPI]
[Python Versions]
[License]
[Ruff]
A Python library for standardizing and normalizing URLs. Ideal for database
deduplication, caching, web crawling, and anywhere you need to ensure that
equivalent URLs resolve to the exact same string.
```python
from url_normalize import url_normalize
# Fixes IDN, lowercases host/scheme, removes default ports, resolves path
segments
url_normalize("HTTP://User:Pass@www.FOO.com:80///foo/../bar/./baz?q=1#frag")
# -> 'http://User:Pass@www.foo.com/bar/baz?q=1#frag'
```
## Features
url-normalize provides a robust URI normalization function that handles IDN
domains, scheme/host lowercasing, and RFC-compliant path normalization.
- **IDN Support**: Full internationalized domain name handling (using
IDNA2008 with UTS46).
- **Humanization**: Convert normalized URLs to a readable display format
while preserving round-trip normalization.
- **RFC Compliance**:
- Proper percent-encoding (minimal, uppercase hex).
- Dot-segment removal in paths.
- Default port and authority handling.
- UTF-8 NFC normalization.
- **Configurable Defaults**:
- Customizable default scheme (https by default).
- Configurable default domain for absolute paths.
- **Query Parameter Control**:
- Parameter filtering with allowlists.
- Support for domain-specific parameter rules.
- **Versatile URL Handling**: Handles empty strings, double-slash URLs
(//domain.tld), and shebang (#!) URLs.
- **Developer Friendly**:
- Python 3.10+ compatibility.
- 100% test coverage.
- Modern type hints and string handling.
Inspired by Sam Ruby's [urlnorm.py].
## Installation
Install as a library:
```sh
pip install url-normalize
```
Or install as a standalone CLI tool using [uv]:
```sh
uv tool install url-normalize
```
## Usage
### Python API
#### Basic Normalization
```python
from url_normalize import url_normalize
# Basic normalization (uses https by default)
print(url_normalize("www.foo.com:80/foo"))
# Output: https://www.foo.com/foo
# With custom default scheme
print(url_normalize("www.foo.com/foo", default_scheme="http"))
# Output: http://www.foo.com/foo
```
#### Query Parameter Filtering
You can strip out tracking parameters and only keep the ones you care about
using allowlists.
```python
# With query parameter filtering enabled (strips all params by default)
print(url_normalize("www.google.com/search?q=test&utm_source=test",
filter_params=True))
# Output: https://www.google.com/search?q=test
# With custom parameter allowlist as a list
print(url_normalize(
"example.com?page=1&id=123&ref=test",
filter_params=True,
param_allowlist=["page", "id"]
))
|