Python Regular Expressions: An Introduction with Practical Techniques

Original · 2025-08-14 09:28:55 · Programming

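Regular expressions are a powerful tool for processing strings, and Python provides them through the re module. They help developers match, search, replace, and validate strings, and are widely used in web scraping, data cleaning, log analysis, and similar work. In this article, ZHANID工具网 starts from the basic syntax and uses practical cases to walk through the core techniques of Python regular expressions.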

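I. Basic Regular Expression Syntax

1. Metacharacters and Special Symbols

A regular expression is made up of ordinary characters and metacharacters. Metacharacters carry special meaning and define the matching rules:

.: Matches any single character except a newline. For example, a.b matches aab and acb, but not a\nb.
^: Matches the start of the string. For example, ^Hello only matches strings that begin with Hello.
$: Matches the end of the string (or the position just before a trailing newline). For example, world$ only matches strings that end with world.
*: Matches the preceding element zero or more times. For example, ab* matches a, ab, and abb.
+: Matches the preceding element one or more times. For example, ab+ matches ab and abb, but not a.
?: Matches the preceding element zero or one time. For example, ab? matches a or ab.
{n}: Matches the preceding element exactly n times. For example, \d{3} matches exactly three digits.
{n,}: Matches the preceding element at least n times. For example, \d{3,} matches three or more digits.
{n,m}: Matches the preceding element at least n and at most m times. For example, \d{2,4} matches two to four digits.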
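
As a quick check of the quantifiers listed above, the sketch below runs each one against a few sample strings. The sample strings and variable names are illustrative and not part of the original article; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

samples = ["a", "ab", "abb", "abc"]

# Keep only the samples that each pattern matches in full
star = [s for s in samples if re.fullmatch(r"ab*", s)]
plus = [s for s in samples if re.fullmatch(r"ab+", s)]
opt = [s for s in samples if re.fullmatch(r"ab?", s)]

print(star)  # ['a', 'ab', 'abb']
print(plus)  # ['ab', 'abb']
print(opt)   # ['a', 'ab']

# {2,4} is greedy: it takes up to four digits at a time
runs = re.findall(r"\d{2,4}", "1 22 333 4444 55555")
print(runs)  # ['22', '333', '4444', '5555']
```

Note that the five-digit run yields only its first four digits, and the leftover single digit is too short to match.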

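2. Character Classes and Escaping

[ ]: Defines a character set and matches any one character in it. For example, [abc] matches a, b, or c; [0-9] matches any ASCII digit.
[^ ]: Negated character set; matches any character not in the set. For example, [^0-9] matches a non-digit character.
\: Escapes a special character. For example, to match a literal . write \.; to match a literal ( write \(. Writing patterns as raw strings (r'...') keeps Python from interpreting the backslashes before the regex engine sees them.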
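
The following sketch illustrates character sets and escaping, assuming a few made-up input strings. It also uses re.escape, a standard-library helper that escapes every metacharacter in a string so it can be matched literally.

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
non_digits = re.findall(r"[^0-9]+", "ab12cd")
print(vowels)      # ['e', 'u', 'a']
print(non_digits)  # ['ab', 'cd']

# An unescaped dot matches any character; an escaped dot matches only "."
loose = re.fullmatch(r"1.5", "1x5") is not None
strict = re.fullmatch(r"1\.5", "1x5") is not None
print(loose, strict)  # True False

# re.escape builds a pattern that matches the given text literally
literal = "a.b(c)"
escaped_ok = re.fullmatch(re.escape(literal), literal) is not None
print(escaped_ok)  # True
```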

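3. Predefined Character Classes

Python regular expressions provide shorthand symbols for common character classes. For str patterns in Python 3 these are Unicode-aware by default; the ASCII equivalents given below hold exactly only when the re.ASCII flag is set:

\d: Matches a digit; equivalent to [0-9] under re.ASCII.
\D: Matches a non-digit; equivalent to [^0-9] under re.ASCII.
\w: Matches letters, digits, and the underscore; equivalent to [a-zA-Z0-9_] under re.ASCII. Without that flag it also matches word characters from other scripts, such as Chinese characters.
\W: Matches a non-word character; equivalent to [^a-zA-Z0-9_] under re.ASCII.
\s: Matches a whitespace character (space, tab, newline, and so on).
\S: Matches a non-whitespace character.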
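
To make the effect of Unicode-aware shorthand classes concrete, here is a small sketch comparing the default behavior with re.ASCII. The input strings are illustrative; the full-width digits are written as escapes so the example does not depend on how the file is encoded.

```python
import re

text = "name_1 邓哥"

# Default (Unicode) mode: \w also matches Chinese characters
unicode_words = re.findall(r"\w+", text)
# ASCII mode: \w is limited to [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, re.ASCII)
print(unicode_words)  # ['name_1', '邓哥']
print(ascii_words)    # ['name_1']

# Full-width digits (U+FF11..U+FF13) count as digits only in Unicode mode
fullwidth = "\uff11\uff12\uff13"
unicode_digits = re.findall(r"\d+", fullwidth)
ascii_digits = re.findall(r"\d+", fullwidth, re.ASCII)
print(len(unicode_digits), len(ascii_digits))  # 1 0
```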

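4. Groups and Backreferences

( ): Creates a group that captures the matched text. For example, (\d{3})-(\d{4}) captures the two parts of a number such as 555-1234.
\num: Refers back to the text matched by group number num. For example, (\w+)\s+\1 matches a repeated word (such as hello hello).
|: Alternation; matches either the expression on the left or the one on the right. For example, apple|pear matches apple or pear.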
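
The three constructs above can be sketched as follows. The sample sentences are invented for illustration, and \b word boundaries are added to the backreference pattern so that it only fires on whole words.

```python
import re

# Capturing groups: group(0) is the whole match, group(1..n) are the parts
m = re.search(r"(\d{3})-(\d{4})", "Call 555-1234 today")
whole, parts = m.group(0), m.groups()
print(whole)  # 555-1234
print(parts)  # ('555', '1234')

# Backreference: \1 must repeat exactly what group 1 matched
dup = re.search(r"\b(\w+)\s+\1\b", "say hello hello again")
repeated = dup.group(1)
print(repeated)  # hello

# Alternation: either side may match
fruits = re.findall(r"apple|pear", "pear, apple, plum")
print(fruits)  # ['pear', 'apple']
```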

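II. Core Functions of Python's `re` Module

1. Matching and Searching

re.match(pattern, string): Tries to match the pattern at the start of the string only. Returns a match object on success, otherwise None.

import re
result = re.match(r'^Hello', 'Hello world')
if result:
    print("Match succeeded:", result.group()) # Output: Match succeeded: Hello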

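re.search(pattern, string): Scans the whole string and returns a match object for the first match, or None if there is none.

result = re.search(r'world', 'Hello world')
if result:
    print("Match found:", result.group()) # Output: Match found: world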
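
The difference between the two functions is easiest to see on the same input. The sketch below also includes re.fullmatch, which requires the entire string to match; the variable names are illustrative.

```python
import re

text = "Hello world"

at_start = re.match(r"world", text)      # "world" is not at position 0
anywhere = re.search(r"world", text)     # found later in the string
entire = re.fullmatch(r"Hello", text)    # pattern does not cover " world"

print(at_start)          # None
print(anywhere.span())   # (6, 11)
print(entire)            # None
```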

2. Finding and Replacing

re.findall(pattern, string): Returns a list of all non-overlapping matches.

numbers = re.findall(r'\d+', 'abc123def456')
print(numbers) # Output: ['123', '456']
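
One detail of re.findall that often surprises newcomers is that its return shape depends on the capturing groups in the pattern. A minimal sketch with an invented input string, plus re.finditer for cases where match positions are needed:

```python
import re

text = "x=1, y=22"

# No groups: a list of whole matches
whole = re.findall(r"\w=\d+", text)
# Two groups: a list of tuples, one item per group
pairs = re.findall(r"(\w)=(\d+)", text)
print(whole)  # ['x=1', 'y=22']
print(pairs)  # [('x', '1'), ('y', '22')]

# finditer yields match objects, which also expose positions
starts = [m.start() for m in re.finditer(r"\w=\d+", text)]
print(starts)  # [0, 5]
```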

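re.sub(pattern, repl, string): Replaces every match with repl. The example strips Chinese filler particles from a sentence:

# The text reads roughly: "The car owner said: your brake pads should be replaced (啊), (嘿嘿)"
text = '车主说:你的刹车片应该更换了啊,嘿嘿'
cleaned = re.sub(r'[呢吧哈啊啦嘿]', '', text) # a character class covers every single-character particle
print(cleaned) # Output: 车主说:你的刹车片应该更换了,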
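
re.sub also accepts a count limit and a function as the replacement, and re.subn reports how many substitutions were made. A short sketch on an illustrative string:

```python
import re

text = "a1b22c333"

all_replaced = re.sub(r"\d+", "#", text)
first_only = re.sub(r"\d+", "#", text, count=1)
with_count = re.subn(r"\d+", "#", text)
# A callable receives each match object and returns the replacement text
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(all_replaced)  # a#b#c#
print(first_only)    # a#b22c333
print(with_count)    # ('a#b#c#', 3)
print(doubled)       # a2b44c666
```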

3. Splitting Strings

re.split(pattern, string): Splits the string wherever the pattern matches.

parts = re.split(r'\s+', 'Hello  world\nPython')
print(parts) # Output: ['Hello', 'world', 'Python']
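
A few behaviors of re.split are worth knowing before relying on it: maxsplit limits the number of splits, a capturing group keeps the separators in the result, and a separator at the very start produces an empty first item. The inputs below are made up for illustration.

```python
import re

by_comma = re.split(r"\s*,\s*", "a , b,c")
limited = re.split(r"\s+", "a b c", maxsplit=1)
keep_sep = re.split(r"(,)", "a,b")
leading = re.split(r"\s+", " a b")

print(by_comma)  # ['a', 'b', 'c']
print(limited)   # ['a', 'b c']
print(keep_sep)  # ['a', ',', 'b']
print(leading)   # ['', 'a', 'b']
```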

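III. Practical Techniques and Advanced Usage

1. Named Capturing Groups

The (?P<name>...) syntax gives a group a name, which makes the code easier to read:

# The text reads: "My name is 邓哥, I am 30 this year."
text = "我的名字是邓哥,今年30岁。"
pattern = r"我的名字是(?P<name>\w+),今年(?P<age>\d+)岁。" # \w matches Chinese characters in Unicode mode
match = re.search(pattern, text)
if match:
    print(f"Name: {match.group('name')}, Age: {match.group('age')}")
# Output: Name: 邓哥, Age: 30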
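
Named groups can also be read back as a dictionary with groupdict() and referenced in a replacement string as \g<name>. A minimal sketch, assuming an invented sentence containing a date:

```python
import re

date_re = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
text = "Released on 2025-08-14"

m = re.search(date_re, text)
fields = m.groupdict()
print(fields)  # {'year': '2025', 'month': '08', 'day': '14'}

# \g<name> inserts the text captured by the named group
swapped = re.sub(date_re, r"\g<day>/\g<month>/\g<year>", text)
print(swapped)  # Released on 14/08/2025
```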

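2. Non-Capturing Groups

(?:...) groups a subexpression without capturing it. The engine does not store the group's text, and the numbering of the remaining groups stays uncluttered:

text = "apple banana orange"
pattern = r"(?:apple|banana) orange"
match = re.search(pattern, text)
if match:
    print(match.group(0)) # Output: banana orange ("apple" is not directly followed by " orange")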
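
The practical difference shows up most clearly with re.findall, which returns group contents when the pattern has a capturing group and whole matches when it does not. The input string here is illustrative.

```python
import re

text = "apple pie, banana pie"

# Capturing group: findall returns only what the group captured
captured = re.findall(r"(apple|banana) pie", text)
# Non-capturing group: findall returns the full match
full = re.findall(r"(?:apple|banana) pie", text)

print(captured)  # ['apple', 'banana']
print(full)      # ['apple pie', 'banana pie']
```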

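3. Zero-Width Assertions

Positive lookahead ((?=...)): Matches a position that is immediately followed by the given pattern, without consuming it.

text = "Windows95, Windows98, WindowsXP"
matches = re.findall(r"Windows(?=95|98|NT|2000)", text)
print(matches) # Output: ['Windows', 'Windows']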

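Negative lookahead ((?!...)): Matches a position that is not immediately followed by the given pattern.

text = "user123, admin456, guest789"
matches = re.findall(r"\b(?!admin)\w+", text) # words that do not start with "admin"
print(matches) # Output: ['user123', 'guest789']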
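
One point worth checking when a greedy quantifier sits in front of a lookahead is how much the quantifier consumes before the assertion is tested. The sketch below, on an illustrative input, shows two patterns that look like they should strip trailing digits but do not, next to a positive lookahead that does.

```python
import re

text = "user123"

# \w+ also consumes the digits, so the lookahead is only tested at the end
greedy_word = re.findall(r"\w+(?!\d)", text)
# [a-z]+ backtracks one letter until the next character is not a digit
backtracked = re.findall(r"[a-z]+(?!\d)", text)
# A positive lookahead keeps the letters that are followed by a digit
letters_only = re.findall(r"[a-z]+(?=\d)", text)

print(greedy_word)   # ['user123']
print(backtracked)   # ['use']
print(letters_only)  # ['user']
```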

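4. Greedy and Non-Greedy Matching

Greedy mode (the default): matches as many characters as possible.

text = "hello<div>world</div><div>python</div>"
match = re.search(r'<div>.*</div>', text)
print(match.group()) # Output: <div>world</div><div>python</div> (runs through to the last </div>)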

Non-greedy mode (*?, +?): matches as few characters as possible.

match = re.search(r'<div>.*?</div>', text)
print(match.group()) # Output: <div>world</div> (stops at the first </div>)
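
The same contrast can be seen with re.findall when a string holds several tagged fragments. A small sketch on an invented snippet:

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>, merging both fragments
greedy = re.findall(r"<b>(.*)</b>", html)
# Non-greedy: .*? stops at the nearest </b>, one item per fragment
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```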

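5. Precompiling Regular Expressions

When the same pattern is used many times, compiling it once with re.compile avoids repeated lookups. The re module already caches recently compiled patterns, so the gain is usually modest, but a compiled pattern also gives the expression a reusable name:

pattern = re.compile(r'\d+')
for _ in range(1000):
    result = pattern.match('123') # reuses the compiled pattern object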
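
A compiled pattern object exposes the same operations as the module-level functions, with flags fixed at compile time. The sketch below uses an illustrative pattern named word.

```python
import re

# Flags are supplied once, when the pattern is compiled
word = re.compile(r"[a-z]+", re.IGNORECASE)

found = word.findall("Hello World 42")
masked = word.sub("_", "Hello World 42")
is_word = word.fullmatch("Hello") is not None

print(found)         # ['Hello', 'World']
print(masked)        # _ _ 42
print(is_word)       # True
print(word.pattern)  # [a-z]+
```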

6. Multiline Mode and Unicode Matching

re.MULTILINE: Makes ^ and $ match at the start and end of every line, not just of the whole string.

text = """Line1: Hello
Line2: World"""
matches = re.findall(r'^\w+', text, re.MULTILINE)
print(matches) # Output: ['Line1', 'Line2']
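
To see what the flag changes, the sketch below runs the same end-of-line pattern with and without re.MULTILINE on a two-line string like the one above.

```python
import re

text = "Line1: Hello\nLine2: World"

# Without the flag, $ only matches at the end of the whole string
last_only = re.findall(r"\w+$", text)
# With the flag, $ also matches before each newline
each_line = re.findall(r"\w+$", text, re.MULTILINE)

print(last_only)  # ['World']
print(each_line)  # ['Hello', 'World']
```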

Unicode range matching: for example, matching Chinese characters:

text = "你好，Hello"
chinese_chars = re.findall(r'[\u4e00-\u9fa5]', text)
print(chinese_chars) # Output: ['你', '好']
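
Adding a + quantifier returns contiguous runs of Chinese characters instead of single characters. This sketch uses the upper bound \u9fff, the end of the CJK Unified Ideographs block, which is slightly wider than the range above; the input string is illustrative.

```python
import re

text = "你好，Hello世界"

# One list item per run of consecutive CJK ideographs
runs = re.findall(r"[\u4e00-\u9fff]+", text)
print(runs)  # ['你好', '世界']
```

The full-width comma between the two words lies outside the range, so it separates the runs.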

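IV. Worked Examples

Case 1: Validating a Mobile Phone Number

def validate_phone(number):
    pattern = r'^1[3-9]\d{9}$' # starts with 1, second digit 3-9, 11 digits in total
    return bool(re.fullmatch(pattern, number)) # fullmatch requires the whole string to match

print(validate_phone('13812345678')) # Output: True
print(validate_phone('12345678901')) # Output: False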
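
Because \d is Unicode-aware by default, a validator like the one above also accepts digits from other scripts in the last nine positions. If only ASCII digits should pass, one option is the re.ASCII flag. The function name validate_phone_ascii and the test input are hypothetical, chosen for this sketch.

```python
import re

# re.ASCII restricts \d to 0-9; fullmatch makes ^ and $ unnecessary
PHONE = re.compile(r"1[3-9]\d{9}", re.ASCII)

def validate_phone_ascii(number):
    """Return True only for 11 ASCII digits in the expected shape."""
    return PHONE.fullmatch(number) is not None

# "13" followed by nine full-width digit eights (U+FF18)
fullwidth = "13" + "\uff18" * 9

unicode_result = re.fullmatch(r"1[3-9]\d{9}", fullwidth) is not None
ascii_result = validate_phone_ascii(fullwidth)
normal_result = validate_phone_ascii("13812345678")

print(unicode_result)  # True
print(ascii_result)    # False
print(normal_result)   # True
```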

Case 2: Extracting Error Information from a Log

log = "ERROR at 2025-08-13 10:00:00: File not found: /data/test.txt"
pattern = r'ERROR at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): (.+)'
match = re.search(pattern, log)
if match:
    timestamp, message = match.groups()
    print(f"Time: {timestamp}, Error: {message}")
# Output: Time: 2025-08-13 10:00:00, Error: File not found: /data/test.txt
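
The same idea extends to several log lines. The sketch below assumes a log format like the one above; the extra lines, the INFO level, and the group names ts and msg are invented for illustration.

```python
import re

log_lines = [
    "ERROR at 2025-08-13 10:00:00: File not found: /data/test.txt",
    "INFO at 2025-08-13 10:00:05: Retry scheduled",
    "ERROR at 2025-08-13 10:01:00: Permission denied",
]

error_re = re.compile(
    r"ERROR at (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): (?P<msg>.+)"
)

errors = []
for line in log_lines:
    m = error_re.match(line)  # only lines that start with "ERROR at"
    if m:
        errors.append((m.group("ts"), m.group("msg")))

for ts, msg in errors:
    print(ts, "|", msg)
# 2025-08-13 10:00:00 | File not found: /data/test.txt
# 2025-08-13 10:01:00 | Permission denied
```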

Case 3: Stripping HTML Tags

html = "<p>Hello, <b>World</b>!</p>"
clean_text = re.sub(r'<[^>]+>', '', html) # removes anything between < and >
print(clean_text) # Output: Hello, World!

Case 4: Converting a Date Format

text = "Meeting on 12/25/2022. Deadline is 3/8/2023."
pattern = r'(\d{1,2})/(\d{1,2})/(\d{4})'
new_text = re.sub(pattern, r'\3-\1-\2', text) # reorders to year-month-day; single digits are not zero-padded
print(new_text) # Output: Meeting on 2022-12-25. Deadline is 2023-3-8.
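
If strict YYYY-MM-DD output is needed, a replacement function can zero-pad the month and day, which a plain replacement string cannot do. A minimal sketch; the helper name to_iso is hypothetical.

```python
import re

text = "Meeting on 12/25/2022. Deadline is 3/8/2023."
pattern = r"(\d{1,2})/(\d{1,2})/(\d{4})"

def to_iso(m):
    """Rebuild a M/D/YYYY match as zero-padded YYYY-MM-DD."""
    month, day, year = m.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

iso_text = re.sub(pattern, to_iso, text)
print(iso_text)  # Meeting on 2022-12-25. Deadline is 2023-03-08.
```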