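When working with XML or HTML content that contains surrogate pairs (the two-code-unit sequences UTF-16 uses for characters outside the Basic Multilingual Plane, such as many emoji and rare scripts), Python's lxml library can fail to parse it. A correctly decoded Python string stores such a character as a single code point, which lxml handles without trouble. Problems arise when the string holds the two surrogate halves (U+D800 to U+DFFF) as separate code points, or a lone half: these cannot be encoded as UTF-8 with the default error handler, and XML does not allow surrogate code points as characters.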
How to Make lxml Read Surrogate Pairs
1. Encode the Input Properly
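- Ensure your input is encoded as UTF-8 or UTF-16 bytes before passing it to lxml.
- If your string contains surrogate pairs as separate code points, encode it as UTF-16 with the surrogatepass error handler and decode it back to a string; the round trip merges each pair into one code point. Alternatively, keep the data as bytes from the start.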
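The UTF-16 round trip described above can be sketched as follows. The helper name `merge_surrogate_pairs` is made up for illustration; the `surrogatepass` error handler is what lets Python encode the surrogate halves in the first place.

```python
def merge_surrogate_pairs(text: str) -> str:
    """Combine high/low surrogate pairs into single code points.

    Raises UnicodeDecodeError if the text contains an unpaired surrogate.
    """
    # "surrogatepass" writes each surrogate as a raw UTF-16 code unit;
    # the strict decode then reads every valid pair as one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


broken = "<root>\ud83d\ude00</root>"   # emoji stored as two surrogate code points
fixed = merge_surrogate_pairs(broken)  # emoji stored as one code point
xml_bytes = fixed.encode("utf-8")      # now safe to encode and hand to a parser
```

The resulting `xml_bytes` can then be passed to lxml.etree.fromstring in the usual way.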
2. Use bytes Input with lxml
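- Encode the string to bytes (UTF-8 or UTF-16) and pass the bytes to lxml.etree.fromstring. lxml.etree.parse takes a file name or file-like object rather than the document text. A string that still contains surrogate code points raises UnicodeEncodeError when encoded to UTF-8, so merge the pairs first.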
from lxml import etree
# Example string with a character outside the BMP (an emoji, written as an escape)
xml_string = "<root>Hello, world! \U0001F600</root>"
# Encode to UTF-8 bytes
xml_bytes = xml_string.encode("utf-8")
# Parse from bytes
root = etree.fromstring(xml_bytes)
print(etree.tostring(root, encoding="unicode"))
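Before encoding, it can help to check whether a string actually contains surrogate code points, since that is what makes the UTF-8 encode step fail. The following is a minimal sketch using only the standard library; both helper names are hypothetical.

```python
def surrogate_positions(text: str) -> list[int]:
    """Return the indexes of all surrogate code points (U+D800 to U+DFFF)."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def can_encode_utf8(text: str) -> bool:
    """Report whether the text survives a strict UTF-8 encode."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


good = "<root>\U0001F600</root>"  # one real code point
bad = "<root>\ud83d\ude00</root>"  # two surrogate code points
print(surrogate_positions(good), can_encode_utf8(good))
print(surrogate_positions(bad), can_encode_utf8(bad))
```

An empty position list means the string can be encoded and parsed as shown in the example above.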
3. Handle Surrogate Pairs in UTF-16
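- If your data is in UTF-16 (e.g., from a file or API), read it as bytes and pass the bytes to lxml directly. Decoding and re-encoding as UTF-8 can conflict with an encoding="UTF-16" XML declaration in the document:
from lxml import etree
# Read the raw bytes; the parser detects UTF-16 from the BOM or XML declaration
with open("file.xml", "rb") as f:
    xml_bytes = f.read()
root = etree.fromstring(xml_bytes)
# Decode only if you need the text itself, e.g. for inspection
xml_string = xml_bytes.decode("utf-16")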
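As a small illustration of why UTF-16 data itself is not the problem: surrogate pairs exist only in the encoded bytes, and a correct decode turns each pair back into one code point. This sketch uses only the standard library, and `utf16_code_units` is a hypothetical helper.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store the text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


document = "<root>\U0001F600</root>"
raw = document.encode("utf-16")  # bytes with a BOM, as a UTF-16 file would hold
decoded = raw.decode("utf-16")   # the BOM selects the byte order and is dropped

print(len(document))               # code points in the string
print(utf16_code_units(document))  # code units in UTF-16 (the emoji takes two)
print(decoded == document)
```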
4. Use recover=True for Malformed Input
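- If you expect malformed XML, you can create a parser with recover=True, which makes it try to continue past errors instead of raising. The recovered tree may be missing content, and the root can be None if nothing usable was found. This option does not repair surrogate code points in a Python string, so fix or replace those before parsing:
parser = etree.XMLParser(recover=True)
root = etree.fromstring(xml_bytes, parser=parser)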
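For broken surrogate pairs specifically, one option is to clean the string before it reaches the parser. The sketch below merges valid pairs and replaces unpaired surrogates with U+FFFD. The function name and the choice of replacement character are assumptions made for illustration, not something lxml prescribes.

```python
import re

# A high surrogate immediately followed by a low surrogate forms a valid pair.
_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")
# Any surrogate still present after merging is unpaired.
_LONE = re.compile("[\ud800-\udfff]")


def sanitize_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Merge valid surrogate pairs and replace unpaired surrogates."""
    merged = _PAIR.sub(
        lambda m: m.group(0).encode("utf-16", "surrogatepass").decode("utf-16"),
        text,
    )
    return _LONE.sub(replacement, merged)


cleaned = sanitize_surrogates("<root>\ud83d\ude00 and \ud83d</root>")
xml_bytes = cleaned.encode("utf-8")  # no longer raises UnicodeEncodeError
```

Replacing rather than dropping the unpaired surrogate keeps a visible marker where data was lost; passing `replacement=""` would drop it instead.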
5. Normalize the String (Optional)
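- Unicode normalization (NFC) composes base characters with their combining marks; it does not merge surrogate pairs. If you want both, merge the pairs first and normalize afterwards:
import unicodedata
# Merge surrogate pairs first; NFC alone leaves them untouched
merged_string = xml_string.encode("utf-16", "surrogatepass").decode("utf-16")
normalized_string = unicodedata.normalize("NFC", merged_string)
root = etree.fromstring(normalized_string.encode("utf-8"))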
Key Takeaways
- Merge surrogate pairs into single code points (UTF-16 round trip with surrogatepass) before encoding a string for lxml.
- Use bytes input for lxml.etree.fromstring, or pass a file to lxml.etree.parse, to avoid encoding issues.
- If the input is malformed, use recover=True to parse what can be recovered, and check the result for missing content.
- Normalization (NFC) is optional and does not merge surrogate pairs on its own.