When XML or HTML content contains characters outside the Basic Multilingual Plane (such as many emoji or rare scripts), which UTF-16 represents as surrogate pairs, Python's lxml library may fail to parse it if the input is not properly encoded. lxml expects validly encoded input such as UTF-8 or UTF-16, so surrogate code units that reach the parser without being decoded into real characters can cause parsing errors.
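To make the failure mode concrete, here is a minimal standard-library sketch. It assumes the text arrived as a Python `str` that still holds two separate surrogate code units (for example after a lossy conversion from UTF-16). The helper name `merge_surrogate_pairs` is hypothetical and not part of lxml.

```python
def merge_surrogate_pairs(text: str) -> str:
    """Combine UTF-16 surrogate pairs held in a str into real code points."""
    # "surrogatepass" lets the encoder emit surrogate code units unchanged;
    # decoding the result as UTF-16 then joins each high/low pair.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


# Two surrogate code units that together stand for U+1F60A.
broken = "<root>Hello \ud83d\ude0a</root>"

try:
    broken.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False  # a str containing surrogates cannot be encoded as UTF-8

fixed = merge_surrogate_pairs(broken)
xml_bytes = fixed.encode("utf-8")  # now safe to hand to a parser as bytes
```

An unpaired surrogate cannot be merged this way: the final `decode("utf-16")` raises `UnicodeDecodeError`, so such input has to be cleaned or rejected first.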
How to make lxml read surrogate pairs

Use bytes input with lxml

Pass bytes to lxml.etree.fromstring or lxml.etree.parse.

```python
from lxml import etree

# Example string with an emoji (a character that UTF-16 stores as a surrogate pair)
xml_string = "<root>Hello, world! 😊</root>"

# Encode to UTF-8 bytes
xml_bytes = xml_string.encode("utf-8")

# Parse from bytes
root = etree.fromstring(xml_bytes)
print(etree.tostring(root, encoding="unicode"))
```
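If the content comes from a file, read it in binary mode and decode it before parsing:

```python
from lxml import etree

with open("file.xml", "rb") as f:
    xml_bytes = f.read()

# Decode as UTF-16 if needed (this assumes the file really is UTF-16)
xml_string = xml_bytes.decode("utf-16")

# Re-encode as UTF-8 bytes and parse
root = etree.fromstring(xml_string.encode("utf-8"))
```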
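Decoding as UTF-16 unconditionally fails on files that are actually UTF-8. One possible way to decide is to look for a byte order mark first. The sketch below uses only the standard library, assumes the file is either UTF-16 with a BOM or UTF-8, and uses a hypothetical helper name, `decode_xml_bytes`.

```python
import codecs


def decode_xml_bytes(data: bytes) -> str:
    """Decode raw file bytes, using UTF-16 only when a BOM says so."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and drops it.
        return data.decode("utf-16")
    # "utf-8-sig" strips a UTF-8 BOM if one is present.
    return data.decode("utf-8-sig")


utf16_sample = "<root>\U0001F60A</root>".encode("utf-16")  # includes a BOM
utf8_sample = "<root>\U0001F60A</root>".encode("utf-8")

text_from_utf16 = decode_xml_bytes(utf16_sample)
text_from_utf8 = decode_xml_bytes(utf8_sample)
```

Both calls yield the same `str`, which can then be encoded to UTF-8 bytes and passed to the parser. Files in other encodings (UTF-32, or UTF-16 without a BOM) are outside the assumptions of this sketch.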
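Use recover=True for malformed input

If the input is malformed, pass recover=True to the parser to ignore errors:

```python
from lxml import etree

# xml_string is the document text from the examples above
parser = etree.XMLParser(recover=True)
root = etree.fromstring(xml_string, parser=parser)
```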
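You can also normalize the string to NFC before encoding and parsing it. Normalization composes combining character sequences; it does not repair unpaired surrogates, which still fail at the encode step.

```python
import unicodedata

from lxml import etree

# Normalize to NFC, then parse the UTF-8 bytes
normalized_string = unicodedata.normalize("NFC", xml_string)
root = etree.fromstring(normalized_string.encode("utf-8"))
```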
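As a small illustration of what NFC normalization does and does not change, the following standard-library sketch compares a combining sequence with an unpaired surrogate. The variable names are illustrative only.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: two code points
decomposed = "e\u0301"
# NFC composes them into the single code point U+00E9
composed = unicodedata.normalize("NFC", decomposed)

# An unpaired high surrogate passes through normalization unchanged
lone = "\ud83d"
still_lone = unicodedata.normalize("NFC", lone)
```

Here `composed` is a single precomposed character, while `still_lone` is the same unencodable surrogate as before, so surrogate handling has to happen separately from normalization.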
In summary: use bytes input for lxml.etree.fromstring or lxml.etree.parse to avoid encoding issues. If the input is malformed, use recover=True to parse it anyway.