In Python why

```python
from lxml import etree

etree.fromstring('''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<hierarchy a="b&#55357;&#56855;c"/>'''.encode('utf-8'))
```

gives the error:

```
Traceback (most recent call last):
File "<tmp 1>", line 2, in <module>
etree.fromstring('''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
File "src/lxml/etree.pyx", line 3428, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 2066, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1919, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1944, in lxml.etree._parseDoc_bytes
File "src/lxml/parser.pxi", line 1194, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 647, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 765, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 689, in lxml.etree._raiseParseError
File "<string>", line 2
lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value 55357, line 2, column 24
```

?
Feb 28, 3:02am

The error you're encountering is due to an invalid XML character reference in your string. Let's break down what's happening:

The Problem

  • The string b&#55357;&#56855;c contains two XML character references: &#55357; and &#56855;.
  • These references correspond to the code points U+D83D and U+DE17, which together form the UTF-16 surrogate pair for a character outside the Basic Multilingual Plane (BMP): the emoji 😗 (U+1F617).
  • In XML, a character reference must name a valid Unicode character directly. U+D83D is a high surrogate and U+DE17 is a low surrogate. The whole surrogate range U+D800 to U+DFFF is excluded from XML's allowed characters, so neither reference is valid, and the parser never combines the two into one character.
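
The arithmetic that maps a surrogate pair to its real code point can be sketched as follows. This is a minimal illustration using the standard UTF-16 decoding formula; the helper name `combine_surrogates` is hypothetical and not part of lxml.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#x}")
    # Each surrogate carries 10 bits of the value above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two values from the failing document.
code_point = combine_surrogates(55357, 56855)
print(hex(code_point), code_point, chr(code_point))  # 0x1f617 128535 😗
```

The decimal result, 128535, is the value to use in a single character reference.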

Why the Error?

  • XML parsers expect character references to refer to valid Unicode scalar values (i.e., code points that are not surrogate halves).
  • The parser treats &#55357; (U+D83D) and &#56855; (U+DE17) as two separate references, each invalid on its own. It stops at the first one, hence the error:
    xmlParseCharRef: invalid xmlChar value 55357
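
To see which values a character reference may carry, here is a small sketch of the `Char` production from the XML 1.0 specification. The function name `is_xml_char` is an assumption for illustration; libxml2 performs its own internal check, which this only approximates.

```python
def is_xml_char(code_point: int) -> bool:
    """Return True if code_point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF          # BMP below the surrogate range
        or 0xE000 <= code_point <= 0xFFFD        # BMP above the surrogate range
        or 0x10000 <= code_point <= 0x10FFFF     # supplementary planes
    )


# The two surrogate halves are rejected; the combined code point is accepted.
for value in (55357, 56855, 128535):
    print(value, hex(value), is_xml_char(value))
```

The gap between 0xD7FF and 0xE000 is exactly the surrogate range, which is why 55357 is reported as an invalid xmlChar value.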

How to Fix It

You have two options:

1. Use the actual Unicode character (emoji) directly in your XML:

python
from lxml import etree

xml_string = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?> <hierarchy a="b😗c"/>'''
etree.fromstring(xml_string.encode('utf-8'))

This works because the emoji is a single Unicode character.

2. Use the combined code point for the emoji:

The emoji 😗 is U+1F617, the single code point that the surrogate pair encodes. You can reference it directly in your XML:

python
from lxml import etree

xml_string = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?> <hierarchy a="b&#128535;c"/>'''
etree.fromstring(xml_string.encode('utf-8'))

Here, &#128535; is the decimal representation of U+1F617.
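
If you cannot change whatever produces the XML, one possible workaround is to merge surrogate-pair references in the text before parsing. The sketch below uses only the standard `re` module; `merge_surrogate_refs` and its helpers are hypothetical names, and it works on raw text, so it would also rewrite matching sequences inside comments or CDATA sections. Lone surrogate references are left untouched and will still fail to parse.

```python
import re

# One numeric character reference, decimal (&#55357;) or hexadecimal (&#xD83D;).
_REF = r"&#(?:x[0-9A-Fa-f]+|[0-9]+);"
# Two references back to back, captured separately so each can be inspected.
_TWO_REFS = re.compile(f"({_REF})({_REF})")


def _ref_value(ref: str) -> int:
    """Return the integer value of a reference such as '&#55357;' or '&#xD83D;'."""
    body = ref[2:-1]  # strip the leading '&#' and the trailing ';'
    return int(body[1:], 16) if body.startswith("x") else int(body)


def merge_surrogate_refs(xml_text: str) -> str:
    """Rewrite each high+low surrogate reference pair as one valid reference."""
    parts = []
    pos = 0
    while (match := _TWO_REFS.search(xml_text, pos)) is not None:
        high, low = _ref_value(match.group(1)), _ref_value(match.group(2))
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            parts.append(xml_text[pos:match.start()] + f"&#{code_point};")
            pos = match.end()
        else:
            # Not a pair: keep the first reference and retry from the second.
            parts.append(xml_text[pos:match.end(1)])
            pos = match.end(1)
    parts.append(xml_text[pos:])
    return "".join(parts)


broken = '<hierarchy a="b&#55357;&#56855;c"/>'
print(merge_surrogate_refs(broken))  # <hierarchy a="b&#128535;c"/>
```

The repaired string can then be encoded and passed to etree.fromstring as in the original snippet.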


Summary Table

| XML Character Reference | Fixes |
|---|---|
| &#55357;&#56855; | &#128535; or 😗 |

Why This Matters

  • XML parsers do not combine surrogate halves written as separate character references; each reference must name a single valid Unicode character on its own.
  • Always use the actual Unicode code point for characters outside the BMP, or use the character itself if your encoding supports it (UTF-8 does).
