Discussion:
[Docutils-develop] [docutils:bugs] #305 Inline markup end strings and the Unicode PO character set (rule 7 breakage).
Edward d'Auvergne
2016-12-30 12:10:59 UTC
Permalink
---

** [bugs:#305] Inline markup end strings and the Unicode PO character set (rule 7 breakage).**

**Status:** open
**Group:** Default
**Created:** Fri Dec 30, 2016 12:10 PM UTC by Edward d'Auvergne
**Last Updated:** Fri Dec 30, 2016 12:10 PM UTC
**Owner:** nobody


The following text reveals a bug in docutils::

``name``¶

The pilcrow sign belongs to the [Unicode Po class](http://www.fileformat.info/info/unicode/category/Po/list.htm). According to Inline markup recognition rule 7, this is an allowed character after the inline markup end string. However, according to rstcheck and the other reST converters, this is broken::

test.rst:1: (WARNING/2) Inline literal start-string without end-string.
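The category membership can be checked directly against Python 3's own Unicode database:

```python
import unicodedata

# U+00B6 PILCROW SIGN is "Po" (Punctuation, other) in the Unicode
# data shipped with Python 3, i.e. the class that inline markup
# recognition rule 7 allows after an inline markup end string.
print(unicodedata.name("¶"))       # -> PILCROW SIGN
print(unicodedata.category("¶"))   # -> Po
```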


---

Sent from sourceforge.net because docutils-***@lists.sourceforge.net is subscribed to https://sourceforge.net/p/docutils/bugs/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/docutils/admin/bugs/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
Edward d'Auvergne
2016-12-30 12:11:46 UTC
Permalink
Let me try that text formatting again:

~~~
``name``¶
~~~


Günter Milde
2016-12-30 21:01:37 UTC
Permalink
Thank you for reporting the problem.

The reason for the current behaviour is an outdated Unicode database in Python 2.7's "unicodedata" module:

Docutils uses the "unicodedata" Python module to generate the lists of characters allowed around inline markup in utils/punctuation_chars.py.

The character class of the pilcrow sign ¶ changed:
~~~~~
#> python3
Python 3.5.2+ (default, Nov 22 2016, 01:00:20)
>>> import unicodedata
>>> unicodedata.category('¶')
'Po'
#> python
Python 2.7.12+ (default, Nov 22 2016, 00:48:54)
>>> import unicodedata
>>> unicodedata.category(u'¶')
'So'
~~~~~
Python 2.7 uses UnicodeData.txt version 5.2.0
(ftp://ftp.unicode.org/Public/5.2.0/ucd/UnicodeData.txt). This file says:
~~~~~
00B6;PILCROW SIGN;So;0;ON;;;;;N;PARAGRAPH SIGN;;;;
~~~~~
while the current version of UnicodeData.txt (9.0.0) says:
~~~~~
00B6;PILCROW SIGN;Po;0;ON;;;;;N;PARAGRAPH SIGN;;;;
~~~~~
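How such a category change propagates into the generated lists can be sketched in a few lines; this is only an illustration of the approach, not the actual generator code in punctuation_chars.py:

```python
import unicodedata

def build_char_classes(limit=0x3000):
    """Group code points by Unicode general category, roughly as the
    punctuation_chars.py generator does (illustrative sketch only)."""
    classes = {"openers": "", "closers": "", "delimiters": ""}
    for code in range(limit):
        char = chr(code)
        category = unicodedata.category(char)
        if category in ("Ps", "Pi"):    # open and initial-quote punctuation
            classes["openers"] += char
        elif category in ("Pe", "Pf"):  # close and final-quote punctuation
            classes["closers"] += char
        elif category in ("Pd", "Po"):  # dash and other punctuation
            classes["delimiters"] += char
    return classes

# With Python 3's database the pilcrow is "Po", so it lands among the
# delimiters allowed around inline markup; with Python 2.7's database
# ("So") it is silently excluded.
print("¶" in build_char_classes()["delimiters"])   # -> True under Python 3
```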





Guenter Milde
2016-12-31 09:54:49 UTC
Permalink
Post by Günter Milde
``name``¶
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
The reason for the current behaviour is an outdated database in the
Docutils uses the "unicodedata" Python module to generate the lists of
characters allowed around inline markup in utils/punctuation_chars.py.
The character class of the pilcrow sign ¶ changed,
some other characters changed, too. Testing with Py3 says::


~~~~~
- openers = ur"""b'"\'(<\\[{\xe0\xbc\xba\xe0\xbc\xbc\xe1\x9a\x9b\xe2\x81\x85\xe2\x81\xbd\xe2\x82\x8d\xe2\x8c\xa9\xe2\x9d\xa8\xe2\x9d\xaa\xe2\x9d\xac\xe2\x9d\xae\xe2\x9d\xb0\xe2\x9d\xb2\xe2\x9d\xb4\xe2\x9f\x85\xe2\x9f\xa6\xe2\x9f\xa8\xe2\x9f\xaa\xe2\x9f\xac\xe2\x9f\xae\xe2\xa6\x83\xe2\xa6\x85\xe2\xa6\x87\xe2\xa6\x89\xe2\xa6\x8b\xe2\xa6\x8d\xe2\xa6\x8f\xe2\xa6\x91\xe2\xa6\x93\xe2\xa6\x95\xe2\xa6\x97\xe2\xa7\x98\xe2\xa7\x9a\xe2\xa7\xbc\xe2\xb8\xa2\xe2\xb8\xa4\xe2\xb8\xa6\xe2\xb8\xa8\xe3\x80\x88\xe3\x80\x8a\xe3\x80\x8c\xe3\x80\x8e\xe3\x80\x90\xe3\x80\x94\xe3\x80\x96\xe3\x80\x98\xe3\x80\x9a\xe3\x80\x9d\xe3\x80\x9d\xef\xb4\xbe\xef\xb8\x97\xef\xb8\xb5\xef\xb8\xb7\xef\xb8\xb9\xef\xb8\xbb\xef\xb8\xbd\xef\xb8\xbf\xef\xb9\x81\xef\xb9\x83\xef\xb9\x87\xef\xb9\x99\xef\xb9\x9b\xef\xb9\x9d\xef\xbc\x88\xef\xbc\xbb\xef\xbd\x9b\xef\xbd\x9f\xef\xbd\xa2\xc2\xab\xe2\x80\x98\xe2\x80\x9c\xe2\x80\xb9\xe2\xb8\x82\xe2\xb8\x84\xe2\xb8\x89\xe2\xb8\x8c\xe2\xb8\x9c\xe2\xb8\xa0\xe2\x80\x9a\xe2\x80\x9e\xc2\xbb\xe2\x80\x99\xe2\x80\x9d\xe2\x80\xba\xe2\xb8\x83\xe2\xb8\x85\xe2\xb8\x8a\xe2\xb8\x8d\xe2\xb8\x9d\xe2\xb8\xa1\xe2\x80\x9b\xe2\x80\x9f'"""
+ openers = ur"""b'"\'(<\\[{\xe0\xbc\xba\xe0\xbc\xbc\xe1\x9a\x9b\xe2\x81\x85\xe2\x81\xbd\xe2\x82\x8d\xe2\x8c\x88\xe2\x8c\x8a\xe2\x8c\xa9\xe2\x9d\xa8\xe2\x9d\xaa\xe2\x9d\xac\xe2\x9d\xae\xe2\x9d\xb0\xe2\x9d\xb2\xe2\x9d\xb4\xe2\x9f\x85\xe2\x9f\xa6\xe2\x9f\xa8\xe2\x9f\xaa\xe2\x9f\xac\xe2\x9f\xae\xe2\xa6\x83\xe2\xa6\x85\xe2\xa6\x87\xe2\xa6\x89\xe2\xa6\x8b\xe2\xa6\x8d\xe2\xa6\x8f\xe2\xa6\x91\xe2\xa6\x93\xe2\xa6\x95\xe2\xa6\x97\xe2\xa7\x98\xe2\xa7\x9a\xe2\xa7\xbc\xe2\xb8\xa2\xe2\xb8\xa4\xe2\xb8\xa6\xe2\xb8\xa8\xe2\xb9\x82\xe3\x80\x88\xe3\x80\x8a\xe3\x80\x8c\xe3\x80\x8e\xe3\x80\x90\xe3\x80\x94\xe3\x80\x96\xe3\x80\x98\xe3\x80\x9a\xe3\x80\x9d\xe3\x80\x9d\xef\xb4\xbf\xef\xb8\x97\xef\xb8\xb5\xef\xb8\xb7\xef\xb8\xb9\xef\xb8\xbb\xef\xb8\xbd\xef\xb8\xbf\xef\xb9\x81\xef\xb9\x83\xef\xb9\x87\xef\xb9\x99\xef\xb9\x9b\xef\xb9\x9d\xef\xbc\x88\xef\xbc\xbb\xef\xbd\x9b\xef\xbd\x9f\xef\xbd\xa2\xc2\xab\xe2\x80\x98\xe2\x80\x9c\xe2\x80\xb9\xe2\xb8\x82\xe2\xb8\x84\xe2\xb8\x89\xe2\xb8\x8c\xe2\xb8\x9c\xe2\xb8\xa0\xe2\x80\x9a\xe2\x80\x9e\xc2\xbb\xe2\x80\x99\xe2\x80\x9d\xe2\x80\xba\xe2\xb8\x83\xe2\xb8\x85\xe2\xb8\x8a\xe2\xb8\x8d\xe2\xb8\x9d\xe2\xb8\xa1\xe2\x80\x9b\xe2\x80\x9f'"""
- closers = ur"""b'"\')>\\]}\xe0\xbc\xbb\xe0\xbc\xbd\xe1\x9a\x9c\xe2\x81\x86\xe2\x81\xbe\xe2\x82\x8e\xe2\x8c\xaa\xe2\x9d\xa9\xe2\x9d\xab\xe2\x9d\xad\xe2\x9d\xaf\xe2\x9d\xb1\xe2\x9d\xb3\xe2\x9d\xb5\xe2\x9f\x86\xe2\x9f\xa7\xe2\x9f\xa9\xe2\x9f\xab\xe2\x9f\xad\xe2\x9f\xaf\xe2\xa6\x84\xe2\xa6\x86\xe2\xa6\x88\xe2\xa6\x8a\xe2\xa6\x8c\xe2\xa6\x8e\xe2\xa6\x90\xe2\xa6\x92\xe2\xa6\x94\xe2\xa6\x96\xe2\xa6\x98\xe2\xa7\x99\xe2\xa7\x9b\xe2\xa7\xbd\xe2\xb8\xa3\xe2\xb8\xa5\xe2\xb8\xa7\xe2\xb8\xa9\xe3\x80\x89\xe3\x80\x8b\xe3\x80\x8d\xe3\x80\x8f\xe3\x80\x91\xe3\x80\x95\xe3\x80\x97\xe3\x80\x99\xe3\x80\x9b\xe3\x80\x9e\xe3\x80\x9f\xef\xb4\xbf\xef\xb8\x98\xef\xb8\xb6\xef\xb8\xb8\xef\xb8\xba\xef\xb8\xbc\xef\xb8\xbe\xef\xb9\x80\xef\xb9\x82\xef\xb9\x84\xef\xb9\x88\xef\xb9\x9a\xef\xb9\x9c\xef\xb9\x9e\xef\xbc\x89\xef\xbc\xbd\xef\xbd\x9d\xef\xbd\xa0\xef\xbd\xa3\xc2\xbb\xe2\x80\x99\xe2\x80\x9d\xe2\x80\xba\xe2\xb8\x83\xe2\xb8\x85\xe2\xb8\x8a\xe2\xb8\x8d\xe2\xb8\x9d\xe2\xb8\xa1\xe2\x80\x9b\xe2\x80\x9f\xc2\xab\xe2\x80\x98\xe2\x80\x9c\xe2\x80\xb9\xe2\xb8\x82\xe2\xb8\x84\xe2\xb8\x89\xe2\xb8\x8c\xe2\xb8\x9c\xe2\xb8\xa0\xe2\x80\x9a\xe2\x80\x9e'"""
+ closers = ur"""b'"\')>\\]}\xe0\xbc\xbb\xe0\xbc\xbd\xe1\x9a\x9c\xe2\x81\x86\xe2\x81\xbe\xe2\x82\x8e\xe2\x8c\x89\xe2\x8c\x8b\xe2\x8c\xaa\xe2\x9d\xa9\xe2\x9d\xab\xe2\x9d\xad\xe2\x9d\xaf\xe2\x9d\xb1\xe2\x9d\xb3\xe2\x9d\xb5\xe2\x9f\x86\xe2\x9f\xa7\xe2\x9f\xa9\xe2\x9f\xab\xe2\x9f\xad\xe2\x9f\xaf\xe2\xa6\x84\xe2\xa6\x86\xe2\xa6\x88\xe2\xa6\x8a\xe2\xa6\x8c\xe2\xa6\x8e\xe2\xa6\x90\xe2\xa6\x92\xe2\xa6\x94\xe2\xa6\x96\xe2\xa6\x98\xe2\xa7\x99\xe2\xa7\x9b\xe2\xa7\xbd\xe2\xb8\xa3\xe2\xb8\xa5\xe2\xb8\xa7\xe2\xb8\xa9\xe3\x80\x89\xe3\x80\x8b\xe3\x80\x8d\xe3\x80\x8f\xe3\x80\x91\xe3\x80\x95\xe3\x80\x97\xe3\x80\x99\xe3\x80\x9b\xe3\x80\x9e\xe3\x80\x9f\xef\xb4\xbe\xef\xb8\x98\xef\xb8\xb6\xef\xb8\xb8\xef\xb8\xba\xef\xb8\xbc\xef\xb8\xbe\xef\xb9\x80\xef\xb9\x82\xef\xb9\x84\xef\xb9\x88\xef\xb9\x9a\xef\xb9\x9c\xef\xb9\x9e\xef\xbc\x89\xef\xbc\xbd\xef\xbd\x9d\xef\xbd\xa0\xef\xbd\xa3\xc2\xbb\xe2\x80\x99\xe2\x80\x9d\xe2\x80\xba\xe2\xb8\x83\xe2\xb8\x85\xe2\xb8\x8a\xe2\xb8\x8d\xe2\xb8\x9d\xe2\xb8\xa1\xe2\x80\x9b\xe2\x80\x9f\xc2\xab\xe2\x80\x98\xe2\x80\x9c\xe2\x80\xb9\xe2\xb8\x82\xe2\xb8\x84\xe2\xb8\x89\xe2\xb8\x8c\xe2\xb8\x9c\xe2\xb8\xa0\xe2\x80\x9a\xe2\x80\x9e'"""
- delimiters = ur"b'\\-/:\xd6\x8a\xc2\xa1\xc2\xb7\xc2\xbf\xcd\xbe\xce\x87\xd5\x9a-\xd5\x9f\xd6\x89\xd6\xbe\xd7\x80\xd7\x83\xd7\x86\xd7\xb3\xd7\xb4\xd8\x89\xd8\x8a\xd8\x8c\xd8\x8d\xd8\x9b\xd8\x9e\xd8\x9f\xd9\xaa-\xd9\xad\xdb\x94\xdc\x80-\xdc\x8d\xdf\xb7-\xdf\xb9\xe0\xa0\xb0-\xe0\xa0\xbe\xe0\xa5\xa4\xe0\xa5\xa5\xe0\xa5\xb0\xe0\xb7\xb4\xe0\xb9\x8f\xe0\xb9\x9a\xe0\xb9\x9b\xe0\xbc\x84-\xe0\xbc\x92\xe0\xbe\x85\xe0\xbf\x90-\xe0\xbf\x94\xe1\x81\x8a-\xe1\x81\x8f\xe1\x83\xbb\xe1\x8d\xa1-\xe1\x8d\xa8\xe1\x90\x80\xe1\x99\xad\xe1\x99\xae\xe1\x9b\xab-\xe1\x9b\xad\xe1\x9c\xb5\xe1\x9c\xb6\xe1\x9f\x94-\xe1\x9f\x96\xe1\x9f\x98-\xe1\x9f\x9a\xe1\xa0\x80-\xe1\xa0\x8a\xe1\xa5\x84\xe1\xa5\x85\xe1\xa7\x9e\xe1\xa7\x9f\xe1\xa8\x9e\xe1\xa8\x9f\xe1\xaa\xa0-\xe1\xaa\xa6\xe1\xaa\xa8-\xe1\xaa\xad\xe1\xad\x9a-\xe1\xad\xa0\xe1\xb0\xbb-\xe1\xb0\xbf\xe1\xb1\xbe\xe1\xb1\xbf\xe1\xb3\x93\xe2\x80\x90-\xe2\x80\x97\xe2\x80\xa0-\xe2\x80\xa7\xe2\x80\xb0-\xe2\x80\xb8\xe2\x80\xbb-\xe2\x80\xbe\xe2\x81\x81-\xe2\x81\x83\xe2\x81\x87-\xe2\x81\x91\xe2\x81\x93\xe2\x81\x95-\xe2\x81\x9e\xe2\xb3\xb9-\xe2\xb3\xbc\xe2\xb3\xbe\xe2\xb3\xbf\xe2\xb8\x80\xe2\xb8\x81\xe2\xb8\x86-\xe2\xb8\x88\xe2\xb8\x8b\xe2\xb8\x8e-\xe2\xb8\x9b\xe2\xb8\x9e\xe2\xb8\x9f\xe2\xb8\xaa-\xe2\xb8\xae\xe2\xb8\xb0\xe2\xb8\xb1\xe3\x80\x81-\xe3\x80\x83\xe3\x80\x9c\xe3\x80\xb0\xe3\x80\xbd\xe3\x82\xa0\xe3\x83\xbb\xea\x93\xbe\xea\x93\xbf\xea\x98\x8d-\xea\x98\x8f\xea\x99\xb3\xea\x99\xbe\xea\x9b\xb2-\xea\x9b\xb7\xea\xa1\xb4-\xea\xa1\xb7\xea\xa3\x8e\xea\xa3\x8f\xea\xa3\xb8-\xea\xa3\xba\xea\xa4\xae\xea\xa4\xaf\xea\xa5\x9f\xea\xa7\x81-\xea\xa7\x8d\xea\xa7\x9e\xea\xa7\x9f\xea\xa9\x9c-\xea\xa9\x9f\xea\xab\x9e\xea\xab\x9f\xea\xaf\xab\xef\xb8\x90-\xef\xb8\x96\xef\xb8\x99\xef\xb8\xb0-\xef\xb8\xb2\xef\xb9\x85\xef\xb9\x86\xef\xb9\x89-\xef\xb9\x8c\xef\xb9\x90-\xef\xb9\x92\xef\xb9\x94-\xef\xb9\x98\xef\xb9\x9f-\xef\xb9\xa1\xef\xb9\xa3\xef\xb9\xa8\xef\xb9\xaa\xef\xb9\xab\xef\xbc\x81-\xef\xbc\x83\xef\xbc\x85-\xef\xbc\x87\xef\xbc\x8a\xef\xbc\x8c-\xef\xbc\x8f\xef\xbc\x9a\xef\xbc\x9b\xef\xbc\x9f\xef\xbc\xa0\xef\xbc\xbc\xef\xbd\xa1\xef\xbd\xa4\xef\xbd\xa5\xf0\x90\x84\x80\xf0\x90\x84\x81\xf0\x90\x8e\x9f\xf0\x90\x8f\x90\xf0\x90\xa1\x97\xf0\x90\xa4\x9f\xf0\x90\xa4\xbf\xf0\x90\xa9\x90-\xf0\x90\xa9\x98\xf0\x90\xa9\xbf\xf0\x90\xac\xb9-\xf0\x90\xac\xbf\xf0\x91\x82\xbb\xf0\x91\x82\xbc\xf0\x91\x82\xbe-\xf0\x91\x83\x81\xf0\x92\x91\xb0-\xf0\x92\x91\xb3'"
+ delimiters = ur"b'\\-/:\xd6\x8a\xc2\xa1\xc2\xa7\xc2\xb6\xc2\xb7\xc2\xbf\xcd\xbe\xce\x87\xd5\x9a-\xd5\x9f\xd6\x89\xd6\xbe\xd7\x80\xd7\x83\xd7\x86\xd7\xb3\xd7\xb4\xd8\x89\xd8\x8a\xd8\x8c\xd8\x8d\xd8\x9b\xd8\x9e\xd8\x9f\xd9\xaa-\xd9\xad\xdb\x94\xdc\x80-\xdc\x8d\xdf\xb7-\xdf\xb9\xe0\xa0\xb0-\xe0\xa0\xbe\xe0\xa1\x9e\xe0\xa5\xa4\xe0\xa5\xa5\xe0\xa5\xb0\xe0\xab\xb0\xe0\xb7\xb4\xe0\xb9\x8f\xe0\xb9\x9a\xe0\xb9\x9b\xe0\xbc\x84-\xe0\xbc\x92\xe0\xbc\x94\xe0\xbe\x85\xe0\xbf\x90-\xe0\xbf\x94\xe0\xbf\x99\xe0\xbf\x9a\xe1\x81\x8a-\xe1\x81\x8f\xe1\x83\xbb\xe1\x8d\xa0-\xe1\x8d\xa8\xe1\x90\x80\xe1\x99\xad\xe1\x99\xae\xe1\x9b\xab-\xe1\x9b\xad\xe1\x9c\xb5\xe1\x9c\xb6\xe1\x9f\x94-\xe1\x9f\x96\xe1\x9f\x98-\xe1\x9f\x9a\xe1\xa0\x80-\xe1\xa0\x8a\xe1\xa5\x84\xe1\xa5\x85\xe1\xa8\x9e\xe1\xa8\x9f\xe1\xaa\xa0-\xe1\xaa\xa6\xe1\xaa\xa8-\xe1\xaa\xad\xe1\xad\x9a-\xe1\xad\xa0\xe1\xaf\xbc-\xe1\xaf\xbf\xe1\xb0\xbb-\xe1\xb0\xbf\xe1\xb1\xbe\xe1\xb1\xbf\xe1\xb3\x80-\xe1\xb3\x87\xe1\xb3\x93\xe2\x80\x90-\xe2\x80\x97\xe2\x80\xa0-\xe2\x80\xa7\xe2\x80\xb0-\xe2\x80\xb8\xe2\x80\xbb-\xe2\x80\xbe\xe2\x81\x81-\xe2\x81\x83\xe2\x81\x87-\xe2\x81\x91\xe2\x81\x93\xe2\x81\x95-\xe2\x81\x9e\xe2\xb3\xb9-\xe2\xb3\xbc\xe2\xb3\xbe\xe2\xb3\xbf\xe2\xb5\xb0\xe2\xb8\x80\xe2\xb8\x81\xe2\xb8\x86-\xe2\xb8\x88\xe2\xb8\x8b\xe2\xb8\x8e-\xe2\xb8\x9b\xe2\xb8\x9e\xe2\xb8\x9f\xe2\xb8\xaa-\xe2\xb8\xae\xe2\xb8\xb0-\xe2\xb9\x81\xe3\x80\x81-\xe3\x80\x83\xe3\x80\x9c\xe3\x80\xb0\xe3\x80\xbd\xe3\x82\xa0\xe3\x83\xbb\xea\x93\xbe\xea\x93\xbf\xea\x98\x8d-\xea\x98\x8f\xea\x99\xb3\xea\x99\xbe\xea\x9b\xb2-\xea\x9b\xb7\xea\xa1\xb4-\xea\xa1\xb7\xea\xa3\x8e\xea\xa3\x8f\xea\xa3\xb8-\xea\xa3\xba\xea\xa3\xbc\xea\xa4\xae\xea\xa4\xaf\xea\xa5\x9f\xea\xa7\x81-\xea\xa7\x8d\xea\xa7\x9e\xea\xa7\x9f\xea\xa9\x9c-\xea\xa9\x9f\xea\xab\x9e\xea\xab\x9f\xea\xab\xb0\xea\xab\xb1\xea\xaf\xab\xef\xb8\x90-\xef\xb8\x96\xef\xb8\x99\xef\xb8\xb0-\xef\xb8\xb2\xef\xb9\x85\xef\xb9\x86\xef\xb9\x89-\xef\xb9\x8c\xef\xb9\x90-\xef\xb9\x92\xef\xb9\x94-\xef\xb9\x98\xef\xb9\x9f-\xef\xb9\xa1\xef\xb9\xa3\xef\xb9\xa8\xef\xb9\xaa\xef\xb9\xab\xef\xbc\x81-\xef\xbc\x83\xef\xbc\x85-\xef\xbc\x87\xef\xbc\x8a\xef\xbc\x8c-\xef\xbc\x8f\xef\xbc\x9a\xef\xbc\x9b\xef\xbc\x9f\xef\xbc\xa0\xef\xbc\xbc\xef\xbd\xa1\xef\xbd\xa4\xef\xbd\xa5\xf0\x90\x84\x80-\xf0\x90\x84\x82\xf0\x90\x8e\x9f\xf0\x90\x8f\x90\xf0\x90\x95\xaf\xf0\x90\xa1\x97\xf0\x90\xa4\x9f\xf0\x90\xa4\xbf\xf0\x90\xa9\x90-\xf0\x90\xa9\x98\xf0\x90\xa9\xbf\xf0\x90\xab\xb0-\xf0\x90\xab\xb6\xf0\x90\xac\xb9-\xf0\x90\xac\xbf\xf0\x90\xae\x99-\xf0\x90\xae\x9c\xf0\x91\x81\x87-\xf0\x91\x81\x8d\xf0\x91\x82\xbb\xf0\x91\x82\xbc\xf0\x91\x82\xbe-\xf0\x91\x83\x81\xf0\x91\x85\x80-\xf0\x91\x85\x83\xf0\x91\x85\xb4\xf0\x91\x85\xb5\xf0\x91\x87\x85-\xf0\x91\x87\x89\xf0\x91\x87\x8d\xf0\x91\x87\x9b\xf0\x91\x87\x9d-\xf0\x91\x87\x9f\xf0\x91\x88\xb8-\xf0\x91\x88\xbd\xf0\x91\x8a\xa9\xf0\x91\x93\x86\xf0\x91\x97\x81-\xf0\x91\x97\x97\xf0\x91\x99\x81-\xf0\x91\x99\x83\xf0\x91\x9c\xbc-\xf0\x91\x9c\xbe\xf0\x92\x91\xb0-\xf0\x92\x91\xb4\xf0\x96\xa9\xae\xf0\x96\xa9\xaf\xf0\x96\xab\xb5\xf0\x96\xac\xb7-\xf0\x96\xac\xbb\xf0\x96\xad\x84\xf0\x9b\xb2\x9f\xf0\x9d\xaa\x87-\xf0\x9d\xaa\x8b'"
~~~~~

and generates the updated code::

~~~~~
openers = ('"\'(<\\[{༺༼᚛⁅⁽₍⌈⌊〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨⹂〈《「『【〔〖〘〚〝〝﴿︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「«‘“‹⸂⸄⸉⸌⸜⸠‚„»’”›⸃⸅⸊⸍⸝⸡‛‟')
closers = ('"\')>\\]}༻༽᚜⁆⁾₎⌉⌋〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴾︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」»’”›⸃⸅⸊⸍⸝⸡‛‟«‘“‹⸂⸄⸉⸌⸜⸠‚„')
delimiters = ('\\-/:֊¡§¶·¿;·՚-՟։־׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹࠰-࠾࡞।॥॰૰෴๏๚๛༄-༒༔྅࿐-࿔࿙࿚၊-၏჻፠-፨᐀᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠊᥄᥅᨞᨟᪠-᪦᪨-᪭᭚-᭠᯼-᯿᰻-᰿᱾᱿᳀-᳇᳓‐-‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⵰⸀⸁⸆-⸈⸋⸎-⸛⸞⸟⸪-⸮⸰-⹁、-〃〜〰〽゠・꓾꓿꘍-꘏꙳꙾꛲-꛷꡴-꡷꣎꣏꣸-꣺꣼꤮꤯꥟꧁-꧍꧞꧟꩜-꩟꫞꫟꫰꫱꯫︐-︖︙︰-︲﹅﹆﹉-﹌﹐-﹒﹔-﹘﹟-﹡﹣﹨﹪﹫!-#%-'*,-/:;?@\。、・𐄀-𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐-𐩘𐩿𐫰-𐫶𐬹-𐬿𐮙-𐮜𑁇-𑁍𑂻𑂼𑂾-𑃁𑅀-𑅃𑅴𑅵𑇅-𑇉𑇍𑇛𑇝-𑇟𑈸-𑈽𑊩𑓆𑗁-𑗗𑙁-𑙃𑜼-𑜾𒑰-𒑴𖩮𖩯𖫵𖬷-𖬻𖭄𛲟𝪇-𝪋')
if sys.maxunicode >= 0x10FFFF: # "wide" build
    delimiters += ('𐄀-𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐-𐩘𐩿𐫰-𐫶𐬹-𐬿𐮙-𐮜𑁇-𑁍𑂻𑂼𑂾-𑃁𑅀-𑅃𑅴𑅵𑇅-𑇉𑇍𑇛𑇝-𑇟𑈸-𑈽𑊩𑓆𑗁-𑗗𑙁-𑙃𑜼-𑜾𒑰-𒑴𖩮𖩯𖫵𖬷-𖬻𖭄𛲟𝪇-𝪋')
closing_delimiters = '\\\\.,;!?'
~~~~~
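A side note on the ``sys.maxunicode`` guard in the generated code: since Python 3.3 (PEP 393) every interpreter is effectively a "wide" build, so that branch is always taken there; only Python 2 narrow builds (``sys.maxunicode == 0xFFFF``) skip the astral-plane delimiters:

```python
import sys

# PEP 393 gave Python 3.3+ a flexible string representation; every
# build can address the full range of code points up to U+10FFFF.
print(hex(sys.maxunicode))   # -> 0x10ffff on Python 3.3 and later
```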

The current code in docutils/utils/punctuation_chars.py uses the \u...
notation:

~~~~~
openers = (u'"\'(<\\[{\u0f3a\u0f3c\u169b\u2045\u207d\u208d\u2329\u2768'
           u'\u276a\u276c\u276e\u2770\u2772\u2774\u27c5\u27e6\u27e8\u27ea'
           u'\u27ec\u27ee\u2983\u2985\u2987\u2989\u298b\u298d\u298f\u2991'
           u'\u2993\u2995\u2997\u29d8\u29da\u29fc\u2e22\u2e24\u2e26\u2e28'
           u'\u3008\u300a\u300c\u300e\u3010\u3014\u3016\u3018\u301a\u301d'
           u'\u301d\ufd3e\ufe17\ufe35\ufe37\ufe39\ufe3b\ufe3d\ufe3f\ufe41'
           u'\ufe43\ufe47\ufe59\ufe5b\ufe5d\uff08\uff3b\uff5b\uff5f\uff62'
           u'\xab\u2018\u201c\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c\u2e20'
           u'\u201a\u201e\xbb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d'
           u'\u2e1d\u2e21\u201b\u201f')
...
~~~~~

Under Py3 the auto-converted code uses literal Unicode characters.
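At the string level the two notations are interchangeable; the choice only affects how readable the source file is. A minimal demonstration:

```python
# \u escapes and literal characters denote identical strings; only
# the on-disk source representation differs.
escaped = u"\xab\u2018\u201c\u2039"   # the quote-opener run, as escapes
literal = u"«‘“‹"                     # the same characters, typed literally
print(escaped == literal)   # -> True
```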

How should we proceed?

a) close the bug as invalid and document that we are using
   Unicode 5.2 for the inline markup rules;

b) use the Py3-generated strings with
   ``# -*- coding: utf-8 -*-`` and a prepended ``u``;

c) leave this open until someone updates either Py2.7's "unicodedata" or
   the code-generating part of punctuation_chars.py, so that the original
   format can be retained?


Günter


_______________________________________________
Docutils-develop mailing list
Docutils-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/docutils-develop
Edward d'Auvergne
2016-12-31 11:35:07 UTC
Permalink
Post by Guenter Milde
Post by Günter Milde
``name``¶
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
The reason for the current behaviour is an outdated database in the
Docutils uses the "unicodedata" Python module to generate the lists of
characters allowed around inline markup in utils/punctuation_chars.py.
The character class of the pilcrow sign ¶ changed,
[snip]
Post by Guenter Milde
Under Py3 the auto-converted code uses literal Unicode characters.
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
From a complete outsider's perspective, it would make sense to split the
question. For the reStructuredText standard, it would be logical to pin
it to the latest Unicode standard (rather than tie it to whatever Unicode
version the Python installation underlying the docutils implementation
happens to provide). For the docutils implementation, it would make a lot
of sense to implement a basic module, possibly auto-generated by a script,
that works around deficiencies in the local Python installation's Unicode
support, so that docutils performs identically on all systems.

Regards,

Edward

Guenter Milde
2016-12-31 14:12:48 UTC
Permalink
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
...
Post by Edward d'Auvergne
Post by Guenter Milde
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
From a complete outsider's perspective, to me it would make sense to
Agreed.
Post by Edward d'Auvergne
For the reStructuredText standard, it would be
logical to pin it to the latest Unicode standard (rather than be tied
down based on the docutils implementation being restricted by the
underlying Python install).
However, rather than specifying "the latest", we should state a specific
version in the docs and update it together with the actual code.
Why? Because we don't track the Unicode standard's development and
auto-update our patterns.
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings
containing the characters of the relevant classes in punctuation_chars.py.
These do not depend on the Python installation (but they were generated
with Python 2.x and hence match Unicode 5.2.1).

My question b/c was a Docutils-internal one: there is a style guide saying
that literal Unicode should be avoided (except in language files), dating
from times when Unicode support in editors, fonts, etc. was patchy.

Günter


Edward d'Auvergne
2017-01-02 10:03:22 UTC
Permalink
Post by Guenter Milde
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
...
Post by Edward d'Auvergne
Post by Guenter Milde
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
[snip]
Post by Guenter Milde
Post by Edward d'Auvergne
For the reStructuredText standard, it would be
logical to pin it to the latest Unicode standard (rather than be tied
down based on the docutils implementation being restricted by the
underlying Python install).
However, rather than specifying "the latest", we should state a specific
version in the docs and update it together with the actual code.
Why? We don't track the Unicode standard development and auto-update our
patterns.
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification. Why should alternative implementations be bound to the
internal details of the docutils implementation?
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
version number within the reStructuredText specification, based on the
current status of the docutils code, this affects the external aspects
in a significant way. My bug report for the failure of ``name``¶ -
a construct used extensively in the Python HTML docs (see the permalinks
shown when hovering over text at
https://docs.python.org/3/library/codecs.html, for example) - is solely
about docutils, as the reST specification does not mandate a Unicode
version, and hence the latest version would be a reasonable assumption.

Regards,

Edward

David Goodger
2017-01-02 19:26:31 UTC
Permalink
Post by Edward d'Auvergne
Post by Guenter Milde
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
...
Post by Edward d'Auvergne
Post by Guenter Milde
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
[snip]
Post by Guenter Milde
Post by Edward d'Auvergne
For the reStructuredText standard, it would be
logical to pin it to the latest Unicode standard (rather than be tied
down based on the docutils implementation being restricted by the
underlying Python install).
However, rather than specifying "the latest", we should state a specific
version in the docs and update it together with the actual code.
Why? We don't track the Unicode standard development and auto-update our
patterns.
For the implementation, we should definitely state a specific Unicode
version. docutils/docutils/utils/punctuation_chars.py could be updated
using the very latest Python (e.g. 3.6.0 as of this writing) with
whatever version of Unicode it supports.

For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
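Regenerating punctuation_chars.py with a current interpreter could also be sanity-checked against that interpreter's own database; a sketch of such a check (the character sample below is illustrative, not the full generated string):

```python
import unicodedata

# Every character in the generated "delimiters" string should fall in
# the dash (Pd) or other-punctuation (Po) categories under the running
# interpreter's Unicode database.
sample_delimiters = "-/:¡·¿¶"        # small excerpt for illustration
for char in sample_delimiters:
    category = unicodedata.category(char)
    assert category in ("Pd", "Po"), (char, category)

# The Unicode version this Python ships with:
print(unicodedata.unidata_version)
```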
Post by Edward d'Auvergne
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification.
Neither Docutils nor reStructuredText has such lofty goals. It is not
our purpose to create an independent specification for clean-room
implementation. We have no aspirations to a "reStructuredText
standard" à la ISO.

Your perspective, while valid for you, is not necessarily shared by
the developers of Docutils.

On the other hand, if you do choose to develop a clean-room
implementation of the reStructuredText parser, you are free to
interpret the specification as you like, and to fill in any gaps as
you see fit.
Post by Edward d'Auvergne
Why should alternative implementations be bound to the
internal details of the docutils implementation?
Why should we (the developers of Docutils) care about alternative
implementations?

OK, that's a bit callous. Reworded: why should we care so much that
we'd go to great lengths to accommodate such hypothetical alternative
implementations? Answer: we shouldn't, and we don't.
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
specification number within the reStructuredText specification based
on the current status of the docutils code, this affects the external
aspects in a significant way. My bug report for the failure of
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example)
I believe the pilcrows (¶) are added by Sphinx to the output HTML.
They are not present in the reST sources.

If you want "``name``¶" in your source reST, just insert an escaped
space: ``name``\ ¶.
Post by Edward d'Auvergne
- is solely
for docutils, as the reST specification does not mandate a Unicode
version and hence the latest version would be a reasonable assumption.
The reST specification is a best effort to document the intentions of
the Docutils developers, and Docutils is a best effort to implement
the reST specification. They are interdependent. From
docutils/docs/ref/rst/restructuredtext.txt (the spec in question):
"The reStructuredText parser is a component of Docutils_."

I agree that it would be ideal to base our implementation on the
latest Unicode spec. However, we must temper that with practicality.
docutils/docutils/utils/punctuation_chars.py could certainly be
updated with the latest Unicode spec as implemented in a current
version of Python. We use Python's internal implementation to handle
many aspects of Unicode processing. In order to use the latest Unicode
standard in all contexts (e.g. in the running code), we'd have to
reimplement a lot more of what Python provides, and that's just not
worth the effort.

David Goodger
<http://python.net/~goodger>
Edward d'Auvergne
2017-01-02 22:52:43 UTC
Permalink
Post by David Goodger
Post by Edward d'Auvergne
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification.
Neither Docutils nor reStructuredText has such lofty goals. It is not
our purpose to create an independent specification for clean-room
implementation. We have no aspirations to a "reStructuredText
standard" à la ISO.
Your perspective, while valid for you, is not necessarily shared by
the developers of Docutils.
On the other hand, if you do choose to develop a clean-room
implementation of the reStructuredText parser, you are free to
interpret the specification as you like, and to fill in any gaps as
you see fit.
The clean-room implementation was a hypothetical - that's not what I'm
working on. But I have been using the specification in isolation to
generate reStructuredText formatted documents independently of Docutils,
and from time to time using Docutils to validate the output. That way I
stumbled on the ``name``¶ issue - I generated that construct automatically
and Docutils was not happy. Another issue appears to be definition list
terms wrapped across multiple lines - the standard does not say anything
about this but Docutils does not like that. Should such issues be reported
in the bug tracker or would it be better here?
Post by David Goodger
Post by Edward d'Auvergne
Why should alternative implementations be bound to the
internal details of the docutils implementation?
Why should we (the developers of Docutils) care about alternative
implementations?
OK, that's a bit callous. Reworded: why should we care so much that
we'd go to great lengths to accommodate such hypothetical alternative
implementations? Answer: we shouldn't, and we don't.
I assumed that there was a separation of the reStructuredText standard and
the Docutils implementation - that's how the specification reads. In any
case, I'm simply reading and following the specification which to me looks
sufficient to produce reST formatted content.
Post by David Goodger
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
specification number within the reStructuredText specification based
on the current status of the docutils code, this affects the external
aspects in a significant way. My bug report for the failure of
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example)
I believe the pilcrows (¶) are added by Sphinx to the output HTML.
They are not present in the reST sources.
If you want "``name``¶" in your source reST, just insert an escaped
space: ``name``\ ¶.
I know I can do this, but the specification says that a character from the
Unicode PO class can directly follow the inline markup end string (i.e.
without the escaping mechanism). That's why I reported it.
Post by David Goodger
Post by Edward d'Auvergne
- is solely
for docutils, as the reST specification does not mandate a Unicode
version and hence the latest version would be a reasonable assumption.
The reST specification is a best effort to document the intentions of
the Docutils developers, and Docutils is a best effort to implement
the reST specification. They are interdependent. From
"The reStructuredText parser is a component of Docutils_."
I agree that it would be ideal to base our implementation on the
latest Unicode spec. However, we must temper that with practicality.
docutils/docutils/utils/punctuation_chars.py could certainly be
updated with the latest Unicode spec as implemented in a current
version of Python. We use Python's internal implementation to handle
many aspects of Unicode processing. In order to use the latest Unicode
standard in all contexts (e.g. in the running code), we'd have to
reimplement a lot more of what Python provides, and that's just not
worth the effort.
Well, an external tool could create reST formatted text based on your
specification, and if rstcheck is unhappy about that output a bug report
could be filed for improving either the specification or Docutils ;) I'm
guessing the zero-width space (u"\u200B") will do some interesting things.

Regards,

Edward
David Goodger
2017-01-02 23:23:04 UTC
Permalink
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification.
Neither Docutils nor reStructuredText have such lofty goals. It is not
our purpose to create an independent specification for clean-room
implementation. We have no aspirations to a "reStructuredText
standard" à la ISO.
Your perspective, while valid for you, is not necessarily shared by
the developers of Docutils.
On the other hand, if you do choose to develop a clean-room
implementation of the reStructuredText parser, you are free to
interpret the specification as you like, and to fill in any gaps as
you see fit.
The clean-room implementation was a hypothetical - that's not what I'm
working on. But I have been using the specification in isolation to
generate reStructuredText formatted documents independently of Docutils, and
from time to time using Docutils to validate the output. That way I
stumbled on the ``name``¶ issue - I generated that construct automatically
and Docutils was not happy. Another issue appears to be definition list
terms wrapped across multiple lines - the standard does not say anything
about this but Docutils does not like that.
"A term is a simple one-line word or phrase."
— http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#definition-lists
Post by Edward d'Auvergne
Should such issues be reported
in the bug tracker or would it be better here?
If it's a real bug, then the bug tracker is the best place.
Discussions on mailing lists fade from memory.
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
Why should alternative implementations be bound to the
internal details of the docutils implementation?
Why should we (the developers of Docutils) care about alternative
implementations?
OK, that's a bit callous. Reworded: why should we care so much that
we'd go to great lengths to accommodate such hypothetical alternative
implementations? Answer: we shouldn't, and we don't.
I assumed that there was a separation of the reStructuredText standard and
the Docutils implementation - that's how the specification reads. In any
case, I'm simply reading and following the specification which to me looks
sufficient to produce reST formatted content.
It should be. The spec was mostly written before the code. Docutils
was DDD (documentation-driven development) as well as TDD.

My point is that the spec is not, and was never intended to be, a
standard on the level of ISO or W3C or IEC.
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
specification number within the reStructuredText specification based
on the current status of the docutils code, this affects the external
aspects in a significant way. My bug report for the failure of
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example)
I believe the pilcrows (¶) are added by Sphinx to the output HTML.
They are not present in the reST sources.
If you want "``name``¶" in your source reST, just insert an escaped
space: ``name``\ ¶.
I know I can do this, but the specification says that a character from the
Unicode PO class can directly follow the inline markup end string (i.e.
without the escaping mechanism). That's why I reported it.
Fair enough. And that's true, for whatever version of Unicode we're
supporting. It would certainly be useful to document the version, even
in the code (and maybe even have a command-line option that reports
it).

The problem here is that you were expecting ¶ to be PO class based on
the latest Unicode, but Docutils was using definitions from an older
Unicode where ¶ isn't PO class. <shrug>
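The mismatch is easy to see from Python's own Unicode tables. A quick
stdlib check (assuming a Python 3 interpreter, whose unicodedata ships
Unicode 6.0 or later; Python 2 shipped 5.x tables, where the pilcrow
was still So):

```python
import unicodedata

# The Unicode version bundled with this Python (6.1+ on any Python >= 3.3).
print(unicodedata.unidata_version)

# Since Unicode 6.0 the pilcrow is Po (Punctuation, other), so rule 7
# allows it directly after an inline markup end string. Under the older
# 5.x tables it was So (Symbol, other), hence the Docutils warning.
assert unicodedata.category("\u00B6") == "Po"
```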
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
- is solely
for docutils, as the reST specification does not mandate a Unicode
version and hence the latest version would be a reasonable assumption.
The reST specification is a best effort to document the intentions of
the Docutils developers, and Docutils is a best effort to implement
the reST specification. They are interdependent. From
"The reStructuredText parser is a component of Docutils_."
I agree that it would be ideal to base our implementation on the
latest Unicode spec. However, we must temper that with practicality.
docutils/docutils/utils/punctuation_chars.py could certainly be
updated with the latest Unicode spec as implemented in a current
version of Python. We use Python's internal implementation to handle
many aspects of Unicode processing. In order to use the latest Unicode
standard in all contexts (e.g. in the running code), we'd have to
reimplement a lot more of what Python provides, and that's just not
worth the effort.
Well, an external tool could create reST formatted text based on your
specification, and if rstcheck is unhappy about that output a bug report
could be filed for improving either the specification or Docutils ;)
Ideally true.
Post by Edward d'Auvergne
I'm
guessing the zero-width space (u"\u200B") will do some interesting things.
It ought to work!

David Goodger
<http://python.net/~goodger>

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Docutils-develop mailing list
Docutils-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/docutils-develop

Edward d'Auvergne
2017-01-03 09:21:31 UTC
Permalink
Post by David Goodger
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification.
Neither Docutils nor reStructuredText have such lofty goals. It is not
our purpose to create an independent specification for clean-room
implementation. We have no aspirations to a "reStructuredText
standard" à la ISO.
Your perspective, while valid for you, is not necessarily shared by
the developers of Docutils.
On the other hand, if you do choose to develop a clean-room
implementation of the reStructuredText parser, you are free to
interpret the specification as you like, and to fill in any gaps as
you see fit.
The clean-room implementation was a hypothetical - that's not what I'm
working on. But I have been using the specification in isolation to
generate reStructuredText formatted documents independently of Docutils, and
from time to time using Docutils to validate the output. That way I
stumbled on the ``name``¶ issue - I generated that construct automatically
and Docutils was not happy. Another issue appears to be definition list
terms wrapped across multiple lines - the standard does not say anything
about this but Docutils does not like that.
"A term is a simple one-line word or phrase."
— http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#definition-lists
Sorry, I missed the "one-line" part.
Post by David Goodger
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
specification number within the reStructuredText specification based
on the current status of the docutils code, this affects the external
aspects in a significant way. My bug report for the failure of
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example)
I believe the pilcrows (¶) are added by Sphinx to the output HTML.
They are not present in the reST sources.
If you want "``name``¶" in your source reST, just insert an escaped
space: ``name``\ ¶.
I know I can do this, but the specification says that a character from the
Unicode PO class can directly follow the inline markup end string (i.e.
without the escaping mechanism). That's why I reported it.
Fair enough. And that's true, for whatever version of Unicode we're
supporting. It would certainly be useful to document the version, even
in the code (and maybe even have a command-line option that reports
it).
The problem here is that you were expecting ¶ to be PO class based on
the latest Unicode, but Docutils was using definitions from an older
Unicode where ¶ isn't PO class. <shrug>
Exactly why I created a low priority bug report for it!
Post by David Goodger
Post by Edward d'Auvergne
I'm
guessing the zero-width space (u"\u200B") will do some interesting things.
It ought to work!
Not always:

$ python3 -c "print('==\nXY\n==')" | rst2xml --strict > /dev/null
$ python3 -c "print('==\nX\u200bY\n==')"
==
XY
==
$ python3 -c "print('==\nX\u200bY\n==')" | rst2xml --strict > /dev/null
<stdin>:1: (INFO/1) Possible incomplete section title.
Treating the overline as ordinary text because it's so short.
Exiting due to level-1 (INFO) system message.
$ python3 -c "print('+----+\n| XY |\n+----+\n')" | rst2xml --strict > /dev/null
$ python3 -c "print('+----+\n| X\u200bY |\n+----+\n')"
+----+
| XY |
+----+

$ python3 -c "print('+----+\n| X\u200bY |\n+----+\n')" | rst2xml
--strict > /dev/null
<stdin>:1: (ERROR/3) Malformed table.

+----+
| XY |
+----+
Exiting due to level-3 (ERROR) system message.
$
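The failures above follow directly from the character counts; a minimal
stdlib sketch of why the ZWSP breaks 2D constructs (no Docutils needed):

```python
import unicodedata

text = "X\u200bY"  # "XY" with a zero-width space in between
# len() counts the ZWSP as a third character...
assert len(text) == 3
# ...but it is a Cf (format) character that occupies no display column,
# so a "==" overline that visually covers the text no longer matches
# len(line), and the parser rejects the title or table.
assert unicodedata.category("\u200b") == "Cf"
assert len("==") < len(text)
```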

Not looking at the docutils sources and creating a reST writer based
solely on the specification means that I run into a number of such
issues. The zero-width space appears a lot, as it looks like it is
used quite liberally by Mediawiki.

Regards,

Edward

David Goodger
2017-01-04 04:47:55 UTC
Permalink
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
I'm
guessing the zero-width space (u"\u200B") will do some interesting things.
It ought to work!
... examples of 2D parsing elided ...

I meant that it should work in cases like the one under discussion:
before the "¶" in "``name``¶", in place of an escaped space ("\ ").

reST's 2D parsing (tables, titles, etc.) depends on uniform-width
characters. So yes, zero-width characters do interesting things there.
Post by Edward d'Auvergne
Not looking at the docutils sources and creating a reST writer based
solely on the specification means that I run into a number of such
issues.
It's easy to construct pathological cases that defeat the current reST
parser. Whether or not they are worth the development time to support
is a different issue entirely.
Post by Edward d'Auvergne
The zero-width space appears a lot as it looks like it is
used quite liberally by Mediawiki.
Is this just an interesting intellectual exercise, or should Docutils
change because of this? If so, how?

David Goodger
<http://python.net/~goodger>

Guenter Milde
2017-01-05 10:45:57 UTC
Permalink
Dear Edward and David,
Post by David Goodger
Post by Edward d'Auvergne
I'm guessing the zero-width space (u"\u200B") will do some
interesting things.
...
Post by David Goodger
before the "¶" in "``name``¶", in place of an escaped space ("\ ").
reST's 2D parsing (tables, titles, etc.) depends on uniform-width
characters. So yes, zero-width characters do interesting things there.
This is similar to the problems with East Asian double-width characters
and combining characters.

There have been fixes for double-width characters for ages.

Combining characters are fixed for headings and simple tables in Release
0.9 (2012-05-02). Fixing grid tables is an open task. See
https://sourceforge.net/p/docutils/bugs/128/ (former [ 2926161 ]).
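Combining characters cause the same len()-versus-columns mismatch; a
small stdlib illustration:

```python
import unicodedata

# "é" written as base letter + combining acute accent: two code points,
# one display column.
text = "e\u0301"
assert len(text) == 2
assert unicodedata.combining("\u0301") > 0  # nonzero combining class

# So matching title underlines or table borders against such text with
# plain len() over-counts by one per combining mark.
```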
Post by David Goodger
Post by Edward d'Auvergne
Not looking at the docutils sources and creating a reST writer based
solely on the specification means that I run into a number of such
issues.
It's easy to construct pathological cases that defeat the current reST
parser. Whether or not they are worth the development time to support
is a different issue entirely.
Sometimes the solution is updating/refining/clarifying the specs,
sometimes just giving a pointer to the reference implementation.

In any case, not looking at the docutils sources is a poor basis for
discussion about implementation of fixes.
Post by David Goodger
Post by Edward d'Auvergne
The zero-width space appears a lot as it looks like it is
used quite liberally by Mediawiki.
In the input, or (like the ¶ in Sphinx) only in the output?
Post by David Goodger
Is this just an interesting intellectual exercise, or should Docutils
change because of this? If so, how?
I could imagine the following:

* replace the current handling of combining characters with a version
counting for all zero-width characters.

* clarify in the specs, that "line length" or similar in definitions like

An underline/overline is a single repeated punctuation character that
begins in column 1 and forms a line extending at least as far as the
right edge of the title text.

are valid for monospace characters of unit width with some listed
exceptions.

Günter




Edward d'Auvergne
2017-01-09 13:15:55 UTC
Permalink
Post by Guenter Milde
Dear Edward and David,
Post by David Goodger
Post by Edward d'Auvergne
I'm guessing the zero-width space (u"\u200B") will do some
interesting things.
...
Post by David Goodger
before the "¶" in "``name``¶", in place of an escaped space ("\ ").
reST's 2D parsing (tables, titles, etc.) depends on uniform-width
characters. So yes, zero-width characters do interesting things there.
This is similar to the problems with East Asian double-width characters
and combining characters.
There have been fixes for double-width characters for ages.
Combining characters are fixed for headings and simple tables in Release
0.9 (2012-05-02). Fixing grid tables is an open task. See
https://sourceforge.net/p/docutils/bugs/128/ (former [ 2926161 ]).
Post by David Goodger
Post by Edward d'Auvergne
Not looking at the docutils sources and creating a reST writer based
solely on the specification means that I run into a number of such
issues.
It's easy to construct pathological cases that defeat the current reST
parser. Whether or not they are worth the development time to support
is a different issue entirely.
Sometimes the solution is updating/refining/clarifying the specs,
sometimes just giving a pointer to the reference implementation.
In any case, not looking at the docutils sources is a poor basis for
discussion about implementation of fixes.
Post by David Goodger
Post by Edward d'Auvergne
The zero-width space appears a lot as it looks like it is
used quite liberally by Mediawiki.
In input or (like the ¶ in Sphinx only in output)?
Post by David Goodger
Is this just an interesting intellectual exercise, or should Docutils
change because of this? If so, how?
I could imagine the following:
* replace the current handling of combining characters with a version
counting for all zero-width characters.
* clarify in the specs, that "line length" or similar in definitions like
An underline/overline is a single repeated punctuation character that
begins in column 1 and forms a line extending at least as far as the
right edge of the title text.
are valid for monospace characters of unit width with some listed
exceptions.
Hi Günter,

I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]? If this fast bisect algorithm is of interest,
the Python wcwidth package might need to be downgraded to the 10+ year
old 5.x Unicode standard used in Python 2. Where is the width
algorithm implemented in docutils? I see that double widths are
handled correctly:

$ python3 -c "print('====\n一二\n====')" | rst2xml --strict > /dev/null
$

Is this from the separate_wide_chars() function?
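Whatever the entry point in Docutils, the classification itself is
available from the stdlib; a quick check of why 一二 counts as four
columns under "====":

```python
import unicodedata

# CJK ideographs are classified 'W' (wide), taking two columns in a
# monospaced terminal; plain ASCII letters are 'Na' (narrow).
assert unicodedata.east_asian_width("一") == "W"
assert unicodedata.east_asian_width("二") == "W"
assert unicodedata.east_asian_width("A") == "Na"
```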

Regards,

Edward


[1] https://pypi.python.org/pypi/wcwidth
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

Guenter Milde
2017-01-10 09:59:17 UTC
Permalink
...
Post by Edward d'Auvergne
* replace the current handling of combining characters with a version
counting for all zero-width characters.
* clarify in the specs, that "line length" or similar in definitions like
An underline/overline is a single repeated punctuation character that
begins in column 1 and forms a line extending at least as far as the
right edge of the title text.
are valid for monospace characters of unit width with some listed
exceptions.
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
Thank you for the pointer.
Post by Edward d'Auvergne
If this fast bisect algorithm is of interest,
the Python wcwidth package might need to be downgraded to the 10+ year
old 5.x Unicode standard used in Python 2.
There are several issues when using the wcwidth module:

+1 don't reinvent the wheel:
maintained implementation of a column-width determination function

+1 stability: character tables are part of the module, do not depend on
Python version.

The current implementation of wide-char correction depends on
unicodedata from the installed Python version.

-2 external dependency

-1 updating this module may break rST documents


In addition, even the external module cannot solve the ambiguity:

Example::

from wcwidth import wcswidth
text = u'wait ⌚ or ⌛'
print(text)
print('x' * len(text))
print('x' * wcswidth(text))


For wcswidth, WATCH and HOURGLASS are 2 columns wide.
In my text editor, WATCH and HOURGLASS are single-width characters (which
also makes most sense to me).
On some terminals, both characters are followed by a space to make them
double width. In `geany`, the text panel uses single width and the
terminal panel double width.

The problem is generic:

No established formal standards exist at present on which Unicode
character shall occupy how many cell positions on character terminals.
-- Markus Kuhn http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

IMO, Docutils should account for the display in "common" text editors using
monospaced fonts. Speed is not a primary issue.
Maybe using a local implementation is best.
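A local wcswidth()-like helper could stay within the stdlib. A minimal
sketch (the function name and width rules here are illustrative
assumptions, not Docutils' actual utils.column_width):

```python
import unicodedata

def display_width(text):
    """Approximate columns `text` occupies in a monospaced context."""
    width = 0
    for char in text:
        if unicodedata.combining(char) or unicodedata.category(char) == "Cf":
            continue              # combining marks, ZWSP, WJ, ...: 0 columns
        if unicodedata.east_asian_width(char) in ("W", "F"):
            width += 2            # wide / fullwidth East Asian characters
        else:
            width += 1
    return width

assert display_width("X\u200bY") == 2  # zero-width space ignored
assert display_width("e\u0301") == 1   # combining acute ignored
assert display_width("一二") == 4       # two wide characters
```

Like wcswidth, such a helper still cannot resolve the ambiguous cases
Markus Kuhn describes, so documenting fail-safe source forms remains
necessary.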


The documentation must make the remaining ambiguity clear and point to
fail-safe source text:

* use additional underline characters in section headings and simple tables

* avoid "critical" characters in grid tables (use substitutions if required).
Post by Edward d'Auvergne
Where is the width
algorithm implemented in docutils?
docutils/docutils/statemachine.py:1450: def pad_double_width(self, pad_char):

Uses `unicodedata.east_asian_width`.

@David:

How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables, rather than
padding with `double_width_pad_char`?

+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)

-1 API change


Günter
David Goodger
2017-01-16 22:07:26 UTC
Permalink
Post by Guenter Milde
...
Post by Edward d'Auvergne
* replace the current handling of combining characters with a version
counting for all zero-width characters.
* clarify in the specs, that "line length" or similar in definitions like
An underline/overline is a single repeated punctuation character that
begins in column 1 and forms a line extending at least as far as the
right edge of the title text.
are valid for monospace characters of unit width with some listed
exceptions.
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
Thank you for the pointer.
Post by Edward d'Auvergne
If this fast bisect algorithm is of interest,
the Python wcwidth package might need to be downgraded to the 10+ year
old 5.x Unicode standard used in Python 2.
maintained implementation of a column-width determination function
+1 stability: character tables are part of the module, do not depend on
Python version.
The current implementation of wide-char correction depends on
unicodedata from the installed Python version.
-2 external dependency
-1 updating this module may break rST documents
from wcwidth import wcswidth
text = u'wait ⌚ or ⌛'
print(text)
print('x' * len(text))
print('x' * wcswidth(text))
For wcswidth, WATCH and HOURGLASS are 2 columns wide.
In my text editor, WATCH and HOURGLASS are single-width characters (which
also makes most sense to me).
On some terminals, both characters are followed by space to make them double
width. In `geany`, the text panel uses single width and the terminal panel
double width.
No established formal standards exist at present on which Unicode
character shall occupy how many cell positions on character terminals.
-- Markus Kuhn http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
IMO, Docutils should account for the display in "common" text editors using
monospaced fonts. Speed is no primary issue.
Maybe using a local implementation is best.
The documentation must make clear the remaining ambiguity and point to
* additional underline characters in section headings and simple tables
* avoid "critical" characters in grid tables (use substitutions if required).
Post by Edward d'Auvergne
Where is the width
algorithm implemented in docutils?
Uses `unicodedata.east_asian_width`.
How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables, rather than
padding with `double_width_pad_char`?
Sure, sounds fine to me.
Post by Guenter Milde
+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)
-1 API change
What exactly would the API change be?

David Goodger
<http://python.net/~goodger>
Guenter Milde
2017-01-18 14:34:29 UTC
Permalink
...
Post by David Goodger
Post by Guenter Milde
Post by Edward d'Auvergne
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
...
Post by David Goodger
Post by Guenter Milde
How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables, rather than
padding with `double_width_pad_char`?
Sure, sounds fine to me.
Actually, we already have this for section headings and simple tables:

utils/__init__.py:

601: def strip_combining_chars(text):
606: def find_combining_chars(text):
616: def column_indices(text):

Handle combining characters

639: def column_width(text):

Correct ``len(text)`` for wide East Asian and combining Unicode chars.

Should be easy to augment it for zero-width characters.

However, I did not manage to find out where this should be used in the grid
table parser to prevent errors with combining characters.
Post by David Goodger
Post by Guenter Milde
+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)
-1 API change
What exactly would the API change be?
This depends on the actual implementation:

* When obsoleting the use of pad_double_width() by using
utils.column_width() instead of len(), it would make sense to remove the
pad_double_width() function.

* When accounting for wide characters in utils.column_indices(), the result
may differ from present computations.

* When not padding table source lines in states.isolate_grid_table() and
states.isolate_simple_table(), the `block` returned by these functions
is different from the current implementation.
This may affect alternative table parsers.

* When considering additional zero-width characters (ZWSP, WJ, ZWNBSP
...), table parsing and section heading recognition may change.


Günter
Guenter Milde
2017-01-19 21:28:26 UTC
Permalink
Post by Guenter Milde
However, I did not manage to find out where this should be used in the grid
table parser to prevent errors with combining characters.
I got one step further: with the appended patch, combining characters can be
used in some simple grid tables. However, tables with combining characters
in a multi-line cell produce strange results for unknown reasons
(probably some offset when extracting cell content).

Can anybody help to spot the reason or give advice on how to get more
diagnostic output?


Günter


Index: states.py
===================================================================
--- states.py (Revision 8017)
+++ states.py (Arbeitskopie)
@@ -1708,7 +1708,7 @@
messages.extend(self.malformed_table(block))
return [], messages, blank_finish
for i in range(len(block)): # check right edge
- if len(block[i]) != width or block[i][-1] not in '+|':
+ if len(utils.strip_combining_chars(block[i])) != width or block[i][-1] not in '+|':
messages.extend(self.malformed_table(block))
return [], messages, blank_finish
return block, messages, blank_finish
Index: tableparser.py
===================================================================
--- tableparser.py (Revision 8017)
+++ tableparser.py (Arbeitskopie)
@@ -167,7 +167,11 @@
We'll end up knowing all the row and column boundaries, cell positions
and their dimensions.
"""
+ # a copy of the block with zero-width characters taken out:
+ self.grid_block = [strip_combining_chars(line) for line in self.block]
+
corners = [(0, 0)]
+
while corners:
top, left = corners.pop(0)
if top == self.bottom or left == self.right \
@@ -208,7 +212,7 @@

def scan_cell(self, top, left):
"""Starting at the top-left corner, start tracing out a cell."""
- assert self.block[top][left] == '+'
+ assert self.grid_block[top][left] == '+'
result = self.scan_right(top, left)
return result

@@ -218,7 +222,7 @@
boundaries ('+').
"""
colseps = {}
- line = self.block[top]
+ line = self.grid_block[top]
for i in range(left + 1, self.right + 1):
if line[i] == '+':
colseps[i] = [top]
@@ -238,7 +242,7 @@
"""
rowseps = {}
for i in range(top + 1, self.bottom + 1):
- if self.block[i][right] == '+':
+ if self.grid_block[i][right] == '+':
rowseps[i] = [right]
result = self.scan_left(top, left, i, right)
if result:
@@ -245,7 +249,7 @@
newrowseps, colseps = result
update_dict_of_lists(rowseps, newrowseps)
return i, rowseps, colseps
- elif self.block[i][right] != '|':
+ elif self.grid_block[i][right] != '|':
return None
return None

@@ -255,7 +259,7 @@
It must line up with the starting point.
"""
colseps = {}
- line = self.block[bottom]
+ line = self.grid_block[bottom]
for i in range(right - 1, left, -1):
if line[i] == '+':
colseps[i] = [bottom]
@@ -275,9 +279,9 @@
"""
rowseps = {}
for i in range(bottom - 1, top, -1):
- if self.block[i][left] == '+':
+ if self.grid_block[i][left] == '+':
rowseps[i] = [left]
- elif self.block[i][left] != '|':
+ elif self.grid_block[i][left] != '|':
return None
return rowseps
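The patch above relies on strip_combining_chars(); docutils ships such a
helper in docutils.utils, and a minimal stdlib-only equivalent (for
illustration only) is:

```python
import unicodedata

def strip_combining_chars(text):
    # Remove combining marks so that source lines containing accented
    # characters line up with the drawn table borders column-wise.
    return ''.join(ch for ch in text if not unicodedata.combining(ch))
```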


Test file:

Tables with zero-width and wide Unicode characters
--------------------------------------------------

Some Unicode characters have a different width even in monospaced fonts:

width 0: combining characters, zero-width space, word joiner

width 2: East Asian wide characters

Tables that work
~~~~~~~~~~~~~~~~

================================ =========
Cell with combining char á acute next cell
================================ =========

+----------------------------------+
| Cell with combining char á acute |
+----------------------------------+

+--------------------------------+---+
|Cell with combining char á acute|á |
+--------------------------------+---+

+--------------------------+--------+
| Cell with combining char | column |
| á acute in second line | two á |
+--------------------------+--------+

+----------------------------------+
| first row |
+----------------------------------+
| Cell with combining char á acute |
+----------------------------------+

+----------------------------------+
| Cell with combining char á acute |
+----------------------------------+
| second row |
+----------------------------------+

+----------------------------------+----------------------------------+
| Header with combining á acute | next header without combi-char |
+==================================+==================================+
| second row | second row second column |
+----------------------------------+----------------------------------+


+----------------------------------+
| Header |
+==================================+
| Cell with combining char á acute |
+----------------------------------+

+------------------------+-----------+-----------+-----------+
| Heáder row, column 1 | Heáder 2 | Heáder 3 | Header 4 |
+========================+===========+===========+===========+
| body row 1, column 1 | column 2 | column 3 | column 4 |
+------------------------+-----------+-----------+-----------+



Tables that fail
~~~~~~~~~~~~~~~~

Spurious character «|»:

+-----------------------+
| á acute |
| two lines |
+-----------------------+

Table cell is parsed as a definition list, as is the following cell (without
a combining character):

+-----------------------+----------+
| á acute + empty line | test |
| | |
+-----------------------+----------+


A complex table with combining characters at different places:

+------------------------+------------+----------+----------+
| Header row, column 1 | Header 2 | Heáder 3 | Header 4 |
| (heáder rows optional) | | | |
+========================+============+==========+==========+
| body row 1, column 1 | column 2 | column 3 | column 4 |
+------------------------+------------+----------+----------+
| body row 2 | Cells may spán columns. |
+------------------------+------------+---------------------+
| body row 3 | Cells may | - Table cells |
+------------------------+ span rows. | - contain |
| body ròw 4 | | - body elements. |
| | Parágraph. | |
+------------------------+------------+----------+----------+
| body row 5 | Cells máy also be | |
| | empty: ``-->`` | |
+------------------------+-----------------------+----------+






------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Docutils-develop mailing list
Docutils-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/docutils-develop

Please use "Reply All" to reply to the list.
David Goodger
2017-01-29 22:58:01 UTC
Permalink
Post by Guenter Milde
...
Post by David Goodger
Post by Guenter Milde
Post by Edward d'Auvergne
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
...
Post by David Goodger
Post by Guenter Milde
How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables instead of the
padding with `double_width_pad_char`?
Sure, sounds fine to me.
Handle combining characters
Correct ``len(text)`` for wide East Asian and combining Unicode chars.
Should be easy to augment it for zero-width characters.
However, I did not manage to find out where this should be used in the grid
table parser to prevent errors with combining characters.
Post by David Goodger
Post by Guenter Milde
+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)
-1 API change
What exactly would the API change be?
* When obsoleting the use of pad_double_width() by using
utils.column_width() instead of len(), it would make sense to remove the
pad_double_width() function.
* When accounting for wide characters in utils.column_indices(), the result
may differ from present computations.
* When not padding table source lines in states.isolate_grid_table() and
states.isolate_simple_table(), the `block` returned by these functions
is different from the current implementation.
This may affect alternative table parsers.
* When considering additional zero-width characters (ZWSP, WJ, ZWNBSP
...), table parsing and section heading recognition may change.
Seems like more trouble than it's worth. I'm not too worried by this.
It may only warrant a mention in todo.txt for now (with reference back
to this thread), and we can reexamine it if it ever comes up in real
life examples.

David Goodger
<http://python.net/~goodger>

Edward d'Auvergne
2017-02-05 12:13:10 UTC
Permalink
Post by David Goodger
Post by Guenter Milde
...
Post by David Goodger
Post by Guenter Milde
Post by Edward d'Auvergne
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
...
Post by David Goodger
Post by Guenter Milde
How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables instead of the
padding with `double_width_pad_char`?
Sure, sounds fine to me.
Handle combining characters
Correct ``len(text)`` for wide East Asian and combining Unicode chars.
Should be easy to augment it for zero-width characters.
However, I did not manage to find out where this should be used in the grid
table parser to prevent errors with combining characters.
Post by David Goodger
Post by Guenter Milde
+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)
-1 API change
What exactly would the API change be?
* When obsoleting the use of pad_double_width() by using
utils.column_width() instead of len(), it would make sense to remove the
pad_double_width() function.
* When accounting for wide characters in utils.column_indices(), the result
may differ from present computations.
* When not padding table source lines in states.isolate_grid_table() and
states.isolate_simple_table(), the `block` returned by these functions
is different from the current implementation.
This may affect alternative table parsers.
* When considering additional zero-width characters (ZWSP, WJ, ZWNBSP
...), table parsing and section heading recognition may change.
Seems like more trouble than it's worth. I'm not too worried by this.
It may only warrant a mention in todo.txt for now (with reference back
to this thread), and we can reexamine it if it ever comes up in real
life examples.
It depends on your perspective - if you are interested in any Asian
language, for example, then the worth could seem to have a greater weight
than the trouble. For example in Chinese, Japanese, or Korean where there
are no spaces between words (unless some English terms are dropped in),
these zero width space characters can be quite common for controlling the
text flow (line breaking, etc.). You can use the WJ character to stop your
name from being split across lines due to line wrapping in the desired
output format.

For the wcwidth Python module, note that some characters are currently not
correctly identified (https://github.com/jquast/wcwidth/issues/22).

Regards,

Edward
Guenter Milde
2017-02-05 18:41:38 UTC
Permalink
...
Post by Edward d'Auvergne
Post by David Goodger
Seems like more trouble than it's worth. I'm not too worried by this.
It may only warrant a mention in todo.txt for now (with reference back
to this thread), and we can reexamine it if it ever comes up in real
life examples.
It depends on your perspective - if you are interested in any Asian
language, for example, then the worth could seem to have a greater weight
than the trouble.
You are right - however, there is already a partial implementation of
wcwidth-like behaviour. The current situation (Docutils 0.13, but actually
for some time now) is:

OK:

* East Asian wide characters count 2 in tables and section headings

* combining characters count 0 in simple tables and section headings

TODO:

* combining characters count 1 in grid tables (hard to correct)

* zero width characters count 1 in tables and section headings
(easy to correct in simple tables and section headings).
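Extending the width computation to the zero-width characters listed in the
TODO items is indeed a small change. A sketch (assuming a wcswidth-like
helper as discussed earlier in the thread, not docutils' actual code):

```python
import unicodedata

# Zero-width characters: ZWSP, WJ (word joiner), ZWNBSP (BOM)
ZERO_WIDTH = frozenset(u'\u200b\u2060\ufeff')

def char_width(char):
    if char in ZERO_WIDTH or unicodedata.combining(char):
        return 0
    if unicodedata.east_asian_width(char) in ('W', 'F'):
        return 2
    return 1

def column_width(text):
    return sum(char_width(c) for c in text)
```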
Post by Edward d'Auvergne
For the wcwidth Python module, note that some characters are currently not
correctly identified (https://github.com/jquast/wcwidth/issues/22).
Given the problem of deciding "what characters are wide", the not completely
perfect state seems acceptable for now.

Feel free to open a bug report for ZWSP and the like or add some weight
to https://sourceforge.net/p/docutils/bugs/128/. Then, it might be fixed
at least for headings and simple tables in the near future...


Günter



Guenter Milde
2017-01-03 11:38:23 UTC
Permalink
Post by David Goodger
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
some other characters changed, too.
...
Post by David Goodger
For the implementation, we should definitely state a specific Unicode
version.
Agreed.
Post by David Goodger
docutils/docutils/utils/punctuation_chars.py could be updated
using the very latest Python (e.g. 3.6.0 as of this writing) with
whatever version of Unicode it supports.
Agreed.
Note: this would be an API change with the potential to make existing rST
documents invalid.
Post by David Goodger
For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
Notes:

1. This would make the validity of rST documents dependent on the
Unicode version used in the underlying platform.

Examples:

a) ``*test*¶``

invalid with Unicode 5.2.0 (e.g. Python 2.7.12)
valid with Unicode 8.0.0 (e.g., Python 3.5)

b) 25*¶

valid with Unicode 5.2.0 (e.g. Python 2.7.12)
invalid with Unicode 8.0.0 (e.g., Python 3.5)
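The version dependence in these examples is easy to observe with the stdlib
unicodedata module; the pilcrow's category is what changed between the
Unicode versions (output depends on the interpreter running the check):

```python
import unicodedata

# Unicode version the interpreter's character database is based on
print(unicodedata.unidata_version)

# Category of the pilcrow sign U+00B6: 'So' in older Unicode
# versions (e.g. 5.2.0), 'Po' in newer ones (e.g. 8.0.0)
print(unicodedata.category(u'\u00b6'))
```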

2. Currently, the character-specific behavior does not depend on the Python
interpreter, the relevant pattern strings are pre-generated and stored in
utils/punctuation_chars.py for efficiency and stability.

For consistency, we would need to

a) create the patterns in every Docutils invocation with the help of
the unicodedata module, or

b) reword the proposed spec amendment to
Post by David Goodger
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode
in the interpreter (e.g. Docutils).

Günter


David Goodger
2017-01-04 04:55:10 UTC
Permalink
Post by Guenter Milde
Post by David Goodger
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
some other characters changed, too.
...
Post by David Goodger
For the implementation, we should definitely state a specific Unicode
version.
Agreed.
Post by David Goodger
docutils/docutils/utils/punctuation_chars.py could be updated
using the very latest Python (e.g. 3.6.0 as of this writing) with
whatever version of Unicode it supports.
Agreed.
Note: this would be an API change with the potential to make existing rST
documents invalid.
Post by David Goodger
For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
1. This would make the validity of rST documents dependent on the
Unicode version used in the underlying platform.
a) ``*test*¶``
invalid with Unicode 5.2.0 (e.g. Python 2.7.12)
valid with Unicode 8.0.0 (e.g., Python 3.5)
b) 25*¶
valid with Unicode 5.2.0 (e.g. Python 2.7.12)
invalid with Unicode 8.0.0 (e.g., Python 3.5)
Which is one reason why the utils/punctuation_chars.py approach is
better, right? So let's just keep that approach, document the Unicode
version used to generate that code, and we won't have these issues.
Post by Guenter Milde
2. Currently, the character-specific behavior does not depend on the Python
interpreter, the relevant pattern strings are pre-generated and stored in
utils/punctuation_chars.py for efficiency and stability.
Which is good. Let's keep it that way.
Post by Guenter Milde
For consistency, we would need to
I don't understand what you're getting at here. "For consistency" with what?
Post by Guenter Milde
a) create the patterns in every Docutils invocation with the help of
the unicodedata module, or
b) reword the proposed spec amendment to
Post by David Goodger
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode
in the interpreter (e.g. Docutils).
Or:

c) State in the spec that character-specific behavior (punctuation,
whitespace, openers/closers) is dependent on the version of Unicode
used to generate utils/punctuation_chars.py, and that version was:
A.B.C (as implemented in Python X.Y.Z).

David Goodger
<http://python.net/~goodger>

Guenter Milde
2017-01-04 20:44:12 UTC
Permalink
Dear David,

thank you for the fast reply.
...
Post by David Goodger
Post by Guenter Milde
Post by David Goodger
For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
...
Post by David Goodger
Post by Guenter Milde
2. Currently, the character-specific behavior does not depend on the
Python interpreter, the relevant pattern strings are pre-generated
and stored in utils/punctuation_chars.py for efficiency and
stability.
Which is good. Let's keep it that way.
Post by Guenter Milde
For consistency, we would need to
I don't understand what you're getting at here. "For consistency" with what?
Your proposal for the spec
Post by David Goodger
Post by Guenter Milde
Post by David Goodger
... dependent on the version of Unicode in the underlying platform (the
Python interpreter running Docutils)" ...
contradicts the current implementation.

I am glad to hear you agree to keep the implementation via pre-generated
strings, and I suggest adapting the spec to reflect this decision:

+.. [#uni-version] The category of some characters changed with the
+ development of the Unicode standard.
+ Docutils 0.13 uses `Unicode version 5.2.0`_.

The patch below also contains the update for the script generating the
strings, so that it also works under Py3k. OK to commit?

Günter







Index: docutils/utils/punctuation_chars.py
===================================================================
--- docutils/utils/punctuation_chars.py (Revision 8010)
+++ docutils/utils/punctuation_chars.py (Arbeitskopie)
@@ -11,6 +11,8 @@
# .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause

# :Id: $Id$
+#
+# ::

import sys, re
import unicodedata
@@ -20,16 +22,31 @@
#
# This module provides the lists of characters for the implementation of
# the `inline markup recognition rules`_ in the reStructuredText parser
-# (states.py)
+# `<../parsers/rst/states.py>`__.
+# For efficiency and stability, the lists are pre-generated and stored in
+# module-level variables.
#
+# The category of some characters changed with the development of the Unicode
+# standard. The current lists are generated with the help of the "unicodedata"
+# module of Python 2.7 (based on the UnicodeData.txt file version 5.2.0).
+#
+# Running this file as a standalone module (``__main__``) checks the
+# definitions against a re-calculation. Updating the lists with a new
+# Unicode standard version is an API change (may render valid rST documents
+# invalid). It should only be done for "feature releases" and requires
+# also updating the specification of `inline markup recognition rules`_ in
+# ../../docs/ref/rst/restructuredtext.txt.
+#
# .. _inline markup recognition rules:
# ../../docs/ref/rst/restructuredtext.html#inline-markup
-
+#
+#
# Docutils punctuation category sample strings
# --------------------------------------------
#
# The sample strings are generated by punctuation_samples() and put here
-# literal to avoid the time-consuming generation with every Docutils run.
+# literal to avoid the time-consuming generation with every Docutils run
+# and prevent dependence on the Python version.
# As the samples are used inside ``[ ]`` in regular expressions, hyphen and
# square brackets are escaped. ::

@@ -83,12 +100,14 @@

# Matching open/close quotes
# --------------------------
-
+#
# Rule (5) requires determination of matching open/close pairs. However,
# the pairing of open/close quotes is ambiguous due to different typographic
-# conventions in different languages.
+# conventions in different languages. The following dictionary specifies
+# additional valid matches::

-quote_pairs = {u'\xbb': u'\xbb', # Swedish
+quote_pairs = {# open char: matching closing characters
+ u'\xbb': u'\xbb', # Swedish
u'\u2018': u'\u201a', # Greek
u'\u2019': u'\u2019', # Swedish
u'\u201a': u'\u2018\u2019', # German, Polish
@@ -99,11 +118,12 @@
}

def match_chars(c1, c2):
+ """Test whether `c1` and `c2` are a matching open/close character pair."""
try:
i = openers.index(c1)
except ValueError: # c1 not in openers
return False
- return c2 == closers[i] or c2 in quote_pairs.get(c1, '')
+ return c2 == closers[i] or c2 in quote_pairs.get(c1, u'')


# Running this file as a standalone module checks the definitions against a
@@ -114,6 +134,10 @@

# Unicode punctuation character categories
# ----------------------------------------
+#
+# For details about Unicode categories, see
+# http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
+# ::

unicode_punctuation_categories = {
# 'Pc': 'Connector', # not used in Docutils inline markup recognition
@@ -129,6 +153,8 @@

# generate character pattern strings
# ==================================
+#
+# ::

def unicode_charlists(categories, cp_min=0, cp_max=None):
"""Return dictionary of Unicode character lists.
@@ -157,6 +183,8 @@

# Character categories in Docutils
# --------------------------------
+#
+# ::

def punctuation_samples():

@@ -203,18 +231,18 @@

# allowed before markup if there is a matching closer
openers = [u'"\'(<\\[{']
- for cat in ('Ps', 'Pi', 'Pf'):
- openers.extend(ucharlists[cat])
+ for category in ('Ps', 'Pi', 'Pf'):
+ openers.extend(ucharlists[category])

# allowed after markup if there is a matching opener
closers = [u'"\')>\\]}']
- for cat in ('Pe', 'Pf', 'Pi'):
- closers.extend(ucharlists[cat])
+ for category in ('Pe', 'Pf', 'Pi'):
+ closers.extend(ucharlists[category])

# non-matching, allowed on both sides
delimiters = [u'\\-/:']
- for cat in ('Pd', 'Po'):
- delimiters.extend(ucharlists[cat])
+ for category in ('Pd', 'Po'):
+ delimiters.extend(ucharlists[category])

# non-matching, after markup
closing_delimiters = [r'\\.,;!?']
@@ -260,13 +288,13 @@

return ''.join(l2)

- def wrap_string(s, startstring= "(",
- endstring = ")", wrap=65):
+ def wrap_string(s, startstring= "(u'",
+ endstring = "')", wrap=65):
"""Line-wrap a unicode string literal definition."""
c = len(startstring)
- contstring = "'\n" + ' ' * len(startstring) + "u'"
+ contstring = "'\n" + ' ' * (len(startstring)-2) + "u'"
l = [startstring]
- for ch in s:
+ for ch in s.replace("'", r"\'"):
c += 1
if ch == '\\' and c > wrap:
c = len(startstring)
@@ -276,10 +304,25 @@
return ''.join(l)


+ def print_differences(old, new, name):
+ """List characters missing in old/new."""
+ if old != new:
+ print('new %s:' % name)
+ for c in new:
+ if c not in old:
+ print ' %04x'%ord(c), unicodedata.name(c)
+ print('removed %s:' % name)
+ for c in old:
+ if c not in new:
+ print ' %04x'%ord(c), unicodedata.name(c)
+
+
# print results
# =============
-
+#
# (re) create and compare the samples:
+#
+# ::

(o, c, d, cd) = punctuation_samples()
o, o_wide = separate_wide_chars(o)
@@ -287,40 +330,44 @@
d, d_wide = separate_wide_chars(d)
d = d[:5] + mark_intervals(d[5:])
d_wide = mark_intervals(d_wide)
- if sys.maxunicode >= 0x10FFFF: # "wide" build
- d += d_wide
- if o != openers:
- print '- openers = ur"""%s"""' % openers.encode('utf8')
- print '+ openers = ur"""%s"""' % o.encode('utf8')
+
+ print_differences(openers, o, 'openers')
if o_wide:
print '+ openers-wide = ur"""%s"""' % o_wide.encode('utf8')
- if c != closers:
- print '- closers = ur"""%s"""' % closers.encode('utf8')
- print '+ closers = ur"""%s"""' % c.encode('utf8')
+ print_differences(closers, c, 'closers')
if c_wide:
print '+ closers-wide = ur"""%s"""' % c_wide.encode('utf8')
- if d != delimiters:
- print '- delimiters = ur"%s"' % delimiters.encode('utf8')
- print '+ delimiters = ur"%s"' % d.encode('utf8')
- if cd != closing_delimiters:
- print '- closing_delimiters = ur"%s"' % closing_delimiters.encode('utf8')
- print '+ closing_delimiters = ur"%s"' % cd.encode('utf8')
- # closing_delimiters are all ASCII characters

+ print_differences(delimiters, d + d_wide, 'delimiters')
+ print_differences(closing_delimiters, cd, 'closing_delimiters')
+
# Print literal code to define the character sets:
+#
+# ::

+ print '# based on Unicode version', unicodedata.unidata_version
+
# `openers` and `closers` must be verbose and keep order because they are
# also used in `match_chars()`.
- print wrap_string(repr(o), startstring='openers = (')
- print wrap_string(repr(c), startstring='closers = (')
+ print wrap_string(o.encode('unicode-escape').decode(),
+ startstring="openers = (u'")
+ print wrap_string(c.encode('unicode-escape').decode(),
+ startstring="closers = (u'")
# delimiters: sort and use shortcut for intervals (saves ~150 characters):
- print wrap_string(repr(d), startstring='delimiters = (')
+ print wrap_string(d.encode('unicode-escape').decode(),
+ startstring="delimiters = (u'")
# add characters in the upper plane only in a "wide" build:
print 'if sys.maxunicode >= 0x10FFFF: # "wide" build'
- print wrap_string(repr(d_wide), startstring=' delimiters += (')
- print 'closing_delimiters =', repr(cd)
+ print wrap_string(d_wide.encode('unicode-escape').decode(),
+ startstring=" delimiters += (u'")
+ # additional closing delimiters:
+ print wrap_string(cd.encode('unicode-escape').decode(),
+ startstring="closing_delimiters = (u'")
+

# test prints
+#
+# ::

# print "wide" Unicode characters:
# ucharlists = unicode_charlists(unicode_punctuation_categories)
@@ -340,6 +387,8 @@
# print (u":%s: %s" % (cat, u''.join(chars))).encode('utf8')

# verbose print
+#
+# ::

# print 'openers:'
# for ch in openers:
Index: docs/ref/rst/restructuredtext.txt
===================================================================
--- docs/ref/rst/restructuredtext.txt (Revision 8010)
+++ docs/ref/rst/restructuredtext.txt (Arbeitskopie)
@@ -2442,11 +2442,11 @@
See `Escaping Mechanism`_ above for details.

5. If an inline markup start-string is immediately preceded by one of the
- ASCII characters ``' " < ( [ {`` or a similar Unicode character\ [#]_, it
- must not be followed by the corresponding closing character from ``' " )
- ] } >`` or a similar Unicode character\ [#]_.
- (For quotes, corresponding characters can be any of the `quotation marks
- in international usage`_.)
+ ASCII characters ``' " < ( [ {`` or a similar
+ Unicode character\ [#openers]_, it must not be followed by the
+ corresponding closing character from ``' " ) ] } >`` or a similar Unicode
+ character\ [#closers]_. (For quotes, matching characters can be any of
+ the `quotation marks in international usage`_.)

If the configuration setting `simple-inline-markup`_ is False (default),
additional conditions apply to the characters "around" the
@@ -2457,7 +2457,7 @@

* whitespace,
* one of the ASCII characters ``- : / ' " < ( [ {``
- * or a similar Unicode punctuation character.\ [#]_
+ * or a similar Unicode punctuation character.\ [#pre-chars]_

7. Inline markup end-strings must end a text block or be immediately
followed by
@@ -2464,18 +2464,24 @@

* whitespace,
* one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >``
- * or a similar Unicode punctuation character.\ [#]_
+ * or a similar Unicode punctuation character.\ [#post-chars]_

-.. [#] `Unicode categories`_ `Ps`, `Pi`, or `Pf`
-.. [#] Unicode categories `Pe`, `Pf`, or `Pi`
-.. [#] Unicode categories `Pd` (Dash), `Po` (Other), `Pi` (Initial quote),
- `Pf` (Final quote), or `Ps` (Open)
-.. [#] Unicode categories `Pd` (Dash), `Po` (Other), `Pi` (Initial quote),
- `Pf` (Final quote), or `Pe` (Close)
+.. [#openers] `Unicode categories`_ `Ps` (Open), `Pi` (Initial quote),
+ or `Pf` (Final quote).\ [#uni-version]_
+.. [#closers] Unicode categories `Pe` (Close), `Pi` (Initial quote),
+ or `Pf` (Final quote).\ [#uni-version]_
+.. [#pre-chars] Unicode categories `Ps` (Open), `Pi` (Initial quote),
+ `Pf` (Final quote), `Pd` (Dash), or `Po` (Other).\ [#uni-version]_
+.. [#post-chars] Unicode categories `Pe` (Close), `Pi` (Initial quote),
+ `Pf` (Final quote), `Pd` (Dash), or `Po` (Other).\ [#uni-version]_

+.. [#uni-version] The category of some characters changed with the
+ development of the Unicode standard.
+ Docutils 0.13 uses `Unicode version 5.2.0`_.
+
.. _Unicode categories:
http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
-
+.. _Unicode version 5.2.0: http://www.unicode.org/Public/5.2.0/
.. _quotation marks in international usage:
http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage
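For context, the match_chars() helper touched by the patch pairs each
opener with the closer at the same index, falling back to quote_pairs for
language-specific matches. Its behaviour can be sketched independently of
docutils with hypothetical minimal lists (the real ones are generated from
the Unicode categories):

```python
# Minimal stand-ins for the generated openers/closers strings;
# index i of `openers` corresponds to index i of `closers`.
openers = u'(<[{\u00ab\u2018\u201c\u00bb'
closers = u')>]}\u00bb\u2019\u201d\u00ab'
quote_pairs = {u'\u00bb': u'\u00bb'}  # Swedish: » ... »

def match_chars(c1, c2):
    """Test whether `c1` and `c2` are a matching open/close pair."""
    try:
        i = openers.index(c1)
    except ValueError:  # c1 is not an opener
        return False
    return c2 == closers[i] or c2 in quote_pairs.get(c1, u'')
```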



David Goodger
2017-01-04 21:08:31 UTC
Permalink
The content looks fine. Some stylistic gripes:

1. In docs/ref/rst/restructuredtext.txt, why are there escaped spaces
("\ ") before each footnote reference? That makes the resulting HTML
look weird. I recommend removing the backslash-escapes.

2. In docutils/utils/punctuation_chars.py, there are a lot of
module-level definitions, function definitions, and raw executable
code under the "if __name__ == '__main__':" guard. That smells really
bad. I recommend pulling everything out to the module level, putting
the module-level executable code in well-named functions, and reducing
the contents of "if __name__ == '__main__':" to a simple function call
(or two or three). (This is not specific to this patch, but to the
module itself. As it is, the module *must* be run from the command
line. It cannot be tested from another script or from the interactive
interpreter, which is bad.)

Feel free to make those changes in a separate commit. Or I can do them
if you like.

David Goodger
<http://python.net/~goodger>
Post by Guenter Milde
Dear David,
thank you for the fast reply.
...
Post by David Goodger
Post by Guenter Milde
Post by David Goodger
For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
...
Post by David Goodger
Post by Guenter Milde
2. Currently, the character-specific behavior does not depend on the
Python interpreter, the relevant pattern strings are pre-generated
and stored in utils/punctuation_chars.py for efficiency and
stability.
Which is good. Let's keep it that way.
Post by Guenter Milde
For consistency, we would need to
I don't understand what you're getting at here. "For consistency" with what?
Your proposal for the spec
Post by David Goodger
Post by Guenter Milde
Post by David Goodger
... dependent on the version of Unicode in the underlying platform (the
Python interpreter running Docutils)" ...
contradicts the current implementation.
I am glad to hear you agree to keep the implementation via pre-generated
+.. [#uni-version] The category of some characters changed with the
+ development of the Unicode standard.
+ Docutils 0.13 uses `Unicode version 5.2.0`_.
The patch below contains also the update for the script generating the
strings so that this also works under Py3k. OK to commit?
Günter
Index: docutils/utils/punctuation_chars.py
===================================================================
--- docutils/utils/punctuation_chars.py (Revision 8010)
+++ docutils/utils/punctuation_chars.py (Arbeitskopie)
@@ -11,6 +11,8 @@
# .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
# :Id: $Id$
+#
import sys, re
import unicodedata
@@ -20,16 +22,31 @@
#
# This module provides the lists of characters for the implementation of
# the `inline markup recognition rules`_ in the reStructuredText parser
-# (states.py)
+# `<../parsers/rst/states.py>`__.
+# For efficiency and stability, the lists are pre-generated and stored in
+# module-level variables.
#
+# The category of some characters changed with the development of the Unicode
+# standard. The current lists are generated with the help of the "unicodedata"
+# module of Python 2.7 (based on the UnicodeData.txt file version 5.2.0).
+#
+# Running this file as a standalone module (``__main__``) checks the
+# definitions against a re-calculation. Updating the lists with a new
+# Unicode standard version is an API change (may render valid rST documents
+# invalid). It should only be done for "feature releases" and requires
+# also updating the specification of `inline markup recognition rules`_ in
+# ../../docs/ref/rst/restructuredtext.txt.
+#
# ../../docs/ref/rst/restructuredtext.html#inline-markup
-
+#
+#
# Docutils punctuation category sample strings
# --------------------------------------------
#
# The sample strings are generated by punctuation_samples() and put here
-# literal to avoid the time-consuming generation with every Docutils run.
+# literal to avoid the time-consuming generation with every Docutils run
+# and prevent dependence on the Python version.
# As the samples are used inside ``[ ]`` in regular expressions, hyphen and
@@ -83,12 +100,14 @@
# Matching open/close quotes
# --------------------------
-
+#
# Rule (5) requires determination of matching open/close pairs. However,
# the pairing of open/close quotes is ambigue due to different typographic
-# conventions in different languages.
+# conventions in different languages. The following dictionary specifies
-quote_pairs = {u'\xbb': u'\xbb', # Swedish
+quote_pairs = {# open char: matching closing characters
+ u'\xbb': u'\xbb', # Swedish
u'\u2018': u'\u201a', # Greek
u'\u2019': u'\u2019', # Swedish
u'\u201a': u'\u2018\u2019', # German, Polish
@@ -99,11 +118,12 @@
}
+ """Test whether `c1` and `c2` are a matching open/close character pair."""
i = openers.index(c1)
except ValueError: # c1 not in openers
return False
- return c2 == closers[i] or c2 in quote_pairs.get(c1, '')
+ return c2 == closers[i] or c2 in quote_pairs.get(c1, u'')
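For readers following along, the matching logic the hunk above touches can be reproduced as a small self-contained sketch (the sample strings are shortened here; the real `openers`/`closers` are the long generated strings shown later in this thread, and only two `quote_pairs` entries are kept):

```python
# Self-contained sketch of match_chars() with shortened sample strings.
openers = u'"\'(<[{\xbb'     # last entry: » (also used as an opener in Swedish)
closers = u'"\')>]}\xab'
quote_pairs = {u'\xbb': u'\xbb',       # Swedish »quote»
               u'\u2019': u'\u2019'}   # Swedish 'quote'

def match_chars(c1, c2):
    """Test whether `c1` and `c2` are a matching open/close character pair."""
    try:
        i = openers.index(c1)
    except ValueError:  # c1 not in openers
        return False
    return c2 == closers[i] or c2 in quote_pairs.get(c1, u'')
```

The `quote_pairs` lookup is what lets a language-specific closer (e.g. a second ») match even though it is not the positional partner in `closers`.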
# Running this file as a standalone module checks the definitions against a
@@ -114,6 +134,10 @@
# Unicode punctuation character categories
# ----------------------------------------
+#
+# For details about Unicode categories, see
+# http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
unicode_punctuation_categories = {
# 'Pc': 'Connector', # not used in Docutils inline markup recognition
@@ -129,6 +153,8 @@
# generate character pattern strings
# ==================================
+#
"""Return dictionary of Unicode character lists.
@@ -157,6 +183,8 @@
# Character categories in Docutils
# --------------------------------
+#
@@ -203,18 +231,18 @@
# allowed before markup if there is a matching closer
openers = [u'"\'(<\\[{']
- openers.extend(ucharlists[cat])
+ openers.extend(ucharlists[category])
# allowed after markup if there is a matching opener
closers = [u'"\')>\\]}']
- closers.extend(ucharlists[cat])
+ closers.extend(ucharlists[category])
# non-matching, allowed on both sides
delimiters = [u'\\-/:']
- delimiters.extend(ucharlists[cat])
+ delimiters.extend(ucharlists[category])
# non-matching, after markup
closing_delimiters = [r'\\.,;!?']
@@ -260,13 +288,13 @@
return ''.join(l2)
- def wrap_string(s, startstring= "(",
+ def wrap_string(s, startstring= "(u'",
"""Line-wrap a unicode string literal definition."""
c = len(startstring)
- contstring = "'\n" + ' ' * len(startstring) + "u'"
+ contstring = "'\n" + ' ' * (len(startstring)-2) + "u'"
l = [startstring]
c += 1
c = len(startstring)
@@ -276,10 +304,25 @@
return ''.join(l)
+ """List characters missing in old/new."""
+ print('new %s:' % name)
+ print ' %04x'%ord(c), unicodedata.name(c)
+ print('removed %s:' % name)
+ print ' %04x'%ord(c), unicodedata.name(c)
+
+
# print results
# =============
-
+#
+#
(o, c, d, cd) = punctuation_samples()
o, o_wide = separate_wide_chars(o)
@@ -287,40 +330,44 @@
d, d_wide = separate_wide_chars(d)
d = d[:5] + mark_intervals(d[5:])
d_wide = mark_intervals(d_wide)
- if sys.maxunicode >= 0x10FFFF: # "wide" build
- d += d_wide
- print '- openers = ur"""%s"""' % openers.encode('utf8')
- print '+ openers = ur"""%s"""' % o.encode('utf8')
+
+ print_differences(openers, o, 'openers')
print '+ openers-wide = ur"""%s"""' % o_wide.encode('utf8')
- print '- closers = ur"""%s"""' % closers.encode('utf8')
- print '+ closers = ur"""%s"""' % c.encode('utf8')
+ print_differences(closers, c, 'closers')
print '+ closers-wide = ur"""%s"""' % c_wide.encode('utf8')
- print '- delimiters = ur"%s"' % delimiters.encode('utf8')
- print '+ delimiters = ur"%s"' % d.encode('utf8')
- print '- closing_delimiters = ur"%s"' % closing_delimiters.encode('utf8')
- print '+ closing_delimiters = ur"%s"' % cd.encode('utf8')
- # closing_delimiters are all ASCII characters
+ print_differences(delimiters, d + d_wide, 'delimiters')
+ print_differences(closing_delimiters, cd, 'closing_delimiters')
+
+#
+ print '# based on Unicode version', unicodedata.unidata_version
+
# `openers` and `closers` must be verbose and keep order because they are
# also used in `match_chars()`.
- print wrap_string(repr(o), startstring='openers = (')
- print wrap_string(repr(c), startstring='closers = (')
+ print wrap_string(o.encode('unicode-escape').decode(),
+ startstring="openers = (u'")
+ print wrap_string(c.encode('unicode-escape').decode(),
+ startstring="closers = (u'")
- print wrap_string(repr(d), startstring='delimiters = (')
+ print wrap_string(d.encode('unicode-escape').decode(),
+ startstring="delimiters = (u'")
print 'if sys.maxunicode >= 0x10FFFF: # "wide" build'
- print wrap_string(repr(d_wide), startstring=' delimiters += (')
- print 'closing_delimiters =', repr(cd)
+ print wrap_string(d_wide.encode('unicode-escape').decode(),
+ startstring=" delimiters += (u'")
+ print wrap_string(cd.encode('unicode-escape').decode(),
+ startstring="closing_delimiters = (u'")
+
# test prints
+#
# ucharlists = unicode_charlists(unicode_punctuation_categories)
@@ -340,6 +387,8 @@
# print (u":%s: %s" % (cat, u''.join(chars))).encode('utf8')
# verbose print
+#
# print 'openers:'
Index: docs/ref/rst/restructuredtext.txt
===================================================================
--- docs/ref/rst/restructuredtext.txt (Revision 8010)
+++ docs/ref/rst/restructuredtext.txt (Arbeitskopie)
@@ -2442,11 +2442,11 @@
See `Escaping Mechanism`_ above for details.
5. If an inline markup start-string is immediately preceded by one of the
- ASCII characters ``' " < ( [ {`` or a similar Unicode character\ [#]_, it
- must not be followed by the corresponding closing character from ``' " )
- ] } >`` or a similar Unicode character\ [#]_.
- (For quotes, corresponding characters can be any of the `quotation marks
- in international usage`_.)
+ ASCII characters ``' " < ( [ {`` or a similar
+ Unicode character\ [#openers]_, it must not be followed by the
+ corresponding closing character from ``' " ) ] } >`` or a similar Unicode
+ character\ [#closers]_. (For quotes, matching characters can be any of
+ the `quotation marks in international usage`_.)
If the configuration setting `simple-inline-markup`_ is False (default),
additional conditions apply to the characters "around" the
@@ -2457,7 +2457,7 @@
* whitespace,
* one of the ASCII characters ``- : / ' " < ( [ {``
- * or a similar Unicode punctuation character.\ [#]_
+ * or a similar Unicode punctuation character.\ [#pre-chars]_
7. Inline markup end-strings must end a text block or be immediately
followed by
@@ -2464,18 +2464,24 @@
* whitespace,
* one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >``
- * or a similar Unicode punctuation character.\ [#]_
+ * or a similar Unicode punctuation character.\ [#post-chars]_
-.. [#] `Unicode categories`_ `Ps`, `Pi`, or `Pf`
-.. [#] Unicode categories `Pe`, `Pf`, or `Pi`
-.. [#] Unicode categories `Pd` (Dash), `Po` (Other), `Pi` (Initial quote),
- `Pf` (Final quote), or `Ps` (Open)
-.. [#] Unicode categories `Pd` (Dash), `Po` (Other), `Pi` (Initial quote),
- `Pf` (Final quote), or `Pe` (Close)
+.. [#openers] `Unicode categories`_ `Ps` (Open), `Pi` (Initial quote),
+ or `Pf` (Final quote).\ [#uni-version]_
+.. [#closers] Unicode categories `Pe` (Close), `Pi` (Initial quote),
+ or `Pf` (Final quote).\ [#uni-version]_
+.. [#pre-chars] Unicode categories `Ps` (Open), `Pi` (Initial quote),
+ `Pf` (Final quote), `Pd` (Dash), or `Po` (Other).\ [#uni-version]_
+.. [#post-chars] Unicode categories `Pe` (Close), `Pi` (Initial quote),
+ `Pf` (Final quote), `Pd` (Dash), or `Po` (Other).\ [#uni-version]_
+.. [#uni-version] The category of some characters changed with the
+ development of the Unicode standard.
+ Docutils 0.13 uses `Unicode version 5.2.0`_.
+
http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
-
+.. _Unicode version 5.2.0: http://www.unicode.org/Public/5.2.0/
http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage
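As a sanity check of the updated rule 7, the condition can be expressed directly with the `unicodedata` module (a sketch of the rule as specified above, not the parser's actual code):

```python
# Sketch of inline markup recognition rule 7: the character following an
# inline markup end-string must be whitespace, one of the listed ASCII
# characters, or a Unicode punctuation character in Pe, Pi, Pf, Pd, or Po.
import unicodedata

def allowed_after_end_string(char):
    return (char.isspace()
            or char in u'-.,:;!?\\/\'")]}>'
            or unicodedata.category(char) in ('Pe', 'Pi', 'Pf', 'Pd', 'Po'))
```

With a modern Unicode database the pilcrow sign (U+00B6, category Po since Unicode 6.0) passes this check, which is exactly the behavior the bug report asks for.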
Guenter Milde
2017-01-04 23:19:03 UTC
Permalink
Post by David Goodger
1. In docs/ref/rst/restructuredtext.txt, why are there escaped spaces
("\ ") before each footnote reference? That makes the resulting HTML
look weird. I recommend removing the backslash-escapes.
Done and committed.
Post by David Goodger
2. In docutils/utils/punctuation_chars.py, there are a lot of
module-level definitions, function definitions, and raw executable
code under the "if __name__ == '__main__':" guard. That smells really
bad. I recommend pulling everything out to the module level, putting
the module-level executable code in well-named functions, and reducing
the contents of "if __name__ == '__main__':" to a simple function call
(or two or three). (This is not specific to this patch, but to the
module itself. As it is, the module *must* be run from the command
line. It cannot be tested from another script or from the interactive
interpreter, which is bad.)
Currently, only the definitions required in Docutils are at module level.

The rest is only required to manually check and, if necessary,
update/regenerate the samples.

If you want them separate, I propose to split the module into 2 files:

punctuation_chars.py # the part before if __name__ == '__main__'

generate_punctuation_chars.py # script for (re)generating the samples and
test for differences between re-generated
and stored samples.

The latter would be a tool for developers to check and, if necessary,
update the former. Probably a candidate for tools/dev.


Günter


David Goodger
2017-01-05 02:04:46 UTC
Permalink
Post by Guenter Milde
Post by David Goodger
1. In docs/ref/rst/restructuredtext.txt, why are there escaped spaces
("\ ") before each footnote reference? That makes the resulting HTML
look weird. I recommend removing the backslash-escapes.
Done and committed.
Post by David Goodger
2. In docutils/utils/punctuation_chars.py, there are a lot of
module-level definitions, function definitions, and raw executable
code under the "if __name__ == '__main__':" guard. That smells really
bad. I recommend pulling everything out to the module level, putting
the module-level executable code in well-named functions, and reducing
the contents of "if __name__ == '__main__':" to a simple function call
(or two or three). (This is not specific to this patch, but to the
module itself. As it is, the module *must* be run from the command
line. It cannot be tested from another script or from the interactive
interpreter, which is bad.)
Currently, only the definitions required in Docutils are at module level.
The rest is only required to manually check and eventually
update/regenerate the samples.
punctuation_chars.py # the part before if __name__ == '__main__'
generate_punctuation_chars.py # script for (re)generating the samples and
test for differences between re-generated
and stored samples.
The latter would be a tool for developers to check and eventually update
the first. Probably a candidate for tools/dev.
That sounds good. It would be great if generate_punctuation_chars.py
produces the entirety of punctuation_chars.py, including all code,
comments (even the $Id$ slug), and docstrings. Then regenerating
punctuation_chars.py would require no manual editing.
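A minimal sketch of that approach (the template text and function name here are hypothetical, chosen only to illustrate the idea):

```python
# Hypothetical sketch: the generator emits the complete module text,
# header comments included, so regeneration needs no manual editing.
MODULE_TEMPLATE = """\
# :Id: $Id$
# This file is auto-generated by generate_punctuation_chars.py.
# Do not edit it by hand; regenerate it instead.
import sys

openers = %(openers)r
closers = %(closers)r
"""

def render_module(samples):
    """Return the full source text of punctuation_chars.py."""
    return MODULE_TEMPLATE % samples
```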

David Goodger
<http://python.net/~goodger>

Guenter Milde
2017-01-05 10:19:37 UTC
Permalink
Dear David,
Post by David Goodger
Post by Guenter Milde
punctuation_chars.py # the part before if __name__ == '__main__'
generate_punctuation_chars.py # script for (re)generating the samples and
test for differences between re-generated
and stored samples.
The latter would be a tool for developers to check and eventually update
the first. Probably a candidate for tools/dev.
That sounds good.
Done and committed.
Post by David Goodger
It would be great if generate_punctuation_chars.py
produces the entirety of punctuation_chars.py, including all code,
comments (even the $Id$ slug), and docstrings. Then regenerating
punctuation_chars.py would require no manual editing.
Good idea. I will work on this later.


Currently, the tool has two functions:

a) diagnosis/test

(useful for diagnosis and for documenting changes,
cf. the appended output for Py 3.5)

b) generating the character-class pattern definition code,

which should be separated, either by

i. moving into two tool files, or
ii. a command line argument or option.


What would you prefer?

Günter



Output for Py 3.5 / Unicode version 8.0.0:

new openers:
2308 LEFT CEILING
230a LEFT FLOOR
2e42 DOUBLE LOW-REVERSED-9 QUOTATION MARK
fd3f ORNATE RIGHT PARENTHESIS
removed openers:
fd3e ORNATE LEFT PARENTHESIS
new closers:
2309 RIGHT CEILING
230b RIGHT FLOOR
fd3e ORNATE LEFT PARENTHESIS
removed closers:
fd3f ORNATE RIGHT PARENTHESIS
new delimiters:
00a7 SECTION SIGN
00b6 PILCROW SIGN
085e MANDAIC PUNCTUATION
0af0 GUJARATI ABBREVIATION SIGN
0f14 TIBETAN MARK GTER TSHEG
0fd9 TIBETAN MARK LEADING MCHAN RTAGS
0fda TIBETAN MARK TRAILING MCHAN RTAGS
1360 ETHIOPIC SECTION MARK
1bfc BATAK SYMBOL BINDU NA METEK
1bff BATAK SYMBOL BINDU PANGOLAT
1cc0 SUNDANESE PUNCTUATION BINDU SURYA
1cc7 SUNDANESE PUNCTUATION BINDU BA SATANGA
2d70 TIFINAGH SEPARATOR MARK
2e41 REVERSED COMMA
a8fc DEVANAGARI SIGN SIDDHAM
aaf0 MEETEI MAYEK CHEIKHAN
aaf1 MEETEI MAYEK AHANG KHUDAM
10102 AEGEAN CHECK MARK
1056f CAUCASIAN ALBANIAN CITATION MARK
10af0 MANICHAEAN PUNCTUATION STAR
10af6 MANICHAEAN PUNCTUATION LINE FILLER
10b99 PSALTER PAHLAVI SECTION MARK
10b9c PSALTER PAHLAVI FOUR DOTS WITH DOT
11047 BRAHMI DANDA
1104d BRAHMI PUNCTUATION LOTUS
11140 CHAKMA SECTION MARK
11143 CHAKMA QUESTION MARK
11174 MAHAJANI ABBREVIATION SIGN
11175 MAHAJANI SECTION MARK
111c5 SHARADA DANDA
111c9 SHARADA SANDHI MARK
111cd SHARADA SUTRA MARK
111db SHARADA SIGN SIDDHAM
111dd SHARADA CONTINUATION SIGN
111df SHARADA SECTION MARK-2
11238 KHOJKI DANDA
1123d KHOJKI ABBREVIATION SIGN
112a9 MULTANI SECTION MARK
114c6 TIRHUTA ABBREVIATION SIGN
115c1 SIDDHAM SIGN SIDDHAM
115d7 SIDDHAM SECTION MARK WITH CIRCLES AND FOUR ENCLOSURES
11641 MODI DANDA
11643 MODI ABBREVIATION SIGN
1173c AHOM SIGN SMALL SECTION
1173e AHOM SIGN RULAI
12474 CUNEIFORM PUNCTUATION SIGN DIAGONAL QUADCOLON
16a6e MRO DANDA
16a6f MRO DOUBLE DANDA
16af5 BASSA VAH FULL STOP
16b37 PAHAWH HMONG SIGN VOS THOM
16b3b PAHAWH HMONG SIGN VOS FEEM
16b44 PAHAWH HMONG SIGN XAUS
1bc9f DUPLOYAN PUNCTUATION CHINOOK FULL STOP
1da87 SIGNWRITING COMMA
1da8b SIGNWRITING PARENTHESIS
removed delimiters:
1361 ETHIOPIC WORDSPACE
19de NEW TAI LUE SIGN LAE
19df NEW TAI LUE SIGN LAEV
2e31 WORD SEPARATOR MIDDLE DOT
10101 AEGEAN WORD SEPARATOR DOT
12473 CUNEIFORM PUNCTUATION SIGN DIAGONAL TRICOLON

# based on Unicode version 8.0.0
openers = (u'"\'(<\\[{\u0f3a\u0f3c\u169b\u2045\u207d\u208d\u2308\u230a'
u'\u2329\u2768\u276a\u276c\u276e\u2770\u2772\u2774\u27c5'
u'\u27e6\u27e8\u27ea\u27ec\u27ee\u2983\u2985\u2987\u2989'
u'\u298b\u298d\u298f\u2991\u2993\u2995\u2997\u29d8\u29da'
u'\u29fc\u2e22\u2e24\u2e26\u2e28\u2e42\u3008\u300a\u300c'
u'\u300e\u3010\u3014\u3016\u3018\u301a\u301d\u301d\ufd3f'
u'\ufe17\ufe35\ufe37\ufe39\ufe3b\ufe3d\ufe3f\ufe41\ufe43'
u'\ufe47\ufe59\ufe5b\ufe5d\uff08\uff3b\uff5b\uff5f\uff62'
u'\xab\u2018\u201c\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c\u2e20'
u'\u201a\u201e\xbb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d'
u'\u2e1d\u2e21\u201b\u201f')
closers = (u'"\')>\\]}\u0f3b\u0f3d\u169c\u2046\u207e\u208e\u2309\u230b'
u'\u232a\u2769\u276b\u276d\u276f\u2771\u2773\u2775\u27c6'
u'\u27e7\u27e9\u27eb\u27ed\u27ef\u2984\u2986\u2988\u298a'
u'\u298c\u298e\u2990\u2992\u2994\u2996\u2998\u29d9\u29db'
u'\u29fd\u2e23\u2e25\u2e27\u2e29\u3009\u300b\u300d\u300f'
u'\u3011\u3015\u3017\u3019\u301b\u301e\u301f\ufd3e\ufe18'
u'\ufe36\ufe38\ufe3a\ufe3c\ufe3e\ufe40\ufe42\ufe44\ufe48'
u'\ufe5a\ufe5c\ufe5e\uff09\uff3d\uff5d\uff60\uff63\xbb\u2019'
u'\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d\u2e21\u201b'
u'\u201f\xab\u2018\u201c\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c'
u'\u2e20\u201a\u201e')
delimiters = (u'\\-/:\u058a\xa1\xa7\xb6\xb7\xbf\u037e\u0387\u055a-'
u'\u055f\u0589\u05be\u05c0\u05c3\u05c6\u05f3\u05f4\u0609'
u'\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4'
u'\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964'
u'\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12'
u'\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f'
u'\u10fb\u1360-\u1368\u1400\u166d\u166e\u16eb-\u16ed'
u'\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u180a'
u'\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad'
u'\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f'
u'\u1cc0-\u1cc7\u1cd3\u2010-\u2017\u2020-\u2027\u2030-'
u'\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053'
u'\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00'
u'\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e1b\u2e1e\u2e1f'
u'\u2e2a-\u2e2e\u2e30-\u2e41\u3001-\u3003\u301c\u3030'
u'\u303d\u30a0\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e'
u'\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa'
u'\ua8fc\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-'
u'\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19'
u'\ufe30-\ufe32\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52'
u'\ufe54-\ufe58\ufe5f-\ufe61\ufe63\ufe68\ufe6a\ufe6b'
u'\uff01-\uff03\uff05-\uff07\uff0a\uff0c-\uff0f\uff1a'
u'\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65')
if sys.maxunicode >= 0x10FFFF: # "wide" build
delimiters += (u'\U00010100-\U00010102\U0001039f\U000103d0\U0001056f'
u'\U00010857\U0001091f\U0001093f\U00010a50-\U00010a58'
u'\U00010a7f\U00010af0-\U00010af6\U00010b39-\U00010b3f'
u'\U00010b99-\U00010b9c\U00011047-\U0001104d\U000110bb'
u'\U000110bc\U000110be-\U000110c1\U00011140-\U00011143'
u'\U00011174\U00011175\U000111c5-\U000111c9\U000111cd'
u'\U000111db\U000111dd-\U000111df\U00011238-\U0001123d'
u'\U000112a9\U000114c6\U000115c1-\U000115d7\U00011641-'
u'\U00011643\U0001173c-\U0001173e\U00012470-\U00012474'
u'\U00016a6e\U00016a6f\U00016af5\U00016b37-\U00016b3b'
u'\U00016b44\U0001bc9f\U0001da87-\U0001da8b')
closing_delimiters = (u'\\\\.,;!?')


David Goodger
2017-01-16 21:51:38 UTC
Permalink
Post by Guenter Milde
Dear David,
Post by David Goodger
Post by Guenter Milde
punctuation_chars.py # the part before if _name_ == '__main__'
generate_punctuation_chars.py # script for (re)generating the samples and
test for differences between re-generated
and stored samples.
The latter would be a tool for developers to check and eventually update
the first. Probably a candidate for tools/dev.
That sounds good.
Done and committed.
Great, thanks!
Post by Guenter Milde
Post by David Goodger
It would be great if generate_punctuation_chars.py
produces the entirety of punctuation_chars.py, including all code,
comments (even the $Id$ slug), and docstrings. Then regenerating
punctuation_chars.py would require no manual editing.
Good idea. I will work in this later.
Currently, the tool has 2 functions,
a) diagnosis/test
(usefull for diagnosis and documenting changes,
cf appended output for Py 3.5)
b) generate character class pattern definition code,
which should be separated, by
i. moving into 2 tool files
ii. command line argument or option
What would you prefer?
I don't have a strong opinion. My gut feeling is that it would be
better to keep it in one tool. Since you already named it
generate_punctuation_chars.py, I'd add a -t/--test or -c/--changes
option to that tool.
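That single-tool interface could look roughly like this (the option names are from the suggestion above; everything else is hypothetical):

```python
# Hypothetical command-line interface for generate_punctuation_chars.py:
# the default action generates the definitions, while -t/--test only
# reports differences against the stored samples.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description='(Re)generate the punctuation_chars.py sample strings.')
    parser.add_argument('-t', '--test', action='store_true',
                        help='compare regenerated samples against the '
                             'stored ones instead of printing new code')
    return parser
```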

David Goodger
<http://python.net/~goodger>
Edward d'Auvergne
2017-01-02 10:01:49 UTC
Permalink
Post by Guenter Milde
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign # changed,
...
Post by Edward d'Auvergne
Post by Guenter Milde
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
[snip]
Post by Guenter Milde
Post by Edward d'Auvergne
For the reStructuredText standard, it would be
logical to pin it to the latest Unicode standard (rather than be tied
down based on the docutils implementation being restricted by the
underlying Python install).
However, rather than specifying "the latest", we should state a specific
version in the docs and update it together with the actual code.
Why? We don't track the Unicode standard development and auto-update our
patterns.
https://docs.python.org/3/library/codecs.html
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification. Why should alternative implementations be bound to the
internal details of the docutils implementation?
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs idenically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
version number in the reStructuredText specification, based on the
current state of the docutils code, it affects the external
aspects in a significant way.
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example) - is solely
for docutils, as the reST specification does not mandate a Unicode
version and hence the latest version would be a reasonable assumption.

Regards,

Edward

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Docutils-develop mailing list
Docutils-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/docutils-develop

Please use "Reply All" to reply to the list.
Günter Milde
2016-12-30 21:04:30 UTC
Permalink
Running the self-test in the Python-3 version of utils/punctuation_chars.py shows that there are some more changes.
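Such differences come from the Unicode version baked into each interpreter's unicodedata tables. A quick, generic way to see which version a given Python provides (not part of punctuation_chars.py itself):

```python
import unicodedata

# Each CPython release ships unicodedata tables for one specific Unicode
# version; lists generated on different interpreters can therefore differ.
print(unicodedata.unidata_version)
```

Running the self-test under interpreters that report different versions here is exactly the situation that produces "some more changes".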


---

** [bugs:#305] Inline markup end strings and the Unicode PO character set (rule 7 breakage).**

**Status:** open
**Group:** Default
**Created:** Fri Dec 30, 2016 12:10 PM UTC by Edward d'Auvergne
**Last Updated:** Fri Dec 30, 2016 09:01 PM UTC
**Owner:** nobody


The following text reveals a bug in docutils::

``name``¶

The pilcrow sign belongs to the [Unicode PO class](http://www.fileformat.info/info/unicode/category/Po/list.htm). According to Inline markup recognition rule 7, this is an allowed character after the inline markup end string. However according to rstcheck, and the other rst converters, this is broken::

test.rst:1: (WARNING/2) Inline literal start-string without end-string.


---

Sent from sourceforge.net because docutils-***@lists.sourceforge.net is subscribed to https://sourceforge.net/p/docutils/bugs/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/docutils/admin/bugs/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
Edward d'Auvergne
2016-12-30 21:21:20 UTC
Permalink
Python's Unicode support is terrible - just try textwrap.wrap() to see how the zero-width space and double-width characters (Chinese, for example - 䞀二䞉四五六䞃八九十) are handled. Why not just hard-code the Unicode categories into lists within a docutils module? That gives the flexibility of preformatting the lists into the internal docutils text format for faster lookups.


Günter Milde
2016-12-31 09:40:05 UTC
Permalink
Did you have a look at docutils/utils/punctuation_chars.py? It already does what you propose.
However, the lists are generated using "unicodedata" (see the second part of punctuation_chars.py):

-1: currently, outdated data from Python 2.7's "unicodedata" is used.
+1: updating is simple and predictable when "unicodedata" is updated.
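For reference, the generation step described above can be sketched like this (a minimal standalone version, not the actual punctuation_chars.py code):

```python
import sys
import unicodedata

def chars_in_category(category):
    """Collect every code point whose Unicode general category matches,
    according to this interpreter's unicodedata tables."""
    return ''.join(chr(cp) for cp in range(sys.maxunicode + 1)
                   if unicodedata.category(chr(cp)) == category)

po_chars = chars_in_category('Po')
assert '\u00B6' in po_chars  # the pilcrow sign from the bug report
```

Because the output depends on the interpreter's unicodedata version, the generated strings are frozen into the module rather than computed at import time.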


Edward d'Auvergne
2016-12-30 21:32:17 UTC
Permalink
For example, this is what I use:

~~~
# Unicode category Pd (Dash).
UNICODE_PD = [
u'\u002D', u'\u058A', u'\u05BE', u'\u1400', u'\u1806', u'\u2010', u'\u2011', u'\u2012', u'\u2013', u'\u2014',
u'\u2015', u'\u2E17', u'\u2E1A', u'\u2E3A', u'\u2E3B', u'\u2E40', u'\u301C', u'\u3030', u'\u30A0', u'\uFE31',
u'\uFE32', u'\uFE58', u'\uFE63', u'\uFF0D'
]

# Unicode category Po (Other).
UNICODE_PO = [
u'\u0021', u'\u0022', u'\u0023', u'\u0025', u'\u0026', u'\u0027', u'\u002A', u'\u002C', u'\u002E', u'\u002F',
u'\u003A', u'\u003B', u'\u003F', u'\u0040', u'\u005C', u'\u00A1', u'\u00A7', u'\u00B6', u'\u00B7', u'\u00BF',
u'\u037E', u'\u0387', u'\u055A', u'\u055B', u'\u055C', u'\u055D', u'\u055E', u'\u055F', u'\u0589', u'\u05C0',
u'\u05C3', u'\u05C6', u'\u05F3', u'\u05F4', u'\u0609', u'\u060A', u'\u060C', u'\u060D', u'\u061B', u'\u061E',
u'\u061F', u'\u066A', u'\u066B', u'\u066C', u'\u066D', u'\u06D4', u'\u0700', u'\u0701', u'\u0702', u'\u0703',
u'\u0704', u'\u0705', u'\u0706', u'\u0707', u'\u0708', u'\u0709', u'\u070A', u'\u070B', u'\u070C', u'\u070D',
u'\u07F7', u'\u07F8', u'\u07F9', u'\u0830', u'\u0831', u'\u0832', u'\u0833', u'\u0834', u'\u0835', u'\u0836',
u'\u0837', u'\u0838', u'\u0839', u'\u083A', u'\u083B', u'\u083C', u'\u083D', u'\u083E', u'\u085E', u'\u0964',
u'\u0965', u'\u0970', u'\u0AF0', u'\u0DF4', u'\u0E4F', u'\u0E5A', u'\u0E5B', u'\u0F04', u'\u0F05', u'\u0F06',
u'\u0F07', u'\u0F08', u'\u0F09', u'\u0F0A', u'\u0F0B', u'\u0F0C', u'\u0F0D', u'\u0F0E', u'\u0F0F', u'\u0F10',
u'\u0F11', u'\u0F12', u'\u0F14', u'\u0F85', u'\u0FD0', u'\u0FD1', u'\u0FD2', u'\u0FD3', u'\u0FD4', u'\u0FD9',
u'\u0FDA', u'\u104A', u'\u104B', u'\u104C', u'\u104D', u'\u104E', u'\u104F', u'\u10FB', u'\u1360', u'\u1361',
u'\u1362', u'\u1363', u'\u1364', u'\u1365', u'\u1366', u'\u1367', u'\u1368', u'\u166D', u'\u166E', u'\u16EB',
u'\u16EC', u'\u16ED', u'\u1735', u'\u1736', u'\u17D4', u'\u17D5', u'\u17D6', u'\u17D8', u'\u17D9', u'\u17DA',
u'\u1800', u'\u1801', u'\u1802', u'\u1803', u'\u1804', u'\u1805', u'\u1807', u'\u1808', u'\u1809', u'\u180A',
u'\u1944', u'\u1945', u'\u1A1E', u'\u1A1F', u'\u1AA0', u'\u1AA1', u'\u1AA2', u'\u1AA3', u'\u1AA4', u'\u1AA5',
u'\u1AA6', u'\u1AA8', u'\u1AA9', u'\u1AAA', u'\u1AAB', u'\u1AAC', u'\u1AAD', u'\u1B5A', u'\u1B5B', u'\u1B5C',
u'\u1B5D', u'\u1B5E', u'\u1B5F', u'\u1B60', u'\u1BFC', u'\u1BFD', u'\u1BFE', u'\u1BFF', u'\u1C3B', u'\u1C3C',
u'\u1C3D', u'\u1C3E', u'\u1C3F', u'\u1C7E', u'\u1C7F', u'\u1CC0', u'\u1CC1', u'\u1CC2', u'\u1CC3', u'\u1CC4',
u'\u1CC5', u'\u1CC6', u'\u1CC7', u'\u1CD3', u'\u2016', u'\u2017', u'\u2020', u'\u2021', u'\u2022', u'\u2023',
u'\u2024', u'\u2025', u'\u2026', u'\u2027', u'\u2030', u'\u2031', u'\u2032', u'\u2033', u'\u2034', u'\u2035',
u'\u2036', u'\u2037', u'\u2038', u'\u203B', u'\u203C', u'\u203D', u'\u203E', u'\u2041', u'\u2042', u'\u2043',
u'\u2047', u'\u2048', u'\u2049', u'\u204A', u'\u204B', u'\u204C', u'\u204D', u'\u204E', u'\u204F', u'\u2050',
u'\u2051', u'\u2053', u'\u2055', u'\u2056', u'\u2057', u'\u2058', u'\u2059', u'\u205A', u'\u205B', u'\u205C',
u'\u205D', u'\u205E', u'\u2CF9', u'\u2CFA', u'\u2CFB', u'\u2CFC', u'\u2CFE', u'\u2CFF', u'\u2D70', u'\u2E00',
u'\u2E01', u'\u2E06', u'\u2E07', u'\u2E08', u'\u2E0B', u'\u2E0E', u'\u2E0F', u'\u2E10', u'\u2E11', u'\u2E12',
u'\u2E13', u'\u2E14', u'\u2E15', u'\u2E16', u'\u2E18', u'\u2E19', u'\u2E1B', u'\u2E1E', u'\u2E1F', u'\u2E2A',
u'\u2E2B', u'\u2E2C', u'\u2E2D', u'\u2E2E', u'\u2E30', u'\u2E31', u'\u2E32', u'\u2E33', u'\u2E34', u'\u2E35',
u'\u2E36', u'\u2E37', u'\u2E38', u'\u2E39', u'\u2E3C', u'\u2E3D', u'\u2E3E', u'\u2E3F', u'\u2E41', u'\u2E43',
u'\u2E44', u'\u3001', u'\u3002', u'\u3003', u'\u303D', u'\u30FB', u'\uA4FE', u'\uA4FF', u'\uA60D', u'\uA60E',
u'\uA60F', u'\uA673', u'\uA67E', u'\uA6F2', u'\uA6F3', u'\uA6F4', u'\uA6F5', u'\uA6F6', u'\uA6F7', u'\uA874',
u'\uA875', u'\uA876', u'\uA877', u'\uA8CE', u'\uA8CF', u'\uA8F8', u'\uA8F9', u'\uA8FA', u'\uA8FC', u'\uA92E',
u'\uA92F', u'\uA95F', u'\uA9C1', u'\uA9C2', u'\uA9C3', u'\uA9C4', u'\uA9C5', u'\uA9C6', u'\uA9C7', u'\uA9C8',
u'\uA9C9', u'\uA9CA', u'\uA9CB', u'\uA9CC', u'\uA9CD', u'\uA9DE', u'\uA9DF', u'\uAA5C', u'\uAA5D', u'\uAA5E',
u'\uAA5F', u'\uAADE', u'\uAADF', u'\uAAF0', u'\uAAF1', u'\uABEB', u'\uFE10', u'\uFE11', u'\uFE12', u'\uFE13',
u'\uFE14', u'\uFE15', u'\uFE16', u'\uFE19', u'\uFE30', u'\uFE45', u'\uFE46', u'\uFE49', u'\uFE4A', u'\uFE4B',
u'\uFE4C', u'\uFE50', u'\uFE51', u'\uFE52', u'\uFE54', u'\uFE55', u'\uFE56', u'\uFE57', u'\uFE5F', u'\uFE60',
u'\uFE61', u'\uFE68', u'\uFE6A', u'\uFE6B', u'\uFF01', u'\uFF02', u'\uFF03', u'\uFF05', u'\uFF06', u'\uFF07',
u'\uFF0A', u'\uFF0C', u'\uFF0E', u'\uFF0F', u'\uFF1A', u'\uFF1B', u'\uFF1F', u'\uFF20', u'\uFF3C', u'\uFF61',
# NB: code points above U+FFFF need 8-digit \U escapes; 4-digit \u escapes cannot encode them.
u'\uFF64', u'\uFF65', u'\U00010100', u'\U00010101', u'\U00010102', u'\U0001039F', u'\U000103D0', u'\U0001056F', u'\U00010857', u'\U0001091F',
u'\U0001093F', u'\U00010A50', u'\U00010A51', u'\U00010A52', u'\U00010A53', u'\U00010A54', u'\U00010A55', u'\U00010A56', u'\U00010A57', u'\U00010A58',
u'\U00010A7F', u'\U00010AF0', u'\U00010AF1', u'\U00010AF2', u'\U00010AF3', u'\U00010AF4', u'\U00010AF5', u'\U00010AF6', u'\U00010B39', u'\U00010B3A',
u'\U00010B3B', u'\U00010B3C', u'\U00010B3D', u'\U00010B3E', u'\U00010B3F', u'\U00010B99', u'\U00010B9A', u'\U00010B9B', u'\U00010B9C', u'\U00011047',
u'\U00011048', u'\U00011049', u'\U0001104A', u'\U0001104B', u'\U0001104C', u'\U0001104D', u'\U000110BB', u'\U000110BC', u'\U000110BE', u'\U000110BF',
u'\U000110C0', u'\U000110C1', u'\U00011140', u'\U00011141', u'\U00011142', u'\U00011143', u'\U00011174', u'\U00011175', u'\U000111C5', u'\U000111C6',
u'\U000111C7', u'\U000111C8', u'\U000111C9', u'\U000111CD', u'\U000111DB', u'\U000111DD', u'\U000111DE', u'\U000111DF', u'\U00011238', u'\U00011239',
u'\U0001123A', u'\U0001123B', u'\U0001123C', u'\U0001123D', u'\U000112A9', u'\U0001144B', u'\U0001144C', u'\U0001144D', u'\U0001144E', u'\U0001144F',
u'\U0001145B', u'\U0001145D', u'\U000114C6', u'\U000115C1', u'\U000115C2', u'\U000115C3', u'\U000115C4', u'\U000115C5', u'\U000115C6', u'\U000115C7',
u'\U000115C8', u'\U000115C9', u'\U000115CA', u'\U000115CB', u'\U000115CC', u'\U000115CD', u'\U000115CE', u'\U000115CF', u'\U000115D0', u'\U000115D1',
u'\U000115D2', u'\U000115D3', u'\U000115D4', u'\U000115D5', u'\U000115D6', u'\U000115D7', u'\U00011641', u'\U00011642', u'\U00011643', u'\U00011660',
u'\U00011661', u'\U00011662', u'\U00011663', u'\U00011664', u'\U00011665', u'\U00011666', u'\U00011667', u'\U00011668', u'\U00011669', u'\U0001166A',
u'\U0001166B', u'\U0001166C', u'\U0001173C', u'\U0001173D', u'\U0001173E', u'\U00011C41', u'\U00011C42', u'\U00011C43', u'\U00011C44', u'\U00011C45',
u'\U00011C70', u'\U00011C71', u'\U00012470', u'\U00012471', u'\U00012472', u'\U00012473', u'\U00012474', u'\U00016A6E', u'\U00016A6F', u'\U00016AF5',
u'\U00016B37', u'\U00016B38', u'\U00016B39', u'\U00016B3A', u'\U00016B3B', u'\U00016B44', u'\U0001BC9F', u'\U0001DA87', u'\U0001DA88', u'\U0001DA89',
u'\U0001DA8A', u'\U0001DA8B', u'\U0001E95E', u'\U0001E95F'
]

# Unicode category Pi (Initial quote).
UNICODE_PI = [
u'\u00AB', u'\u2018', u'\u201B', u'\u201C', u'\u201F', u'\u2039', u'\u2E02', u'\u2E04', u'\u2E09', u'\u2E0C',
u'\u2E1C', u'\u2E20'
]

# Unicode category Pf (Final quote).
UNICODE_PF = [
u'\u00BB', u'\u2019', u'\u201D', u'\u203A', u'\u2E03', u'\u2E05', u'\u2E0A', u'\u2E0D', u'\u2E1D', u'\u2E21'
]

# Unicode category Pe (Close).
UNICODE_PE = [
u'\u0029', u'\u005D', u'\u007D', u'\u0F3B', u'\u0F3D', u'\u169C', u'\u2046', u'\u207E', u'\u208E', u'\u2309',
u'\u230B', u'\u232A', u'\u2769', u'\u276B', u'\u276D', u'\u276F', u'\u2771', u'\u2773', u'\u2775', u'\u27C6',
u'\u27E7', u'\u27E9', u'\u27EB', u'\u27ED', u'\u27EF', u'\u2984', u'\u2986', u'\u2988', u'\u298A', u'\u298C',
u'\u298E', u'\u2990', u'\u2992', u'\u2994', u'\u2996', u'\u2998', u'\u29D9', u'\u29DB', u'\u29FD', u'\u2E23',
u'\u2E25', u'\u2E27', u'\u2E29', u'\u3009', u'\u300B', u'\u300D', u'\u300F', u'\u3011', u'\u3015', u'\u3017',
u'\u3019', u'\u301B', u'\u301E', u'\u301F', u'\uFD3E', u'\uFE18', u'\uFE36', u'\uFE38', u'\uFE3A', u'\uFE3C',
u'\uFE3E', u'\uFE40', u'\uFE42', u'\uFE44', u'\uFE48', u'\uFE5A', u'\uFE5C', u'\uFE5E', u'\uFF09', u'\uFF3D',
u'\uFF5D', u'\uFF60', u'\uFF63'
]

# Unicode category Ps (Open).
UNICODE_PS = [
u'\u0028', u'\u005B', u'\u007B', u'\u0F3A', u'\u0F3C', u'\u169B', u'\u201A', u'\u201E', u'\u2045', u'\u207D',
u'\u208D', u'\u2308', u'\u230A', u'\u2329', u'\u2768', u'\u276A', u'\u276C', u'\u276E', u'\u2770', u'\u2772',
u'\u2774', u'\u27C5', u'\u27E6', u'\u27E8', u'\u27EA', u'\u27EC', u'\u27EE', u'\u2983', u'\u2985', u'\u2987',
u'\u2989', u'\u298B', u'\u298D', u'\u298F', u'\u2991', u'\u2993', u'\u2995', u'\u2997', u'\u29D8', u'\u29DA',
u'\u29FC', u'\u2E22', u'\u2E24', u'\u2E26', u'\u2E28', u'\u2E42', u'\u3008', u'\u300A', u'\u300C', u'\u300E',
u'\u3010', u'\u3014', u'\u3016', u'\u3018', u'\u301A', u'\u301D', u'\uFD3F', u'\uFE17', u'\uFE35', u'\uFE37',
u'\uFE39', u'\uFE3B', u'\uFE3D', u'\uFE3F', u'\uFE41', u'\uFE43', u'\uFE47', u'\uFE59', u'\uFE5B', u'\uFE5D',
u'\uFF08', u'\uFF3B', u'\uFF5B', u'\uFF5F', u'\uFF62'
]
~~~
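To get the fast lookups mentioned above, such lists can be compiled once into a set or a regex character class. A minimal sketch with abbreviated sample data (the names and approach are illustrative, not docutils internals):

```python
import re

# Abbreviated samples of the hard-coded category lists shown above.
UNICODE_PO = ['\u0021', '\u002C', '\u002E', '\u00B6']   # Po (Other)
UNICODE_PE = ['\u0029', '\u005D', '\u007D']             # Pe (Close)

# Option 1: O(1) membership tests via a frozenset.
END_FOLLOWERS = frozenset(UNICODE_PO + UNICODE_PE)

# Option 2: one precompiled character class for use inside larger patterns.
END_FOLLOWER_RE = re.compile('[%s]' % re.escape(''.join(END_FOLLOWERS)))

assert '\u00B6' in END_FOLLOWERS       # pilcrow allowed after an end string
assert END_FOLLOWER_RE.match('\u00B6')
```

Either form avoids rescanning the raw lists on every inline-markup check, which is what makes the precomputation worthwhile.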


Edward d'Auvergne
2016-12-31 08:53:47 UTC
Permalink
Being a list from a published standard, that code would be non-copyrightable (i.e. public domain), so anyone can use it.


Günter Milde
2017-01-05 14:14:18 UTC
Permalink
- **status**: open --> closed-fixed
- **Comment**:

After intensive discussion on the docutils-devel list, this bug is fixed by referencing the Unicode standard version used in the reference application.

* For stability, no rst parser should "silently" update the Unicode standard version used to parse inline markup, as this may break existing rst documents.

* For practicability and stability, Docutils uses pre-generated patterns based on Python's "unicodedata" stdlib module.

File an enhancement request if you want an updated version of the pre-generated patterns in one of the next "feature releases". (However, for efficiency we will not support Unicode standard versions newer than what Python's unicodedata supports.)


