Discussion:
[Docutils-develop] [docutils:bugs] #305 Inline markup end strings and the Unicode PO character set (rule 7 breakage).
Edward d'Auvergne
2016-12-30 12:10:59 UTC
Permalink
---

** [bugs:#305] Inline markup end strings and the Unicode PO character set (rule 7 breakage).**

**Status:** open
**Group:** Default
**Created:** Fri Dec 30, 2016 12:10 PM UTC by Edward d'Auvergne
**Last Updated:** Fri Dec 30, 2016 12:10 PM UTC
**Owner:** nobody


The following text reveals a bug in docutils::

``name``¶

The pilcrow sign belongs to the [Unicode Po class](http://www.fileformat.info/info/unicode/category/Po/list.htm). According to Inline markup recognition rule 7, this is an allowed character after the inline markup end string. However, according to rstcheck and the other reST converters, this is broken::

test.rst:1: (WARNING/2) Inline literal start-string without end-string.
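The category membership can be checked directly against Python 3's own Unicode database:

```python
import unicodedata

# U+00B6 PILCROW SIGN is "Po" (Punctuation, other) in the Unicode
# data shipped with Python 3, i.e. the class that inline markup
# recognition rule 7 allows after an inline markup end string.
print(unicodedata.name("¶"))       # -> PILCROW SIGN
print(unicodedata.category("¶"))   # -> Po
```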


---

Sent from sourceforge.net because docutils-***@lists.sourceforge.net is subscribed to https://sourceforge.net/p/docutils/bugs/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/docutils/admin/bugs/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
Edward d'Auvergne
2016-12-30 12:11:46 UTC
Permalink
Let me try that text formatting again:

~~~
``name``¶
~~~


Günter Milde
2016-12-30 21:01:37 UTC
Permalink
Thank you for reporting the problem.

The reason for the current behaviour is an outdated Unicode database in Python 2.7's "unicodedata" module:

Docutils uses the "unicodedata" Python module to generate the lists of characters allowed around inline markup in utils/punctuation_chars.py.

The character class of the pilcrow sign ¶ changed:
~~~~~
#> python3
Python 3.5.2+ (default, Nov 22 2016, 01:00:20)
>>> import unicodedata
>>> unicodedata.category('¶')
'Po'
#> python
Python 2.7.12+ (default, Nov 22 2016, 00:48:54)
>>> import unicodedata
>>> unicodedata.category(u'¶')
'So'
~~~~~
Python 2.7 uses UnicodeData.txt version 5.2.0
(ftp://ftp.unicode.org/Public/5.2.0/ucd/UnicodeData.txt). This file says:
~~~~~
00B6;PILCROW SIGN;So;0;ON;;;;;N;PARAGRAPH SIGN;;;;
~~~~~
while the current version of UnicodeData.txt (9.0.0) says:
~~~~~
00B6;PILCROW SIGN;Po;0;ON;;;;;N;PARAGRAPH SIGN;;;;
~~~~~
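How such a category change propagates into the generated lists can be sketched in a few lines; this is only an illustration of the approach, not the actual generator code in punctuation_chars.py:

```python
import unicodedata

def build_char_classes(limit=0x3000):
    """Group code points by Unicode general category, roughly as the
    punctuation_chars.py generator does (illustrative sketch only)."""
    classes = {"openers": "", "closers": "", "delimiters": ""}
    for code in range(limit):
        char = chr(code)
        category = unicodedata.category(char)
        if category in ("Ps", "Pi"):    # open and initial-quote punctuation
            classes["openers"] += char
        elif category in ("Pe", "Pf"):  # close and final-quote punctuation
            classes["closers"] += char
        elif category in ("Pd", "Po"):  # dash and other punctuation
            classes["delimiters"] += char
    return classes

# With Python 3's database the pilcrow is "Po", so it lands among the
# delimiters allowed around inline markup; with Python 2.7's database
# ("So") it is silently excluded.
print("¶" in build_char_classes()["delimiters"])   # -> True under Python 3
```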





Guenter Milde
2016-12-31 09:54:49 UTC
Permalink
Post by Günter Milde
``name``¶
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
The reason for the current behaviour is an outdated database in the
Docutils uses the "unicodedata" Python module to generate the lists of
characters allowed around inline markup in utils/punctuation_chars.py.
The character class of the pilcrow sign ¶ changed,
some other characters changed, too. Testing with Py3 says::


~~~~~
- openers = ur"""b'"\'(<\\[{\xe0\xbc\xba\xe0\xbc\xbc\xe1\x9a\x9b\xe2\x81\x85\xe2\x81\xbd\xe2\x82\x8d\xe2\x8c\xa9\xe2\x9d\xa8\xe2\x9d\xaa\xe2\x9d\xac\xe2\x9d\xae\xe2\x9d\xb0\xe2\x9d\xb2\xe2\x9d\xb4\xe2\x9f\x85\xe2\x9f\xa6\xe2\x9f\xa8\xe2\x9f\xaa\xe2\x9f\xac\xe2\x9f\xae\xe2\xa6\x83\xe2\xa6\x85\xe2\xa6\x87\xe2\xa6\x89\xe2\xa6\x8b\xe2\xa6\x8d\xe2\xa6\x8f\xe2\xa6\x91\xe2\xa6\x93\xe2\xa6\x95\xe2\xa6\x97\xe2\xa7\x98\xe2\xa7\x9a\xe2\xa7\xbc\xe2\xb8\xa2\xe2\xb8\xa4\xe2\xb8\xa6\xe2\xb8\xa8\xe3\x80\x88\xe3\x80\x8a\xe3\x80\x8c\xe3\x80\x8e\xe3\x80\x90\xe3\x80\x94\xe3\x80\x96\xe3\x80\x98\xe3\x80\x9a\xe3\x80\x9d\xe3\x80\x9d\xef\xb4\xbe\xef\xb8\x97\xef\xb8\xb5\xef\xb8\xb7\xef\xb8\xb9\xef\xb8\xbb\xef\xb8\xbd\xef\xb8\xbf\xef\xb9\x81\xef\xb9\x83\xef\xb9\x87\xef\xb9\x99\xef\xb9\x9b\xef\xb9\x9d\xef\xbc\x88\xef\xbc\xbb\xef\xbd\x9b\xef\xbd\x9f\xef\xbd\xa2\xc2\xab\xe2\x80\x98\xe2\x80\x9c\xe2\x80\xb9\xe2\xb8\x82\xe2\xb8\x84\xe2\xb8\x89\xe2\xb8\x8c\xe2\xb8\x9c\xe2\xb8\xa0\xe2\x80\x9a\xe2\x80\x9e\xc2\xbb\xe2\x80\x99\xe2\x80\x9d\xe2\x80\xba\xe2\xb8\x83\xe2\xb8\x85\xe2\xb8\x8a\xe2\xb8\x8d\xe2\xb8\x9d\xe2\xb8\xa1\xe2\x80\x9b\xe2\x80\x9f'"""
+ openers = ur"""b'"\'(<\\[{\xe0\xbc\xba\xe0\xbc\xbc\xe1\x9a\x9b\xe2\x81\x85\xe2\x81\xbd\xe2\x82\x8d\xe2\x8c\x88\xe2\x8c\x8a\xe2\x8c\xa9\xe2\x9d\xa8\xe2\x9d\xaa\xe2\x9d\xac\xe2\x9d\xae\xe2\x9d\xb0\xe2\x9d\xb2\xe2\x9d\xb4\xe2\x9f\x85\xe2\x9f\xa6\xe2\x9f\xa8\xe2\x9f\xaa\xe2\x9f\xac\xe2\x9f\xae\xe2\xa6\x83\xe2\xa6\x85\xe2\xa6\x87\xe2\xa6\x89\xe2\xa6\x8b\xe2\xa6\x8d\xe2\xa6\x8f\xe2\xa6\x91\xe2\xa6\x93\xe2\xa6\x95\xe2\xa6\x97\xe2\xa7\x98\xe2\xa7\x9a\xe2\xa7\xbc\xe2\xb8\xa2\xe2\xb8\xa4\xe2\xb8\xa6\xe2\xb8\xa8\xe2\xb9\x82\xe3\x80\x88\xe3\x80\x8a\xe3\x80\x8c\xe3\x80\x8e\xe3\x80\x90\xe3\x80\x94\xe3\x80\x96\xe3\x80\x98\xe3\x80\x9a\xe3\x80\x9d\xe3\x80\x9d\xef\xb4\xbf\xef\xb8\x97\xef\xb8\xb5\xef\xb8\xb7\xef\xb8\xb9\xef\xb8\xbb\xef\xb8\xbd\xef\xb8\xbf\xef\xb9\x81\xef\xb9\x83\xef\xb9\x87\xef\xb9\x99\xef\xb9\x9b\xef\xb9\x9d\xef\xbc\x88\xef\xbc\xbb\xef\xbd\x9b\xef\xbd\x9f\xef\xbd\xa2\xc2\xab\xe2\x80\x98\xe2\x80\x9c\xe2\x80\xb9\xe2\xb8\x82\xe2\xb8\x84\xe2\xb8\x89\xe2\xb8\x8c\xe2\xb8\x9c\xe2\xb8\xa0\xe2\x80\x9a\xe2\x80\x9e\xc2\xbb\xe2\x80\x99\xe2\x80\x9d\xe2\x80\xba\xe2\xb8\x83\xe2\xb8\x85\xe2\xb8\x8a\xe2\xb8\x8d\xe2\xb8\x9d\xe2\xb8\xa1\xe2\x80\x9b\xe2\x80\x9f'"""
- closers = ur"""b'"\')>\\]}\xe0\xbc\xbb\xe0\xbc\xbd\xe1\x9a\x9c\xe2\x81\x86\xe2\x81\xbe\xe2\x82\x8e\xe2\x8c\xaa\xe2\x9d\xa9\xe2\x9d\xab\xe2\x9d\xad\xe2\x9d\xaf\xe2\x9d\xb1\xe2\x9d\xb3\xe2\x9d\xb5\xe2\x9f\x86\xe2\x9f\xa7\xe2\x9f\xa9\xe2\x9f\xab\xe2\x9f\xad\xe2\x9f\xaf\xe2\xa6\x84\xe2\xa6\x86\xe2\xa6\x88\xe2\xa6\x8a\xe2\xa6\x8c\xe2\xa6\x8e\xe2\xa6\x90\xe2\xa6\x92\xe2\xa6\x94\xe2\xa6\x96\xe2\xa6\x98\xe2\xa7\x99\xe2\xa7\x9b\xe2\xa7\xbd\xe2\xb8\xa3\xe2\xb8\xa5\xe2\xb8\xa7\xe2\xb8\xa9\xe3\x80\x89\xe3\x80\x8b\xe3\x80\x8d\xe3\x80\x8f\xe3\x80\x91\xe3\x80\x95\xe3\x80\x97\xe3\x80\x99\xe3\x80\x9b\xe3\x80\x9e\xe3\x80\x9f\xef\xb4\xbf\xef\xb8\x98\xef\xb8\xb6\xef\xb8\xb8\xef\xb8\xba\xef\xb8\xbc\xef\xb8\xbe\xef\xb9\x80\xef\xb9\x82\xef\xb9\x84\xef\xb9\x88\xef\xb9\x9a\xef\xb9\x9c\xef\xb9\x9e\xef\xbc\x89\xef\xbc\xbd\xef\xbd\x9d\xef\xbd\xa0\xef\xbd\xa3\xc2\xbb\xe2\x80\x99\xe2\x80\x9d\xe2\x80\xba\xe2\xb8\x83\xe2\xb8\x85\xe2\xb8\x8a\xe2\xb8\x8d\xe2\xb8\x9d\xe2\xb8\xa1\xe2\x80\x9b\xe2\x80\x9f\xc2\xab\xe2\x80\x98\xe2\x80\x9c\xe2\x80\xb9\xe2\xb8\x82\xe2\xb8\x84\xe2\xb8\x89\xe2\xb8\x8c\xe2\xb8\x9c\xe2\xb8\xa0\xe2\x80\x9a\xe2\x80\x9e'"""
+ closers = ur"""b'"\')>\\]}\xe0\xbc\xbb\xe0\xbc\xbd\xe1\x9a\x9c\xe2\x81\x86\xe2\x81\xbe\xe2\x82\x8e\xe2\x8c\x89\xe2\x8c\x8b\xe2\x8c\xaa\xe2\x9d\xa9\xe2\x9d\xab\xe2\x9d\xad\xe2\x9d\xaf\xe2\x9d\xb1\xe2\x9d\xb3\xe2\x9d\xb5\xe2\x9f\x86\xe2\x9f\xa7\xe2\x9f\xa9\xe2\x9f\xab\xe2\x9f\xad\xe2\x9f\xaf\xe2\xa6\x84\xe2\xa6\x86\xe2\xa6\x88\xe2\xa6\x8a\xe2\xa6\x8c\xe2\xa6\x8e\xe2\xa6\x90\xe2\xa6\x92\xe2\xa6\x94\xe2\xa6\x96\xe2\xa6\x98\xe2\xa7\x99\xe2\xa7\x9b\xe2\xa7\xbd\xe2\xb8\xa3\xe2\xb8\xa5\xe2\xb8\xa7\xe2\xb8\xa9\xe3\x80\x89\xe3\x80\x8b\xe3\x80\x8d\xe3\x80\x8f\xe3\x80\x91\xe3\x80\x95\xe3\x80\x97\xe3\x80\x99\xe3\x80\x9b\xe3\x80\x9e\xe3\x80\x9f\xef\xb4\xbe\xef\xb8\x98\xef\xb8\xb6\xef\xb8\xb8\xef\xb8\xba\xef\xb8\xbc\xef\xb8\xbe\xef\xb9\x80\xef\xb9\x82\xef\xb9\x84\xef\xb9\x88\xef\xb9\x9a\xef\xb9\x9c\xef\xb9\x9e\xef\xbc\x89\xef\xbc\xbd\xef\xbd\x9d\xef\xbd\xa0\xef\xbd\xa3\xc2\xbb\xe2\x80\x99\xe2\x80\x9d\xe2\x80\xba\xe2\xb8\x83\xe2\xb8\x85\xe2\xb8\x8a\xe2\xb8\x8d\xe2\xb8\x9d\xe2\xb8\xa1\xe2\x80\x9b\xe2\x80\x9f\xc2\xab\xe2\x80\x98\xe2\x80\x9c\xe2\x80\xb9\xe2\xb8\x82\xe2\xb8\x84\xe2\xb8\x89\xe2\xb8\x8c\xe2\xb8\x9c\xe2\xb8\xa0\xe2\x80\x9a\xe2\x80\x9e'"""
- delimiters = ur"b'\\-/:\xd6\x8a\xc2\xa1\xc2\xb7\xc2\xbf\xcd\xbe\xce\x87\xd5\x9a-\xd5\x9f\xd6\x89\xd6\xbe\xd7\x80\xd7\x83\xd7\x86\xd7\xb3\xd7\xb4\xd8\x89\xd8\x8a\xd8\x8c\xd8\x8d\xd8\x9b\xd8\x9e\xd8\x9f\xd9\xaa-\xd9\xad\xdb\x94\xdc\x80-\xdc\x8d\xdf\xb7-\xdf\xb9\xe0\xa0\xb0-\xe0\xa0\xbe\xe0\xa5\xa4\xe0\xa5\xa5\xe0\xa5\xb0\xe0\xb7\xb4\xe0\xb9\x8f\xe0\xb9\x9a\xe0\xb9\x9b\xe0\xbc\x84-\xe0\xbc\x92\xe0\xbe\x85\xe0\xbf\x90-\xe0\xbf\x94\xe1\x81\x8a-\xe1\x81\x8f\xe1\x83\xbb\xe1\x8d\xa1-\xe1\x8d\xa8\xe1\x90\x80\xe1\x99\xad\xe1\x99\xae\xe1\x9b\xab-\xe1\x9b\xad\xe1\x9c\xb5\xe1\x9c\xb6\xe1\x9f\x94-\xe1\x9f\x96\xe1\x9f\x98-\xe1\x9f\x9a\xe1\xa0\x80-\xe1\xa0\x8a\xe1\xa5\x84\xe1\xa5\x85\xe1\xa7\x9e\xe1\xa7\x9f\xe1\xa8\x9e\xe1\xa8\x9f\xe1\xaa\xa0-\xe1\xaa\xa6\xe1\xaa\xa8-\xe1\xaa\xad\xe1\xad\x9a-\xe1\xad\xa0\xe1\xb0\xbb-\xe1\xb0\xbf\xe1\xb1\xbe\xe1\xb1\xbf\xe1\xb3\x93\xe2\x80\x90-\xe2\x80\x97\xe2\x80\xa0-\xe2\x80\xa7\xe2\x80\xb0-\xe2\x80\xb8\xe2\x80\xbb-\xe2\x80\xbe\xe2\x81\x81-\xe2\x81\x83\xe2\x81\x87-\xe2\x81\x91\xe2\x81\x93\xe2\x81\x95-\xe2\x81\x9e\xe2\xb3\xb9-\xe2\xb3\xbc\xe2\xb3\xbe\xe2\xb3\xbf\xe2\xb8\x80\xe2\xb8\x81\xe2\xb8\x86-\xe2\xb8\x88\xe2\xb8\x8b\xe2\xb8\x8e-\xe2\xb8\x9b\xe2\xb8\x9e\xe2\xb8\x9f\xe2\xb8\xaa-\xe2\xb8\xae\xe2\xb8\xb0\xe2\xb8\xb1\xe3\x80\x81-\xe3\x80\x83\xe3\x80\x9c\xe3\x80\xb0\xe3\x80\xbd\xe3\x82\xa0\xe3\x83\xbb\xea\x93\xbe\xea\x93\xbf\xea\x98\x8d-\xea\x98\x8f\xea\x99\xb3\xea\x99\xbe\xea\x9b\xb2-\xea\x9b\xb7\xea\xa1\xb4-\xea\xa1\xb7\xea\xa3\x8e\xea\xa3\x8f\xea\xa3\xb8-\xea\xa3\xba\xea\xa4\xae\xea\xa4\xaf\xea\xa5\x9f\xea\xa7\x81-\xea\xa7\x8d\xea\xa7\x9e\xea\xa7\x9f\xea\xa9\x9c-\xea\xa9\x9f\xea\xab\x9e\xea\xab\x9f\xea\xaf\xab\xef\xb8\x90-\xef\xb8\x96\xef\xb8\x99\xef\xb8\xb0-\xef\xb8\xb2\xef\xb9\x85\xef\xb9\x86\xef\xb9\x89-\xef\xb9\x8c\xef\xb9\x90-\xef\xb9\x92\xef\xb9\x94-\xef\xb9\x98\xef\xb9\x9f-\xef\xb9\xa1\xef\xb9\xa3\xef\xb9\xa8\xef\xb9\xaa\xef\xb9\xab\xef\xbc\x81-\xef\xbc\x83\xef\xbc\x85-\xef\xbc\x87\xef\xbc\x8a\xef\xbc\x8c-\xef\xbc\x8f\xef\xbc\x9a\xef\xbc\x9b\xef\xbc\x9f\xef\xbc\xa0\xef\xbc\xbc\xef\xbd\xa1\xef\xbd\xa4\xef\xbd\xa5\xf0\x90\x84\x80\xf0\x90\x84\x81\xf0\x90\x8e\x9f\xf0\x90\x8f\x90\xf0\x90\xa1\x97\xf0\x90\xa4\x9f\xf0\x90\xa4\xbf\xf0\x90\xa9\x90-\xf0\x90\xa9\x98\xf0\x90\xa9\xbf\xf0\x90\xac\xb9-\xf0\x90\xac\xbf\xf0\x91\x82\xbb\xf0\x91\x82\xbc\xf0\x91\x82\xbe-\xf0\x91\x83\x81\xf0\x92\x91\xb0-\xf0\x92\x91\xb3'"
+ delimiters = ur"b'\\-/:\xd6\x8a\xc2\xa1\xc2\xa7\xc2\xb6\xc2\xb7\xc2\xbf\xcd\xbe\xce\x87\xd5\x9a-\xd5\x9f\xd6\x89\xd6\xbe\xd7\x80\xd7\x83\xd7\x86\xd7\xb3\xd7\xb4\xd8\x89\xd8\x8a\xd8\x8c\xd8\x8d\xd8\x9b\xd8\x9e\xd8\x9f\xd9\xaa-\xd9\xad\xdb\x94\xdc\x80-\xdc\x8d\xdf\xb7-\xdf\xb9\xe0\xa0\xb0-\xe0\xa0\xbe\xe0\xa1\x9e\xe0\xa5\xa4\xe0\xa5\xa5\xe0\xa5\xb0\xe0\xab\xb0\xe0\xb7\xb4\xe0\xb9\x8f\xe0\xb9\x9a\xe0\xb9\x9b\xe0\xbc\x84-\xe0\xbc\x92\xe0\xbc\x94\xe0\xbe\x85\xe0\xbf\x90-\xe0\xbf\x94\xe0\xbf\x99\xe0\xbf\x9a\xe1\x81\x8a-\xe1\x81\x8f\xe1\x83\xbb\xe1\x8d\xa0-\xe1\x8d\xa8\xe1\x90\x80\xe1\x99\xad\xe1\x99\xae\xe1\x9b\xab-\xe1\x9b\xad\xe1\x9c\xb5\xe1\x9c\xb6\xe1\x9f\x94-\xe1\x9f\x96\xe1\x9f\x98-\xe1\x9f\x9a\xe1\xa0\x80-\xe1\xa0\x8a\xe1\xa5\x84\xe1\xa5\x85\xe1\xa8\x9e\xe1\xa8\x9f\xe1\xaa\xa0-\xe1\xaa\xa6\xe1\xaa\xa8-\xe1\xaa\xad\xe1\xad\x9a-\xe1\xad\xa0\xe1\xaf\xbc-\xe1\xaf\xbf\xe1\xb0\xbb-\xe1\xb0\xbf\xe1\xb1\xbe\xe1\xb1\xbf\xe1\xb3\x80-\xe1\xb3\x87\xe1\xb3\x93\xe2\x80\x90-\xe2\x80\x97\xe2\x80\xa0-\xe2\x80\xa7\xe2\x80\xb0-\xe2\x80\xb8\xe2\x80\xbb-\xe2\x80\xbe\xe2\x81\x81-\xe2\x81\x83\xe2\x81\x87-\xe2\x81\x91\xe2\x81\x93\xe2\x81\x95-\xe2\x81\x9e\xe2\xb3\xb9-\xe2\xb3\xbc\xe2\xb3\xbe\xe2\xb3\xbf\xe2\xb5\xb0\xe2\xb8\x80\xe2\xb8\x81\xe2\xb8\x86-\xe2\xb8\x88\xe2\xb8\x8b\xe2\xb8\x8e-\xe2\xb8\x9b\xe2\xb8\x9e\xe2\xb8\x9f\xe2\xb8\xaa-\xe2\xb8\xae\xe2\xb8\xb0-\xe2\xb9\x81\xe3\x80\x81-\xe3\x80\x83\xe3\x80\x9c\xe3\x80\xb0\xe3\x80\xbd\xe3\x82\xa0\xe3\x83\xbb\xea\x93\xbe\xea\x93\xbf\xea\x98\x8d-\xea\x98\x8f\xea\x99\xb3\xea\x99\xbe\xea\x9b\xb2-\xea\x9b\xb7\xea\xa1\xb4-\xea\xa1\xb7\xea\xa3\x8e\xea\xa3\x8f\xea\xa3\xb8-\xea\xa3\xba\xea\xa3\xbc\xea\xa4\xae\xea\xa4\xaf\xea\xa5\x9f\xea\xa7\x81-\xea\xa7\x8d\xea\xa7\x9e\xea\xa7\x9f\xea\xa9\x9c-\xea\xa9\x9f\xea\xab\x9e\xea\xab\x9f\xea\xab\xb0\xea\xab\xb1\xea\xaf\xab\xef\xb8\x90-\xef\xb8\x96\xef\xb8\x99\xef\xb8\xb0-\xef\xb8\xb2\xef\xb9\x85\xef\xb9\x86\xef\xb9\x89-\xef\xb9\x8c\xef\xb9\x90-\xef\xb9\x92\xef\xb9\x94-\xef\xb9\x98\xef\xb9\x9f-\xef\xb9\xa1\xef\xb9\xa3\xef\xb9\xa8\xef\xb9\xaa\xef\xb9\xab\xef\xbc\x81-\xef\xbc\x83\xef\xbc\x85-\xef\xbc\x87\xef\xbc\x8a\xef\xbc\x8c-\xef\xbc\x8f\xef\xbc\x9a\xef\xbc\x9b\xef\xbc\x9f\xef\xbc\xa0\xef\xbc\xbc\xef\xbd\xa1\xef\xbd\xa4\xef\xbd\xa5\xf0\x90\x84\x80-\xf0\x90\x84\x82\xf0\x90\x8e\x9f\xf0\x90\x8f\x90\xf0\x90\x95\xaf\xf0\x90\xa1\x97\xf0\x90\xa4\x9f\xf0\x90\xa4\xbf\xf0\x90\xa9\x90-\xf0\x90\xa9\x98\xf0\x90\xa9\xbf\xf0\x90\xab\xb0-\xf0\x90\xab\xb6\xf0\x90\xac\xb9-\xf0\x90\xac\xbf\xf0\x90\xae\x99-\xf0\x90\xae\x9c\xf0\x91\x81\x87-\xf0\x91\x81\x8d\xf0\x91\x82\xbb\xf0\x91\x82\xbc\xf0\x91\x82\xbe-\xf0\x91\x83\x81\xf0\x91\x85\x80-\xf0\x91\x85\x83\xf0\x91\x85\xb4\xf0\x91\x85\xb5\xf0\x91\x87\x85-\xf0\x91\x87\x89\xf0\x91\x87\x8d\xf0\x91\x87\x9b\xf0\x91\x87\x9d-\xf0\x91\x87\x9f\xf0\x91\x88\xb8-\xf0\x91\x88\xbd\xf0\x91\x8a\xa9\xf0\x91\x93\x86\xf0\x91\x97\x81-\xf0\x91\x97\x97\xf0\x91\x99\x81-\xf0\x91\x99\x83\xf0\x91\x9c\xbc-\xf0\x91\x9c\xbe\xf0\x92\x91\xb0-\xf0\x92\x91\xb4\xf0\x96\xa9\xae\xf0\x96\xa9\xaf\xf0\x96\xab\xb5\xf0\x96\xac\xb7-\xf0\x96\xac\xbb\xf0\x96\xad\x84\xf0\x9b\xb2\x9f\xf0\x9d\xaa\x87-\xf0\x9d\xaa\x8b'"
~~~~~

and generates the updated code::

~~~~~
openers = ('"\'(<\\[{༺༼᚛⁅⁽₍⌈⌊〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨⹂〈《「『【〔〖〘〚〝〝﴿︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「«‘“‹⸂⸄⸉⸌⸜⸠‚„»’”›⸃⸅⸊⸍⸝⸡‛‟')
closers = ('"\')>\\]}༻༽᚜⁆⁾₎⌉⌋〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴾︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」»’”›⸃⸅⸊⸍⸝⸡‛‟«‘“‹⸂⸄⸉⸌⸜⸠‚„')
delimiters = ('\\-/:֊¡§¶·¿;·՚-՟։־׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹࠰-࠾࡞।॥॰૰෴๏๚๛༄-༒༔྅࿐-࿔࿙࿚၊-၏჻፠-፨᐀᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠊᥄᥅᨞᨟᪠-᪦᪨-᪭᭚-᭠᯼-᯿᰻-᰿᱾᱿᳀-᳇᳓‐-‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⵰⸀⸁⸆-⸈⸋⸎-⸛⸞⸟⸪-⸮⸰-⹁、-〃〜〰〽゠・꓾꓿꘍-꘏꙳꙾꛲-꛷꡴-꡷꣎꣏꣸-꣺꣼꤮꤯꥟꧁-꧍꧞꧟꩜-꩟꫞꫟꫰꫱꯫︐-︖︙︰-︲﹅﹆﹉-﹌﹐-﹒﹔-﹘﹟-﹡﹣﹨﹪﹫!-#%-'*,-/:;?@\。、・𐄀-𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐-𐩘𐩿𐫰-𐫶𐬹-𐬿𐮙-𐮜𑁇-𑁍𑂻𑂼𑂾-𑃁𑅀-𑅃𑅴𑅵𑇅-𑇉𑇍𑇛𑇝-𑇟𑈸-𑈽𑊩𑓆𑗁-𑗗𑙁-𑙃𑜼-𑜾𒑰-𒑴𖩮𖩯𖫵𖬷-𖬻𖭄𛲟𝪇-𝪋')
if sys.maxunicode >= 0x10FFFF: # "wide" build
    delimiters += ('𐄀-𐄂𐎟𐏐𐕯𐡗𐤟𐤿𐩐-𐩘𐩿𐫰-𐫶𐬹-𐬿𐮙-𐮜𑁇-𑁍𑂻𑂼𑂾-𑃁𑅀-𑅃𑅴𑅵𑇅-𑇉𑇍𑇛𑇝-𑇟𑈸-𑈽𑊩𑓆𑗁-𑗗𑙁-𑙃𑜼-𑜾𒑰-𒑴𖩮𖩯𖫵𖬷-𖬻𖭄𛲟𝪇-𝪋')
closing_delimiters = '\\\\.,;!?'
~~~~~
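A side note on the ``sys.maxunicode`` guard in the generated code: since Python 3.3 (PEP 393) every interpreter is effectively a "wide" build, so that branch is always taken there; only Python 2 narrow builds (``sys.maxunicode == 0xFFFF``) skip the astral-plane delimiters:

```python
import sys

# PEP 393 gave Python 3.3+ a flexible string representation; every
# build can address the full range of code points up to U+10FFFF.
print(hex(sys.maxunicode))   # -> 0x10ffff on Python 3.3 and later
```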

The current code in docutils/utils/punctuation_chars.py uses the \u...
notation:

~~~~~
openers = (u'"\'(<\\[{\u0f3a\u0f3c\u169b\u2045\u207d\u208d\u2329\u2768'
           u'\u276a\u276c\u276e\u2770\u2772\u2774\u27c5\u27e6\u27e8\u27ea'
           u'\u27ec\u27ee\u2983\u2985\u2987\u2989\u298b\u298d\u298f\u2991'
           u'\u2993\u2995\u2997\u29d8\u29da\u29fc\u2e22\u2e24\u2e26\u2e28'
           u'\u3008\u300a\u300c\u300e\u3010\u3014\u3016\u3018\u301a\u301d'
           u'\u301d\ufd3e\ufe17\ufe35\ufe37\ufe39\ufe3b\ufe3d\ufe3f\ufe41'
           u'\ufe43\ufe47\ufe59\ufe5b\ufe5d\uff08\uff3b\uff5b\uff5f\uff62'
           u'\xab\u2018\u201c\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c\u2e20'
           u'\u201a\u201e\xbb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d'
           u'\u2e1d\u2e21\u201b\u201f')
...
~~~~~

Under Py3 the auto-converted code uses literal Unicode characters.
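At the string level the two notations are interchangeable; the choice only affects how readable the source file is. A minimal demonstration:

```python
# \u escapes and literal characters denote identical strings; only
# the on-disk source representation differs.
escaped = u"\xab\u2018\u201c\u2039"   # the quote-opener run, as escapes
literal = u"«‘“‹"                     # the same characters, typed literally
print(escaped == literal)   # -> True
```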

How should we proceed?

a) close the bug as invalid and document that we are using
   Unicode 5.2 for the inline markup rules;

b) use the Py3-generated strings with
   ``# -*- coding: utf-8 -*-`` and a prepended ``u``;

c) leave this open until someone updates either Py2.7's "unicodedata" or
   the code-generating part of punctuation_chars.py, so that the original
   format can be retained?


Günter


_______________________________________________
Docutils-develop mailing list
Docutils-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/docutils-develop
Edward d'Auvergne
2016-12-31 11:35:07 UTC
Permalink
Post by Guenter Milde
Post by Günter Milde
``name``¶
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
The reason for the current behaviour is an outdated database in the
Docutils uses the "unicodedata" Python module to generate the lists of
characters allowed around inline markup in utils/punctuation_chars.py.
The character class of the pilcrow sign ¶ changed,
[snip]
Post by Guenter Milde
Under Py3 the auto-converted code uses literal Unicode characters.
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
From a complete outsider's perspective, it would make sense to split the
question. For the reStructuredText standard, it would be logical to pin
it to the latest Unicode standard (rather than tie it to whatever Unicode
version the Python installation underlying the docutils implementation
happens to provide). For the docutils implementation, it would make a lot
of sense to implement a basic module, possibly auto-generated by a script,
that works around deficiencies in the local Python installation's Unicode
support, so that docutils performs identically on all systems.

Regards,

Edward

Guenter Milde
2016-12-31 14:12:48 UTC
Permalink
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
...
Post by Edward d'Auvergne
Post by Guenter Milde
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
From a complete outsider's perspective, to me it would make sense to
Agreed.
Post by Edward d'Auvergne
For the reStructuredText standard, it would be
logical to pin it to the latest Unicode standard (rather than be tied
down based on the docutils implementation being restricted by the
underlying Python install).
However, rather than specifying "the latest", we should state a specific
version in the docs and update it together with the actual code.
Why? Because we don't track the Unicode standard's development and
auto-update our patterns.
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings
containing the characters of the relevant classes in punctuation_chars.py.
These do not depend on the Python installation (but they were generated
with Python 2.x and hence match Unicode 5.2.1).

My question b/c was a Docutils-internal one: there is a style guide saying
that literal Unicode should be avoided (except in language files), dating
from times when Unicode support in editors, fonts, etc. was patchy.

Günter


Edward d'Auvergne
2017-01-02 10:03:22 UTC
Permalink
Post by Guenter Milde
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
...
Post by Edward d'Auvergne
Post by Guenter Milde
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
[snip]
Post by Guenter Milde
Post by Edward d'Auvergne
For the reStructuredText standard, it would be
logical to pin it to the latest Unicode standard (rather than be tied
down based on the docutils implementation being restricted by the
underlying Python install).
However, rather than specifying "the latest", we should state a specific
version in the docs and update it together with the actual code.
Why? We don't track the Unicode standard development and auto-update our
patterns.
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification. Why should alternative implementations be bound to the
internal details of the docutils implementation?
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
version number within the reStructuredText specification, based on the
current status of the docutils code, this affects the external aspects
in a significant way. My bug report for the failure of ``name``¶ -
a construct used extensively in the Python HTML docs (see the permalinks
shown when hovering over text at
https://docs.python.org/3/library/codecs.html, for example) - is solely
about docutils, as the reST specification does not mandate a Unicode
version, and hence the latest version would be a reasonable assumption.

Regards,

Edward

David Goodger
2017-01-02 19:26:31 UTC
Permalink
Post by Edward d'Auvergne
Post by Guenter Milde
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
...
Post by Edward d'Auvergne
Post by Guenter Milde
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
[snip]
Post by Guenter Milde
Post by Edward d'Auvergne
For the reStructuredText standard, it would be
logical to pin it to the latest Unicode standard (rather than be tied
down based on the docutils implementation being restricted by the
underlying Python install).
However, rather than specifying "the latest", we should state a specific
version in the docs and update it together with the actual code.
Why? We don't track the Unicode standard development and auto-update our
patterns.
For the implementation, we should definitely state a specific Unicode
version. docutils/docutils/utils/punctuation_chars.py could be updated
using the very latest Python (e.g. 3.6.0 as of this writing) with
whatever version of Unicode it supports.

For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
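Regenerating punctuation_chars.py with a current interpreter could also be sanity-checked against that interpreter's own database; a sketch of such a check (the character sample below is illustrative, not the full generated string):

```python
import unicodedata

# Every character in the generated "delimiters" string should fall in
# the dash (Pd) or other-punctuation (Po) categories under the running
# interpreter's Unicode database.
sample_delimiters = "-/:¡·¿¶"        # small excerpt for illustration
for char in sample_delimiters:
    category = unicodedata.category(char)
    assert category in ("Pd", "Po"), (char, category)

# The Unicode version this Python ships with:
print(unicodedata.unidata_version)
```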
Post by Edward d'Auvergne
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification.
Neither Docutils nor reStructuredText has such lofty goals. It is not
our purpose to create an independent specification for clean-room
implementation. We have no aspirations to a "reStructuredText
standard" à la ISO.

Your perspective, while valid for you, is not necessarily shared by
the developers of Docutils.

On the other hand, if you do choose to develop a clean-room
implementation of the reStructuredText parser, you are free to
interpret the specification as you like, and to fill in any gaps as
you see fit.
Post by Edward d'Auvergne
Why should alternative implementations be bound to the
internal details of the docutils implementation?
Why should we (the developers of Docutils) care about alternative
implementations?

OK, that's a bit callous. Reworded: why should we care so much that
we'd go to great lengths to accommodate such hypothetical alternative
implementations? Answer: we shouldn't, and we don't.
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
specification number within the reStructuredText specification based
on the current status of the docutils code, this affects the external
aspects in a significant way. My bug report for the failure of
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example)
I believe the pilcrows (¶) are added by Sphinx to the output HTML.
They are not present in the reST sources.

If you want "``name``¶" in your source reST, just insert an escaped
space: ``name``\ ¶.
Post by Edward d'Auvergne
- is solely
for docutils, as the reST specification does not mandate a Unicode
version and hence the latest version would be a reasonable assumption.
The reST specification is a best effort to document the intentions of
the Docutils developers, and Docutils is a best effort to implement
the reST specification. They are interdependent. From
docutils/docs/ref/rst/restructuredtext.txt (the spec in question):
"The reStructuredText parser is a component of Docutils_."

I agree that it would be ideal to base our implementation on the
latest Unicode spec. However, we must temper that with practicality.
docutils/docutils/utils/punctuation_chars.py could certainly be
updated with the latest Unicode spec as implemented in a current
version of Python. We use Python's internal implementation to handle
many aspects of Unicode processing. In order to use the latest Unicode
standard in all contexts (e.g. in the running code), we'd have to
reimplement a lot more of what Python provides, and that's just not
worth the effort.

David Goodger
<http://python.net/~goodger>
Edward d'Auvergne
2017-01-02 22:52:43 UTC
Permalink
Post by David Goodger
Post by Edward d'Auvergne
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification.
Neither Docutils nor reStructuredText has such lofty goals. It is not
our purpose to create an independent specification for clean-room
implementation. We have no aspirations to a "reStructuredText
standard" à la ISO.
Your perspective, while valid for you, is not necessarily shared by
the developers of Docutils.
On the other hand, if you do choose to develop a clean-room
implementation of the reStructuredText parser, you are free to
interpret the specification as you like, and to fill in any gaps as
you see fit.
The clean-room implementation was a hypothetical - that's not what I'm
working on. But I have been using the specification in isolation to
generate reStructuredText formatted documents independently of Docutils,
and from time to time using Docutils to validate the output. That way I
stumbled on the ``name``¶ issue - I generated that construct automatically
and Docutils was not happy. Another issue appears to be definition list
terms wrapped across multiple lines - the standard does not say anything
about this but Docutils does not like that. Should such issues be reported
in the bug tracker or would it be better here?
Post by David Goodger
Post by Edward d'Auvergne
Why should alternative implementations be bound to the
internal details of the docutils implementation?
Why should we (the developers of Docutils) care about alternative
implementations?
OK, that's a bit callous. Reworded: why should we care so much that
we'd go to great lengths to accommodate such hypothetical alternative
implementations? Answer: we shouldn't, and we don't.
I assumed that there was a separation of the reStructuredText standard and
the Docutils implementation - that's how the specification reads. In any
case, I'm simply reading and following the specification which to me looks
sufficient to produce reST formatted content.
Post by David Goodger
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
specification number within the reStructuredText specification based
on the current status of the docutils code, this affects the external
aspects in a significant way. My bug report for the failure of
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example)
I believe the pilcrows (¶) are added by Sphinx to the output HTML.
They are not present in the reST sources.
If you want "``name``¶" in your source reST, just insert an escaped
space: ``name``\ ¶.
I know I can do this, but the specification says that a character from the
Unicode PO class can directly follow the inline markup end string (i.e.
without the escaping mechanism). That's why I reported it.
Post by David Goodger
Post by Edward d'Auvergne
- is solely
for docutils, as the reST specification does not mandate a Unicode
version and hence the latest version would be a reasonable assumption.
The reST specification is a best effort to document the intentions of
the Docutils developers, and Docutils is a best effort to implement
the reST specification. They are interdependent. From
"The reStructuredText parser is a component of Docutils_."
I agree that it would be ideal to base our implementation on the
latest Unicode spec. However, we must temper that with practicality.
docutils/docutils/utils/punctuation_chars.py could certainly be
updated with the latest Unicode spec as implemented in a current
version of Python. We use Python's internal implementation to handle
many aspects of Unicode processing. In order to use the latest Unicode
standard in all contexts (e.g. in the running code), we'd have to
reimplement a lot more of what Python provides, and that's just not
worth the effort.
Well, an external tool could create reST formatted text based on your
specification, and if rstcheck is unhappy about that output a bug report
could be filed for improving either the specification or Docutils ;) I'm
guessing the zero-width space (u"\u200B") will do some interesting things.

Regards,

Edward
David Goodger
2017-01-02 23:23:04 UTC
Permalink
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification.
Neither Docutils nor reStructuredText have such lofty goals. It is not
our purpose to create an independent specification for clean-room
implementation. We have no aspirations to a "reStructuredText
standard" à la ISO.
Your perspective, while valid for you, is not necessarily shared by
the developers of Docutils.
On the other hand, if you do choose to develop a clean-room
implementation of the reStructuredText parser, you are free to
interpret the specification as you like, and to fill in any gaps as
you see fit.
The clean-room implementation was a hypothetical - that's not what I'm
working on. But I have been using the specification in isolation to
generate reStructuredText formatted documents independently of Docutils, and
from time to time using Docutils to validate the output. That way I
stumbled on the ``name``¶ issue - I generated that construct automatically
and Docutils was not happy. Another issue appears to be definition list
terms wrapped across multiple lines - the standard does not say anything
about this but Docutils does not like that.
"A term is a simple one-line word or phrase."
— http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#definition-lists
Post by Edward d'Auvergne
Should such issues be reported
in the bug tracker or would it be better here?
If it's a real bug, then the bug tracker is the best place.
Discussions on mailing lists fade from memory.
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
Why should alternative implementations be bound to the
internal details of the docutils implementation?
Why should we (the developers of Docutils) care about alternative
implementations?
OK, that's a bit callous. Reworded: why should we care so much that
we'd go to great lengths to accommodate such hypothetical alternative
implementations? Answer: we shouldn't, and we don't.
I assumed that there was a separation of the reStructuredText standard and
the Docutils implementation - that's how the specification reads. In any
case, I'm simply reading and following the specification which to me looks
sufficient to produce reST formatted content.
It should be. The spec was mostly written before the code. Docutils
was DDD (documentation-driven development) as well as TDD.

My point is that the spec is not, and was never intended to be, a
standard on the level of ISO or W3C or IEC.
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs identically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
specification number within the reStructuredText specification based
on the current status of the docutils code, this affects the external
aspects in a significant way. My bug report for the failure of
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example)
I believe the pilcrows (¶) are added by Sphinx to the output HTML.
They are not present in the reST sources.
If you want "``name``¶" in your source reST, just insert an escaped
space: ``name``\ ¶.
I know I can do this, but the specification says that a character from the
Unicode PO class can directly follow the inline markup end string (i.e.
without the escaping mechanism). That's why I reported it.
Fair enough. And that's true, for whatever version of Unicode we're
supporting. It would certainly be useful to document the version, even
in the code (and maybe even have a command-line option that reports
it).

The problem here is that you were expecting ¶ to be PO class based on
the latest Unicode, but Docutils was using definitions from an older
Unicode where ¶ isn't PO class. <shrug>
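The mismatch is easy to see from Python's own Unicode tables. A quick
stdlib check (assuming a Python 3 interpreter, whose unicodedata ships
Unicode 6.0 or later; Python 2 shipped 5.x tables, where the pilcrow
was still So):

```python
import unicodedata

# The Unicode version bundled with this Python (6.1+ on any Python >= 3.3).
print(unicodedata.unidata_version)

# Since Unicode 6.0 the pilcrow is Po (Punctuation, other), so rule 7
# allows it directly after an inline markup end string. Under the older
# 5.x tables it was So (Symbol, other), hence the Docutils warning.
assert unicodedata.category("\u00B6") == "Po"
```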
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
- is solely
for docutils, as the reST specification does not mandate a Unicode
version and hence the latest version would be a reasonable assumption.
The reST specification is a best effort to document the intentions of
the Docutils developers, and Docutils is a best effort to implement
the reST specification. They are interdependent. From
"The reStructuredText parser is a component of Docutils_."
I agree that it would be ideal to base our implementation on the
latest Unicode spec. However, we must temper that with practicality.
docutils/docutils/utils/punctuation_chars.py could certainly be
updated with the latest Unicode spec as implemented in a current
version of Python. We use Python's internal implementation to handle
many aspects of Unicode processing. In order to use the latest Unicode
standard in all contexts (e.g. in the running code), we'd have to
reimplement a lot more of what Python provides, and that's just not
worth the effort.
Well, an external tool could create reST formatted text based on your
specification, and if rstcheck is unhappy about that output a bug report
could be filed for improving either the specification or Docutils ;)
Ideally true.
Post by Edward d'Auvergne
I'm
guessing the zero-width space (u"\u200B") will do some interesting things.
It ought to work!

David Goodger
<http://python.net/~goodger>

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Docutils-develop mailing list
Docutils-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/docutils-develop

Edward d'Auvergne
2017-01-03 09:21:31 UTC
Permalink
Post by David Goodger
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification.
Neither Docutils nor reStructuredText have such lofty goals. It is not
our purpose to create an independent specification for clean-room
implementation. We have no aspirations to a "reStructuredText
standard" à la ISO.
Your perspective, while valid for you, is not necessarily shared by
the developers of Docutils.
On the other hand, if you do choose to develop a clean-room
implementation of the reStructuredText parser, you are free to
interpret the specification as you like, and to fill in any gaps as
you see fit.
The clean-room implementation was a hypothetical - that's not what I'm
working on. But I have been using the specification in isolation to
generate reStructuredText formatted documents independently of Docutils, and
from time to time using Docutils to validate the output. That way I
stumbled on the ``name``¶ issue - I generated that construct automatically
and Docutils was not happy. Another issue appears to be definition list
terms wrapped across multiple lines - the standard does not say anything
about this but Docutils does not like that.
"A term is a simple one-line word or phrase."
— http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#definition-lists
Sorry, I missed the "one-line" part.
Post by David Goodger
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
specification number within the reStructuredText specification based
on the current status of the docutils code, this affects the external
aspects in a significant way. My bug report for the failure of
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example)
I believe the pilcrows (¶) are added by Sphinx to the output HTML.
They are not present in the reST sources.
If you want "``name``¶" in your source reST, just insert an escaped
space: ``name``\ ¶.
I know I can do this, but the specification says that a character from the
Unicode PO class can directly follow the inline markup end string (i.e.
without the escaping mechanism). That's why I reported it.
Fair enough. And that's true, for whatever version of Unicode we're
supporting. It would certainly be useful to document the version, even
in the code (and maybe even have a command-line option that reports
it).
The problem here is that you were expecting ¶ to be PO class based on
the latest Unicode, but Docutils was using definitions from an older
Unicode where ¶ isn't PO class. <shrug>
Exactly why I created a low priority bug report for it!
Post by David Goodger
Post by Edward d'Auvergne
I'm
guessing the zero-width space (u"\u200B") will do some interesting things.
It ought to work!
Not always:

$ python3 -c "print('==\nXY\n==')" | rst2xml --strict > /dev/null
$ python3 -c "print('==\nX\u200bY\n==')"
==
XY
==
$ python3 -c "print('==\nX\u200bY\n==')" | rst2xml --strict > /dev/null
<stdin>:1: (INFO/1) Possible incomplete section title.
Treating the overline as ordinary text because it's so short.
Exiting due to level-1 (INFO) system message.
$ python3 -c "print('+----+\n| XY |\n+----+\n')" | rst2xml --strict > /dev/null
$ python3 -c "print('+----+\n| X\u200bY |\n+----+\n')"
+----+
| XY |
+----+

$ python3 -c "print('+----+\n| X\u200bY |\n+----+\n')" | rst2xml
--strict > /dev/null
<stdin>:1: (ERROR/3) Malformed table.

+----+
| XY |
+----+
Exiting due to level-3 (ERROR) system message.
$
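The failures above follow directly from the character counts; a minimal
stdlib sketch of why the ZWSP breaks 2D constructs (no Docutils needed):

```python
import unicodedata

text = "X\u200bY"  # "XY" with a zero-width space in between
# len() counts the ZWSP as a third character...
assert len(text) == 3
# ...but it is a Cf (format) character that occupies no display column,
# so a "==" overline that visually covers the text no longer matches
# len(line), and the parser rejects the title or table.
assert unicodedata.category("\u200b") == "Cf"
assert len("==") < len(text)
```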

Not looking at the docutils sources and creating a reST writer based
solely on the specification means that I run into a number of such
issues. The zero-width space appears a lot, as it looks like it is
used quite liberally by Mediawiki.

Regards,

Edward

David Goodger
2017-01-04 04:47:55 UTC
Permalink
Post by Edward d'Auvergne
Post by David Goodger
Post by Edward d'Auvergne
I'm
guessing the zero-width space (u"\u200B") will do some interesting things.
It ought to work!
... examples of 2D parsing elided ...

I meant that it should work in cases like the one under discussion:
before the "¶" in "``name``¶", in place of an escaped space ("\ ").

reST's 2D parsing (tables, titles, etc.) depends on uniform-width
characters. So yes, zero-width characters do interesting things there.
Post by Edward d'Auvergne
Not looking at the docutils sources and creating a reST writer based
solely on the specification means that I run into a number of such
issues.
It's easy to construct pathological cases that defeat the current reST
parser. Whether or not they are worth the development time to support
is a different issue entirely.
Post by Edward d'Auvergne
The zero-width space appears a lot as it looks like it is
used quite liberally by Mediawiki.
Is this just an interesting intellectual exercise, or should Docutils
change because of this? If so, how?

David Goodger
<http://python.net/~goodger>

Guenter Milde
2017-01-05 10:45:57 UTC
Permalink
Dear Edward and David,
Post by David Goodger
Post by Edward d'Auvergne
I'm guessing the zero-width space (u"\u200B") will do some
interesting things.
...
Post by David Goodger
before the "¶" in "``name``¶", in place of an escaped space ("\ ").
reST's 2D parsing (tables, titles, etc.) depends on uniform-width
characters. So yes, zero-width characters do interesting things there.
This is similar to the problems with East Asian double-width characters
and combining characters.

There have been fixes for double-width characters for ages.

Combining characters are fixed for headings and simple tables in Release
0.9 (2012-05-02). Fixing grid tables is an open task. See
https://sourceforge.net/p/docutils/bugs/128/ (former [ 2926161 ]).
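Combining characters cause the same len()-versus-columns mismatch; a
small stdlib illustration:

```python
import unicodedata

# "é" written as base letter + combining acute accent: two code points,
# one display column.
text = "e\u0301"
assert len(text) == 2
assert unicodedata.combining("\u0301") > 0  # nonzero combining class

# So matching title underlines or table borders against such text with
# plain len() over-counts by one per combining mark.
```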
Post by David Goodger
Post by Edward d'Auvergne
Not looking at the docutils sources and creating a reST writer based
solely on the specification means that I run into a number of such
issues.
It's easy to construct pathological cases that defeat the current reST
parser. Whether or not they are worth the development time to support
is a different issue entirely.
Sometimes the solution is updating/refining/clarifying the specs,
sometimes just giving a pointer to the reference implementation.

In any case, not looking at the docutils sources is a poor basis for
discussion about implementation of fixes.
Post by David Goodger
Post by Edward d'Auvergne
The zero-width space appears a lot as it looks like it is
used quite liberally by Mediawiki.
In the input, or (like the ¶ in Sphinx) only in the output?
Post by David Goodger
Is this just an interesting intellectual exercise, or should Docutils
change because of this? If so, how?
I could imagine the following:

* replace the current handling of combining characters with a version
counting for all zero-width characters.

* clarify in the specs, that "line length" or similar in definitions like

An underline/overline is a single repeated punctuation character that
begins in column 1 and forms a line extending at least as far as the
right edge of the title text.

are valid for monospace characters of unit width with some listed
exceptions.

Günter




Edward d'Auvergne
2017-01-09 13:15:55 UTC
Permalink
Post by Guenter Milde
Dear Edward and David,
Post by David Goodger
Post by Edward d'Auvergne
I'm guessing the zero-width space (u"\u200B") will do some
interesting things.
...
Post by David Goodger
before the "¶" in "``name``¶", in place of an escaped space ("\ ").
reST's 2D parsing (tables, titles, etc.) depends on uniform-width
characters. So yes, zero-width characters do interesting things there.
This is similar to the problems with East Asian double-width characters
and combining characters.
There have been fixes for double-width characters for ages.
Combining characters are fixed for headings and simple tables in Release
0.9 (2012-05-02). Fixing grid tables is an open task. See
https://sourceforge.net/p/docutils/bugs/128/ (former [ 2926161 ]).
Post by David Goodger
Post by Edward d'Auvergne
Not looking at the docutils sources and creating a reST writer based
solely on the specification means that I run into a number of such
issues.
It's easy to construct pathological cases that defeat the current reST
parser. Whether or not they are worth the development time to support
is a different issue entirely.
Sometimes the solution is updating/refining/clarifying the specs,
sometimes just giving a pointer to the reference implementation.
In any case, not looking at the docutils sources is a poor basis for
discussion about implementation of fixes.
Post by David Goodger
Post by Edward d'Auvergne
The zero-width space appears a lot as it looks like it is
used quite liberally by Mediawiki.
In input or (like the ¶ in Sphinx only in output)?
Post by David Goodger
Is this just an interesting intellectual exercise, or should Docutils
change because of this? If so, how?
I could imagine the following:
* replace the current handling of combining characters with a version
counting for all zero-width characters.
* clarify in the specs, that "line length" or similar in definitions like
An underline/overline is a single repeated punctuation character that
begins in column 1 and forms a line extending at least as far as the
right edge of the title text.
are valid for monospace characters of unit width with some listed
exceptions.
Hi Günter,

I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]? If this fast bisect algorithm is of interest,
the Python wcwidth package might need to be downgraded to the 10+ year
old 5.x Unicode standard used in Python 2. Where is the width
algorithm implemented in docutils? I see that double widths are
handled correctly:

$ python3 -c "print('====\n一二\n====')" | rst2xml --strict > /dev/null
$

Is this from the separate_wide_chars() function?
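Whatever the entry point in Docutils, the classification itself is
available from the stdlib; a quick check of why 一二 counts as four
columns under "====":

```python
import unicodedata

# CJK ideographs are classified 'W' (wide), taking two columns in a
# monospaced terminal; plain ASCII letters are 'Na' (narrow).
assert unicodedata.east_asian_width("一") == "W"
assert unicodedata.east_asian_width("二") == "W"
assert unicodedata.east_asian_width("A") == "Na"
```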

Regards,

Edward


[1] https://pypi.python.org/pypi/wcwidth
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

Guenter Milde
2017-01-10 09:59:17 UTC
Permalink
...
Post by Edward d'Auvergne
* replace the current handling of combining characters with a version
counting for all zero-width characters.
* clarify in the specs, that "line length" or similar in definitions like
An underline/overline is a single repeated punctuation character that
begins in column 1 and forms a line extending at least as far as the
right edge of the title text.
are valid for monospace characters of unit width with some listed
exceptions.
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
Thank you for the pointer.
Post by Edward d'Auvergne
If this fast bisect algorithm is of interest,
the Python wcwidth package might need to be downgraded to the 10+ year
old 5.x Unicode standard used in Python 2.
There are several issues when using the wcwidth module:

+1 don't reinvent the wheel:
maintained implementation of a column-width determination function

+1 stability: character tables are part of the module, do not depend on
Python version.

The current implementation of wide-char correction depends on
unicodedata from the installed Python version.

-2 external dependency

-1 updating this module may break rST documents


In addition, even the external module cannot solve the ambiguity:

Example::

from wcwidth import wcswidth
text = u'wait ⌚ or ⌛'
print(text)
print('x' * len(text))
print('x' * wcswidth(text))


For wcswidth, WATCH and HOURGLASS are 2 columns wide.
In my text editor, WATCH and HOURGLASS are single-width characters (which
also makes most sense to me).
On some terminals, both characters are followed by a space to make them
double width. In `geany`, the text panel uses single width and the
terminal panel double width.

The problem is generic:

No established formal standards exist at present on which Unicode
character shall occupy how many cell positions on character terminals.
-- Markus Kuhn http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

IMO, Docutils should account for the display in "common" text editors using
monospaced fonts. Speed is not a primary issue.
Maybe using a local implementation is best.
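A local wcswidth()-like helper could stay within the stdlib. A minimal
sketch (the function name and width rules here are illustrative
assumptions, not Docutils' actual utils.column_width):

```python
import unicodedata

def display_width(text):
    """Approximate columns `text` occupies in a monospaced context."""
    width = 0
    for char in text:
        if unicodedata.combining(char) or unicodedata.category(char) == "Cf":
            continue              # combining marks, ZWSP, WJ, ...: 0 columns
        if unicodedata.east_asian_width(char) in ("W", "F"):
            width += 2            # wide / fullwidth East Asian characters
        else:
            width += 1
    return width

assert display_width("X\u200bY") == 2  # zero-width space ignored
assert display_width("e\u0301") == 1   # combining acute ignored
assert display_width("一二") == 4       # two wide characters
```

Like wcswidth, such a helper still cannot resolve the ambiguous cases
Markus Kuhn describes, so documenting fail-safe source forms remains
necessary.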


The documentation must make the remaining ambiguity clear and point to
fail-safe source text:

* use additional underline characters in section headings and simple tables

* avoid "critical" characters in grid tables (use substitutions if required).
Post by Edward d'Auvergne
Where is the width
algorithm implemented in docutils?
docutils/docutils/statemachine.py:1450: def pad_double_width(self, pad_char):

Uses `unicodedata.east_asian_width`.

@David:

How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables, rather than
padding with `double_width_pad_char`?

+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)

-1 API change


Günter
David Goodger
2017-01-16 22:07:26 UTC
Permalink
Post by Guenter Milde
...
Post by Edward d'Auvergne
* replace the current handling of combining characters with a version
counting for all zero-width characters.
* clarify in the specs, that "line length" or similar in definitions like
An underline/overline is a single repeated punctuation character that
begins in column 1 and forms a line extending at least as far as the
right edge of the title text.
are valid for monospace characters of unit width with some listed
exceptions.
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
Thank you for the pointer.
Post by Edward d'Auvergne
If this fast bisect algorithm is of interest,
the Python wcwidth package might need to be downgraded to the 10+ year
old 5.x Unicode standard used in Python 2.
maintained implementation of a column-width determination function
+1 stability: character tables are part of the module, do not depend on
Python version.
The current implementation of wide-char correction depends on
unicodedata from the installed Python version.
-2 external dependency
-1 updating this module may break rST documents
from wcwidth import wcswidth
text = u'wait ⌚ or ⌛'
print(text)
print('x' * len(text))
print('x' * wcswidth(text))
For wcswidth, WATCH and HOURGLASS are 2 columns wide.
In my text editor, WATCH and HOURGLASS are single-width characters (which
also makes most sense to me).
On some terminals, both characters are followed by space to make them double
width. In `geany`, the text panel uses single width and the terminal panel
double width.
No established formal standards exist at present on which Unicode
character shall occupy how many cell positions on character terminals.
-- Markus Kuhn http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
IMO, Docutils should account for the display in "common" text editors using
monospaced fonts. Speed is no primary issue.
Maybe using a local implementation is best.
The documentation must make clear the remaining ambiguity and point to
* additional underline characters in section headings and simple tables
* avoid "critical" characters in grid tables (use substitutions if required).
Post by Edward d'Auvergne
Where is the width
algorithm implemented in docutils?
Uses `unicodedata.east_asian_width`.
How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables, rather than
padding with `double_width_pad_char`?
Sure, sounds fine to me.
Post by Guenter Milde
+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)
-1 API change
What exactly would the API change be?

David Goodger
<http://python.net/~goodger>
Guenter Milde
2017-01-18 14:34:29 UTC
Permalink
...
Post by David Goodger
Post by Guenter Milde
Post by Edward d'Auvergne
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
...
Post by David Goodger
Post by Guenter Milde
How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables, rather than
padding with `double_width_pad_char`?
Sure, sounds fine to me.
Actually, we already have this for section headings and simple tables:

utils/__init__.py:

601: def strip_combining_chars(text):
606: def find_combining_chars(text):
616: def column_indices(text):

Handle combining characters

639: def column_width(text):

Correct ``len(text)`` for wide East Asian and combining Unicode chars.

Should be easy to augment it for zero-width characters.

However, I did not manage to find out where this should be used in the grid
table parser to prevent errors with combining characters.
Post by David Goodger
Post by Guenter Milde
+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)
-1 API change
What exactly would the API change be?
This depends on the actual implementation:

* When obsoleting the use of pad_double_width() by using
utils.column_width() instead of len(), it would make sense to remove the
pad_double_width() function.

* When accounting for wide characters in utils.column_indices(), the result
may differ from present computations.

* When not padding table source lines in states.isolate_grid_table() and
states.isolate_simple_table(), the `block` returned by these functions
is different from the current implementation.
This may affect alternative table parsers.

* When considering additional zero-width characters (ZWSP, WJ, ZWNBSP
...), table parsing and section heading recognition may change.


Günter
Guenter Milde
2017-01-19 21:28:26 UTC
Permalink
Post by Guenter Milde
However, I did not manage to find out where this should be used in the grid
table parser to prevent errors with combining characters.
I got one step further: with the appended patch, combining characters can be
used in some simple grid tables. However, tables with combining characters
in a multi-line cell produce strange results for unknown reasons
(probably some offset when extracting cell content).

Can anybody help to spot the reason or give advice on how to get more
diagnostic output?


Günter


Index: states.py
===================================================================
--- states.py (Revision 8017)
+++ states.py (Arbeitskopie)
@@ -1708,7 +1708,7 @@
messages.extend(self.malformed_table(block))
return [], messages, blank_finish
for i in range(len(block)): # check right edge
- if len(block[i]) != width or block[i][-1] not in '+|':
+ if len(utils.strip_combining_chars(block[i])) != width or block[i][-1] not in '+|':
messages.extend(self.malformed_table(block))
return [], messages, blank_finish
return block, messages, blank_finish
Index: tableparser.py
===================================================================
--- tableparser.py (Revision 8017)
+++ tableparser.py (Arbeitskopie)
@@ -167,7 +167,11 @@
We'll end up knowing all the row and column boundaries, cell positions
and their dimensions.
"""
+ # a copy of the block with zero-width characters taken out:
+ self.grid_block = [strip_combining_chars(line) for line in self.block]
+
corners = [(0, 0)]
+
while corners:
top, left = corners.pop(0)
if top == self.bottom or left == self.right \
@@ -208,7 +212,7 @@

def scan_cell(self, top, left):
"""Starting at the top-left corner, start tracing out a cell."""
- assert self.block[top][left] == '+'
+ assert self.grid_block[top][left] == '+'
result = self.scan_right(top, left)
return result

@@ -218,7 +222,7 @@
boundaries ('+').
"""
colseps = {}
- line = self.block[top]
+ line = self.grid_block[top]
for i in range(left + 1, self.right + 1):
if line[i] == '+':
colseps[i] = [top]
@@ -238,7 +242,7 @@
"""
rowseps = {}
for i in range(top + 1, self.bottom + 1):
- if self.block[i][right] == '+':
+ if self.grid_block[i][right] == '+':
rowseps[i] = [right]
result = self.scan_left(top, left, i, right)
if result:
@@ -245,7 +249,7 @@
newrowseps, colseps = result
update_dict_of_lists(rowseps, newrowseps)
return i, rowseps, colseps
- elif self.block[i][right] != '|':
+ elif self.grid_block[i][right] != '|':
return None
return None

@@ -255,7 +259,7 @@
It must line up with the starting point.
"""
colseps = {}
- line = self.block[bottom]
+ line = self.grid_block[bottom]
for i in range(right - 1, left, -1):
if line[i] == '+':
colseps[i] = [bottom]
@@ -275,9 +279,9 @@
"""
rowseps = {}
for i in range(bottom - 1, top, -1):
- if self.block[i][left] == '+':
+ if self.grid_block[i][left] == '+':
rowseps[i] = [left]
- elif self.block[i][left] != '|':
+ elif self.grid_block[i][left] != '|':
return None
return rowseps
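The patch above relies on strip_combining_chars(); docutils ships such a
helper in docutils.utils, and a minimal stdlib-only equivalent (for
illustration only) is:

```python
import unicodedata

def strip_combining_chars(text):
    # Remove combining marks so that source lines containing accented
    # characters line up with the drawn table borders column-wise.
    return ''.join(ch for ch in text if not unicodedata.combining(ch))
```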


Test file:

Tables with zero-width and wide Unicode characters
--------------------------------------------------

Some Unicode characters have a different width even in monospaced fonts:

width 0: combining characters, zero-width space, word joiner

width 2: East Asian wide characters

Tables that work
~~~~~~~~~~~~~~~~

================================ =========
Cell with combining char á acute next cell
================================ =========

+----------------------------------+
| Cell with combining char á acute |
+----------------------------------+

+--------------------------------+---+
|Cell with combining char á acute|á |
+--------------------------------+---+

+--------------------------+--------+
| Cell with combining char | column |
| á acute in second line | two á |
+--------------------------+--------+

+----------------------------------+
| first row |
+----------------------------------+
| Cell with combining char á acute |
+----------------------------------+

+----------------------------------+
| Cell with combining char á acute |
+----------------------------------+
| second row |
+----------------------------------+

+----------------------------------+----------------------------------+
| Header with combining á acute | next header without combi-char |
+==================================+==================================+
| second row | second row second column |
+----------------------------------+----------------------------------+


+----------------------------------+
| Header |
+==================================+
| Cell with combining char á acute |
+----------------------------------+

+------------------------+-----------+-----------+-----------+
| Heáder row, column 1 | Heáder 2 | Heáder 3 | Header 4 |
+========================+===========+===========+===========+
| body row 1, column 1 | column 2 | column 3 | column 4 |
+------------------------+-----------+-----------+-----------+



Tables that fail
~~~~~~~~~~~~~~~~

Spurious character «|»:

+-----------------------+
| á acute |
| two lines |
+-----------------------+

Table cell is parsed as a definition list, as is the following cell (without
a combining character):

+-----------------------+----------+
| á acute + empty line | test |
| | |
+-----------------------+----------+


A complex table with combining characters at different places:

+------------------------+------------+----------+----------+
| Header row, column 1 | Header 2 | Heáder 3 | Header 4 |
| (heáder rows optional) | | | |
+========================+============+==========+==========+
| body row 1, column 1 | column 2 | column 3 | column 4 |
+------------------------+------------+----------+----------+
| body row 2 | Cells may spán columns. |
+------------------------+------------+---------------------+
| body row 3 | Cells may | - Table cells |
+------------------------+ span rows. | - contain |
| body ròw 4 | | - body elements. |
| | Parágraph. | |
+------------------------+------------+----------+----------+
| body row 5 | Cells máy also be | |
| | empty: ``-->`` | |
+------------------------+-----------------------+----------+






------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Docutils-develop mailing list
Docutils-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/docutils-develop

Please use "Reply All" to reply to the list.
David Goodger
2017-01-29 22:58:01 UTC
Permalink
Post by Guenter Milde
...
Post by David Goodger
Post by Guenter Milde
Post by Edward d'Auvergne
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
...
Post by David Goodger
Post by Guenter Milde
How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables instead of the
padding with `double_width_pad_char`?
Sure, sounds fine to me.
Handle combining characters
Correct ``len(text)`` for wide East Asian and combining Unicode chars.
Should be easy to augment it for zero-width characters.
However, I did not manage to find out where this should be used in the grid
table parser to prevent errors with combining characters.
Post by David Goodger
Post by Guenter Milde
+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)
-1 API change
What exactly would the API change be?
* When obsoleting the use of pad_double_width() by using
utils.column_width() instead of len(), it would make sense to remove the
pad_double_width() function.
* When accounting for wide characters in utils.column_indices(), the result
may differ from present computations.
* When not padding table source lines in states.isolate_grid_table() and
states.isolate_simple_table(), the `block` returned by these functions
is different from the current implementation.
This may affect alternative table parsers.
* When considering additional zero-width characters (ZWSP, WJ, ZWNBSP
...), table parsing and section heading recognition may change.
Seems like more trouble than it's worth. I'm not too worried by this.
It may only warrant a mention in todo.txt for now (with reference back
to this thread), and we can reexamine it if it ever comes up in real
life examples.

David Goodger
<http://python.net/~goodger>

Edward d'Auvergne
2017-02-05 12:13:10 UTC
Permalink
Post by David Goodger
Post by Guenter Milde
...
Post by David Goodger
Post by Guenter Milde
Post by Edward d'Auvergne
I was wondering if you have heard about the wcwidth() and wcswidth()
implementations [1, 2]?
...
Post by David Goodger
Post by Guenter Milde
How about using a wcswidth()-like implementation instead of len() when
determining text length for section headings and tables instead of the
padding with `double_width_pad_char`?
Sure, sounds fine to me.
Handle combining characters
Correct ``len(text)`` for wide East Asian and combining Unicode chars.
Should be easy to augment it for zero-width characters.
However, I did not manage to find out where this should be used in the grid
table parser to prevent errors with combining characters.
Post by David Goodger
Post by Guenter Milde
+1 works also for zero-width characters and combining characters
(solves https://sourceforge.net/p/docutils/bugs/128/)
-1 API change
What exactly would the API change be?
* When obsoleting the use of pad_double_width() by using
utils.column_width() instead of len(), it would make sense to remove the
pad_double_width() function.
* When accounting for wide characters in utils.column_indices(), the result
may differ from present computations.
* When not padding table source lines in states.isolate_grid_table() and
states.isolate_simple_table(), the `block` returned by these functions
is different from the current implementation.
This may affect alternative table parsers.
* When considering additional zero-width characters (ZWSP, WJ, ZWNBSP
...), table parsing and section heading recognition may change.
Seems like more trouble than it's worth. I'm not too worried by this.
It may only warrant a mention in todo.txt for now (with reference back
to this thread), and we can reexamine it if it ever comes up in real
life examples.
It depends on your perspective - if you are interested in any Asian
language, for example, then the worth could seem to have a greater weight
than the trouble. For example in Chinese, Japanese, or Korean where there
are no spaces between words (unless some English terms are dropped in),
these zero width space characters can be quite common for controlling the
text flow (line breaking, etc.). You can use the WJ character to stop your
name from being split across lines due to line wrapping in the desired
output format.

For the wcwidth Python module, note that some characters are currently not
correctly identified (https://github.com/jquast/wcwidth/issues/22).

Regards,

Edward
Guenter Milde
2017-02-05 18:41:38 UTC
Permalink
...
Post by Edward d'Auvergne
Post by David Goodger
Seems like more trouble than it's worth. I'm not too worried by this.
It may only warrant a mention in todo.txt for now (with reference back
to this thread), and we can reexamine it if it ever comes up in real
life examples.
It depends on your perspective - if you are interested in any Asian
language, for example, then the worth could seem to have a greater weight
than the trouble.
You are right - however, there is already a partial implementation of
wcwidth-like behaviour. The current situation (Docutils 0.13, but actually
for some time now) is:

OK:

* East Asian wide characters count 2 in tables and section headings

* combining characters count 0 in simple tables and section headings

TODO:

* combining characters count 1 in grid tables (hard to correct)

* zero width characters count 1 in tables and section headings
(easy to correct in simple tables and section headings).
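Extending the width computation to the zero-width characters listed in the
TODO items is indeed a small change. A sketch (assuming a wcswidth-like
helper as discussed earlier in the thread, not docutils' actual code):

```python
import unicodedata

# Zero-width characters: ZWSP, WJ (word joiner), ZWNBSP (BOM)
ZERO_WIDTH = frozenset(u'\u200b\u2060\ufeff')

def char_width(char):
    if char in ZERO_WIDTH or unicodedata.combining(char):
        return 0
    if unicodedata.east_asian_width(char) in ('W', 'F'):
        return 2
    return 1

def column_width(text):
    return sum(char_width(c) for c in text)
```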
Post by Edward d'Auvergne
For the wcwidth Python module, note that some characters are currently not
correctly identified (https://github.com/jquast/wcwidth/issues/22).
Given the problem of deciding "what characters are wide", the not completely
perfect state seems acceptable for now.

Feel free to open a bug report for ZWSP and the like or add some weight
to https://sourceforge.net/p/docutils/bugs/128/. Then, it might be fixed
at least for headings and simple tables in the near future...


Günter



Guenter Milde
2017-01-03 11:38:23 UTC
Permalink
Post by David Goodger
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
some other characters changed, too.
...
Post by David Goodger
For the implementation, we should definitely state a specific Unicode
version.
Agreed.
Post by David Goodger
docutils/docutils/utils/punctuation_chars.py could be updated
using the very latest Python (e.g. 3.6.0 as of this writing) with
whatever version of Unicode it supports.
Agreed.
Note: this would be an API change with the potential to make existing rST
documents invalid.
Post by David Goodger
For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
Notes:

1. This would make the validity of rST documents dependent on the
Unicode version used in the underlying platform.

Examples:

a) ``*test*¶``

invalid with Unicode 5.2.0 (e.g. Python 2.7.12)
valid with Unicode 8.0.0 (e.g., Python 3.5)

b) 25*¶

valid with Unicode 5.2.0 (e.g. Python 2.7.12)
invalid with Unicode 8.0.0 (e.g., Python 3.5)
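The version dependence in these examples is easy to observe with the stdlib
unicodedata module; the pilcrow's category is what changed between the
Unicode versions (output depends on the interpreter running the check):

```python
import unicodedata

# Unicode version the interpreter's character database is based on
print(unicodedata.unidata_version)

# Category of the pilcrow sign U+00B6: 'So' in older Unicode
# versions (e.g. 5.2.0), 'Po' in newer ones (e.g. 8.0.0)
print(unicodedata.category(u'\u00b6'))
```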

2. Currently, the character-specific behavior does not depend on the Python
interpreter, the relevant pattern strings are pre-generated and stored in
utils/punctuation_chars.py for efficiency and stability.

For consistency, we would need to

a) create the patterns in every Docutils invocation with the help of
the unicodedata module, or

b) reword the proposed spec amendment to
Post by David Goodger
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode
in the interpreter (e.g. Docutils).

Günter


David Goodger
2017-01-04 04:55:10 UTC
Permalink
Post by Guenter Milde
Post by David Goodger
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign ¶ changed,
some other characters changed, too.
...
Post by David Goodger
For the implementation, we should definitely state a specific Unicode
version.
Agreed.
Post by David Goodger
docutils/docutils/utils/punctuation_chars.py could be updated
using the very latest Python (e.g. 3.6.0 as of this writing) with
whatever version of Unicode it supports.
Agreed.
Note: this would be an API change with the potential to make existing rST
documents invalid.
Post by David Goodger
For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
1. This would make the validity of rST documents dependent on the
Unicode version used in the underlying platform.
a) ``*test*¶``
invalid with Unicode 5.2.0 (e.g. Python 2.7.12)
valid with Unicode 8.0.0 (e.g., Python 3.5)
b) 25*¶
valid with Unicode 5.2.0 (e.g. Python 2.7.12)
invalid with Unicode 8.0.0 (e.g., Python 3.5)
Which is one reason why the utils/punctuation_chars.py approach is
better, right? So let's just keep that approach, document the Unicode
version used to generate that code, and we won't have these issues.
Post by Guenter Milde
2. Currently, the character-specific behavior does not depend on the Python
interpreter, the relevant pattern strings are pre-generated and stored in
utils/punctuation_chars.py for efficiency and stability.
Which is good. Let's keep it that way.
Post by Guenter Milde
For consistency, we would need to
I don't understand what you're getting at here. "For consistency" with what?
Post by Guenter Milde
a) create the patterns in every Docutils invocation with the help of
the unicodedata module, or
b) reword the proposed spec amendment to
Post by David Goodger
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode
in the interpreter (e.g. Docutils).
Or:

c) State in the spec that character-specific behavior (punctuation,
whitespace, openers/closers) is dependent on the version of Unicode
used to generate utils/punctuation_chars.py, and that version was:
A.B.C (as implemented in Python X.Y.Z).

David Goodger
<http://python.net/~goodger>

Guenter Milde
2017-01-04 20:44:12 UTC
Permalink
Dear David,

thank you for the fast reply.
...
Post by David Goodger
Post by Guenter Milde
Post by David Goodger
For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
...
Post by David Goodger
Post by Guenter Milde
2. Currently, the character-specific behavior does not depend on the
Python interpreter, the relevant pattern strings are pre-generated
and stored in utils/punctuation_chars.py for efficiency and
stability.
Which is good. Let's keep it that way.
Post by Guenter Milde
For consistency, we would need to
I don't understand what you're getting at here. "For consistency" with what?
Your proposal for the spec
Post by David Goodger
Post by Guenter Milde
Post by David Goodger
... dependent on the version of Unicode in the underlying platform (the
Python interpreter running Docutils)" ...
contradicts the current implementation.

I am glad to hear you agree to keep the implementation via pre-generated
strings, and I suggest adapting the spec to reflect this decision:

+.. [#uni-version] The category of some characters changed with the
+ development of the Unicode standard.
+ Docutils 0.13 uses `Unicode version 5.2.0`_.

The patch below also contains the update for the script generating the
strings, so that it also works under Py3k. OK to commit?

Günter







Index: docutils/utils/punctuation_chars.py
===================================================================
--- docutils/utils/punctuation_chars.py (Revision 8010)
+++ docutils/utils/punctuation_chars.py (Arbeitskopie)
@@ -11,6 +11,8 @@
# .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause

# :Id: $Id$
+#
+# ::

import sys, re
import unicodedata
@@ -20,16 +22,31 @@
#
# This module provides the lists of characters for the implementation of
# the `inline markup recognition rules`_ in the reStructuredText parser
-# (states.py)
+# `<../parsers/rst/states.py>`__.
+# For efficiency and stability, the lists are pre-generated and stored in
+# module-level variables.
#
+# The category of some characters changed with the development of the Unicode
+# standard. The current lists are generated with the help of the "unicodedata"
+# module of Python 2.7 (based on the UnicodeData.txt file version 5.2.0).
+#
+# Running this file as a standalone module (``__main__``) checks the
+# definitions against a re-calculation. Updating the lists with a new
+# Unicode standard version is an API change (may render valid rST documents
+# invalid). It should only be done for "feature releases" and requires
+# also updating the specification of `inline markup recognition rules`_ in
+# ../../docs/ref/rst/restructuredtext.txt.
+#
# .. _inline markup recognition rules:
# ../../docs/ref/rst/restructuredtext.html#inline-markup
-
+#
+#
# Docutils punctuation category sample strings
# --------------------------------------------
#
# The sample strings are generated by punctuation_samples() and put here
-# literal to avoid the time-consuming generation with every Docutils run.
+# literal to avoid the time-consuming generation with every Docutils run
+# and prevent dependence on the Python version.
# As the samples are used inside ``[ ]`` in regular expressions, hyphen and
# square brackets are escaped. ::

@@ -83,12 +100,14 @@

# Matching open/close quotes
# --------------------------
-
+#
# Rule (5) requires determination of matching open/close pairs. However,
# the pairing of open/close quotes is ambiguous due to different typographic
-# conventions in different languages.
+# conventions in different languages. The following dictionary specifies
+# additional valid matches::

-quote_pairs = {u'\xbb': u'\xbb', # Swedish
+quote_pairs = {# open char: matching closing characters
+ u'\xbb': u'\xbb', # Swedish
u'\u2018': u'\u201a', # Greek
u'\u2019': u'\u2019', # Swedish
u'\u201a': u'\u2018\u2019', # German, Polish
@@ -99,11 +118,12 @@
}

def match_chars(c1, c2):
+ """Test whether `c1` and `c2` are a matching open/close character pair."""
try:
i = openers.index(c1)
except ValueError: # c1 not in openers
return False
- return c2 == closers[i] or c2 in quote_pairs.get(c1, '')
+ return c2 == closers[i] or c2 in quote_pairs.get(c1, u'')


# Running this file as a standalone module checks the definitions against a
@@ -114,6 +134,10 @@

# Unicode punctuation character categories
# ----------------------------------------
+#
+# For details about Unicode categories, see
+# http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
+# ::

unicode_punctuation_categories = {
# 'Pc': 'Connector', # not used in Docutils inline markup recognition
@@ -129,6 +153,8 @@

# generate character pattern strings
# ==================================
+#
+# ::

def unicode_charlists(categories, cp_min=0, cp_max=None):
"""Return dictionary of Unicode character lists.
@@ -157,6 +183,8 @@

# Character categories in Docutils
# --------------------------------
+#
+# ::

def punctuation_samples():

@@ -203,18 +231,18 @@

# allowed before markup if there is a matching closer
openers = [u'"\'(<\\[{']
- for cat in ('Ps', 'Pi', 'Pf'):
- openers.extend(ucharlists[cat])
+ for category in ('Ps', 'Pi', 'Pf'):
+ openers.extend(ucharlists[category])

# allowed after markup if there is a matching opener
closers = [u'"\')>\\]}']
- for cat in ('Pe', 'Pf', 'Pi'):
- closers.extend(ucharlists[cat])
+ for category in ('Pe', 'Pf', 'Pi'):
+ closers.extend(ucharlists[category])

# non-matching, allowed on both sides
delimiters = [u'\\-/:']
- for cat in ('Pd', 'Po'):
- delimiters.extend(ucharlists[cat])
+ for category in ('Pd', 'Po'):
+ delimiters.extend(ucharlists[category])

# non-matching, after markup
closing_delimiters = [r'\\.,;!?']
@@ -260,13 +288,13 @@

return ''.join(l2)

- def wrap_string(s, startstring= "(",
- endstring = ")", wrap=65):
+ def wrap_string(s, startstring= "(u'",
+ endstring = "')", wrap=65):
"""Line-wrap a unicode string literal definition."""
c = len(startstring)
- contstring = "'\n" + ' ' * len(startstring) + "u'"
+ contstring = "'\n" + ' ' * (len(startstring)-2) + "u'"
l = [startstring]
- for ch in s:
+ for ch in s.replace("'", r"\'"):
c += 1
if ch == '\\' and c > wrap:
c = len(startstring)
@@ -276,10 +304,25 @@
return ''.join(l)


+ def print_differences(old, new, name):
+ """List characters missing in old/new."""
+ if old != new:
+ print('new %s:' % name)
+ for c in new:
+ if c not in old:
+ print ' %04x'%ord(c), unicodedata.name(c)
+ print('removed %s:' % name)
+ for c in old:
+ if c not in new:
+ print ' %04x'%ord(c), unicodedata.name(c)
+
+
# print results
# =============
-
+#
# (re) create and compare the samples:
+#
+# ::

(o, c, d, cd) = punctuation_samples()
o, o_wide = separate_wide_chars(o)
@@ -287,40 +330,44 @@
d, d_wide = separate_wide_chars(d)
d = d[:5] + mark_intervals(d[5:])
d_wide = mark_intervals(d_wide)
- if sys.maxunicode >= 0x10FFFF: # "wide" build
- d += d_wide
- if o != openers:
- print '- openers = ur"""%s"""' % openers.encode('utf8')
- print '+ openers = ur"""%s"""' % o.encode('utf8')
+
+ print_differences(openers, o, 'openers')
if o_wide:
print '+ openers-wide = ur"""%s"""' % o_wide.encode('utf8')
- if c != closers:
- print '- closers = ur"""%s"""' % closers.encode('utf8')
- print '+ closers = ur"""%s"""' % c.encode('utf8')
+ print_differences(closers, c, 'closers')
if c_wide:
print '+ closers-wide = ur"""%s"""' % c_wide.encode('utf8')
- if d != delimiters:
- print '- delimiters = ur"%s"' % delimiters.encode('utf8')
- print '+ delimiters = ur"%s"' % d.encode('utf8')
- if cd != closing_delimiters:
- print '- closing_delimiters = ur"%s"' % closing_delimiters.encode('utf8')
- print '+ closing_delimiters = ur"%s"' % cd.encode('utf8')
- # closing_delimiters are all ASCII characters

+ print_differences(delimiters, d + d_wide, 'delimiters')
+ print_differences(closing_delimiters, cd, 'closing_delimiters')
+
# Print literal code to define the character sets:
+#
+# ::

+ print '# based on Unicode version', unicodedata.unidata_version
+
# `openers` and `closers` must be verbose and keep order because they are
# also used in `match_chars()`.
- print wrap_string(repr(o), startstring='openers = (')
- print wrap_string(repr(c), startstring='closers = (')
+ print wrap_string(o.encode('unicode-escape').decode(),
+ startstring="openers = (u'")
+ print wrap_string(c.encode('unicode-escape').decode(),
+ startstring="closers = (u'")
# delimiters: sort and use shortcut for intervals (saves ~150 characters):
- print wrap_string(repr(d), startstring='delimiters = (')
+ print wrap_string(d.encode('unicode-escape').decode(),
+ startstring="delimiters = (u'")
# add characters in the upper plane only in a "wide" build:
print 'if sys.maxunicode >= 0x10FFFF: # "wide" build'
- print wrap_string(repr(d_wide), startstring=' delimiters += (')
- print 'closing_delimiters =', repr(cd)
+ print wrap_string(d_wide.encode('unicode-escape').decode(),
+ startstring=" delimiters += (u'")
+ # additional closing delimiters:
+ print wrap_string(cd.encode('unicode-escape').decode(),
+ startstring="closing_delimiters = (u'")
+

# test prints
+#
+# ::

# print "wide" Unicode characters:
# ucharlists = unicode_charlists(unicode_punctuation_categories)
@@ -340,6 +387,8 @@
# print (u":%s: %s" % (cat, u''.join(chars))).encode('utf8')

# verbose print
+#
+# ::

# print 'openers:'
# for ch in openers:
Index: docs/ref/rst/restructuredtext.txt
===================================================================
--- docs/ref/rst/restructuredtext.txt (Revision 8010)
+++ docs/ref/rst/restructuredtext.txt (Arbeitskopie)
@@ -2442,11 +2442,11 @@
See `Escaping Mechanism`_ above for details.

5. If an inline markup start-string is immediately preceded by one of the
- ASCII characters ``' " < ( [ {`` or a similar Unicode character\ [#]_, it
- must not be followed by the corresponding closing character from ``' " )
- ] } >`` or a similar Unicode character\ [#]_.
- (For quotes, corresponding characters can be any of the `quotation marks
- in international usage`_.)
+ ASCII characters ``' " < ( [ {`` or a similar
+ Unicode character\ [#openers]_, it must not be followed by the
+ corresponding closing character from ``' " ) ] } >`` or a similar Unicode
+ character\ [#closers]_. (For quotes, matching characters can be any of
+ the `quotation marks in international usage`_.)

If the configuration setting `simple-inline-markup`_ is False (default),
additional conditions apply to the characters "around" the
@@ -2457,7 +2457,7 @@

* whitespace,
* one of the ASCII characters ``- : / ' " < ( [ {``
- * or a similar Unicode punctuation character.\ [#]_
+ * or a similar Unicode punctuation character.\ [#pre-chars]_

7. Inline markup end-strings must end a text block or be immediately
followed by
@@ -2464,18 +2464,24 @@

* whitespace,
* one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >``
- * or a similar Unicode punctuation character.\ [#]_
+ * or a similar Unicode punctuation character.\ [#post-chars]_

-.. [#] `Unicode categories`_ `Ps`, `Pi`, or `Pf`
-.. [#] Unicode categories `Pe`, `Pf`, or `Pi`
-.. [#] Unicode categories `Pd` (Dash), `Po` (Other), `Pi` (Initial quote),
- `Pf` (Final quote), or `Ps` (Open)
-.. [#] Unicode categories `Pd` (Dash), `Po` (Other), `Pi` (Initial quote),
- `Pf` (Final quote), or `Pe` (Close)
+.. [#openers] `Unicode categories`_ `Ps` (Open), `Pi` (Initial quote),
+ or `Pf` (Final quote).\ [#uni-version]_
+.. [#closers] Unicode categories `Pe` (Close), `Pi` (Initial quote),
+ or `Pf` (Final quote).\ [#uni-version]_
+.. [#pre-chars] Unicode categories `Ps` (Open), `Pi` (Initial quote),
+ `Pf` (Final quote), `Pd` (Dash), or `Po` (Other).\ [#uni-version]_
+.. [#post-chars] Unicode categories `Pe` (Close), `Pi` (Initial quote),
+ `Pf` (Final quote), `Pd` (Dash), or `Po` (Other).\ [#uni-version]_

+.. [#uni-version] The category of some characters changed with the
+ development of the Unicode standard.
+ Docutils 0.13 uses `Unicode version 5.2.0`_.
+
.. _Unicode categories:
http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
-
+.. _Unicode version 5.2.0: http://www.unicode.org/Public/5.2.0/
.. _quotation marks in international usage:
http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage
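For context, the match_chars() helper touched by the patch pairs each
opener with the closer at the same index, falling back to quote_pairs for
language-specific matches. Its behaviour can be sketched independently of
docutils with hypothetical minimal lists (the real ones are generated from
the Unicode categories):

```python
# Minimal stand-ins for the generated openers/closers strings;
# index i of `openers` corresponds to index i of `closers`.
openers = u'(<[{\u00ab\u2018\u201c\u00bb'
closers = u')>]}\u00bb\u2019\u201d\u00ab'
quote_pairs = {u'\u00bb': u'\u00bb'}  # Swedish: » ... »

def match_chars(c1, c2):
    """Test whether `c1` and `c2` are a matching open/close pair."""
    try:
        i = openers.index(c1)
    except ValueError:  # c1 is not an opener
        return False
    return c2 == closers[i] or c2 in quote_pairs.get(c1, u'')
```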



David Goodger
2017-01-04 21:08:31 UTC
Permalink
The content looks fine. Some stylistic gripes:

1. In docs/ref/rst/restructuredtext.txt, why are there escaped spaces
("\ ") before each footnote reference? That makes the resulting HTML
look weird. I recommend removing the backslash-escapes.

2. In docutils/utils/punctuation_chars.py, there are a lot of
module-level definitions, function definitions, and raw executable
code under the "if __name__ == '__main__':" guard. That smells really
bad. I recommend pulling everything out to the module level, putting
the module-level executable code in well-named functions, and reducing
the contents of "if __name__ == '__main__':" to a simple function call
(or two or three). (This is not specific to this patch, but to the
module itself. As it is, the module *must* be run from the command
line. It cannot be tested from another script or from the interactive
interpreter, which is bad.)

Feel free to make those changes in a separate commit. Or I can do them
if you like.

David Goodger
<http://python.net/~goodger>
Post by Guenter Milde
Dear David,
thank you for the fast reply.
...
Post by David Goodger
Post by Guenter Milde
Post by David Goodger
For the spec, don't mention any specific version of Unicode. Just
state that character-specific behavior (punctuation, whitespace,
openers/closers) is dependent on the version of Unicode in the
underlying platform (the Python interpreter running Docutils).
...
Post by David Goodger
Post by Guenter Milde
2. Currently, the character-specific behavior does not depend on the
Python interpreter, the relevant pattern strings are pre-generated
and stored in utils/punctuation_chars.py for efficiency and
stability.
Which is good. Let's keep it that way.
Post by Guenter Milde
For consistency, we would need to
I don't understand what you're getting at here. "For consistency" with what?
Your proposal for the spec
Post by David Goodger
Post by Guenter Milde
Post by David Goodger
... dependent on the version of Unicode in the underlying platform (the
Python interpreter running Docutils)" ...
contradicts the current implementation.
I am glad to hear you agree to keep the implementation via pre-generated
+.. [#uni-version] The category of some characters changed with the
+ development of the Unicode standard.
+ Docutils 0.13 uses `Unicode version 5.2.0`_.
The patch below contains also the update for the script generating the
strings so that this also works under Py3k. OK to commit?
Günter
Index: docutils/utils/punctuation_chars.py
===================================================================
--- docutils/utils/punctuation_chars.py (Revision 8010)
+++ docutils/utils/punctuation_chars.py (Arbeitskopie)
@@ -11,6 +11,8 @@
# .. _2-Clause BSD license: http://www.spdx.org/licenses/BSD-2-Clause
# :Id: $Id$
+#
import sys, re
import unicodedata
@@ -20,16 +22,31 @@
#
# This module provides the lists of characters for the implementation of
# the `inline markup recognition rules`_ in the reStructuredText parser
-# (states.py)
+# `<../parsers/rst/states.py>`__.
+# For efficiency and stability, the lists are pre-generated and stored in
+# module-level variables.
#
+# The category of some characters changed with the development of the Unicode
+# standard. The current lists are generated with the help of the "unicodedata"
+# module of Python 2.7 (based on the UnicodeData.txt file version 5.2.0).
+#
+# Running this file as a standalone module (``__main__``) checks the
+# definitions against a re-calculation. Updating the lists with a new
+# Unicode standard version is an API change (may render valid rST documents
+# invalid). It should only be done for "feature releases" and requires
+# also updating the specification of `inline markup recognition rules`_ in
+# ../../docs/ref/rst/restructuredtext.txt.
+#
# ../../docs/ref/rst/restructuredtext.html#inline-markup
-
+#
+#
# Docutils punctuation category sample strings
# --------------------------------------------
#
# The sample strings are generated by punctuation_samples() and put here
-# literal to avoid the time-consuming generation with every Docutils run.
+# literal to avoid the time-consuming generation with every Docutils run
+# and prevent dependence on the Python version.
# As the samples are used inside ``[ ]`` in regular expressions, hyphen and
@@ -83,12 +100,14 @@
# Matching open/close quotes
# --------------------------
-
+#
# Rule (5) requires determination of matching open/close pairs. However,
# the pairing of open/close quotes is ambigue due to different typographic
-# conventions in different languages.
+# conventions in different languages. The following dictionary specifies
-quote_pairs = {u'\xbb': u'\xbb', # Swedish
+quote_pairs = {# open char: matching closing characters
+ u'\xbb': u'\xbb', # Swedish
u'\u2018': u'\u201a', # Greek
u'\u2019': u'\u2019', # Swedish
u'\u201a': u'\u2018\u2019', # German, Polish
@@ -99,11 +118,12 @@
}
+ """Test whether `c1` and `c2` are a matching open/close character pair."""
i = openers.index(c1)
except ValueError: # c1 not in openers
return False
- return c2 == closers[i] or c2 in quote_pairs.get(c1, '')
+ return c2 == closers[i] or c2 in quote_pairs.get(c1, u'')
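For readers following along, the matching logic the hunk above touches can be reproduced as a small self-contained sketch (the sample strings are shortened here; the real `openers`/`closers` are the long generated strings shown later in this thread, and only two `quote_pairs` entries are kept):

```python
# Self-contained sketch of match_chars() with shortened sample strings.
openers = u'"\'(<[{\xbb'     # last entry: » (also used as an opener in Swedish)
closers = u'"\')>]}\xab'
quote_pairs = {u'\xbb': u'\xbb',       # Swedish »quote»
               u'\u2019': u'\u2019'}   # Swedish 'quote'

def match_chars(c1, c2):
    """Test whether `c1` and `c2` are a matching open/close character pair."""
    try:
        i = openers.index(c1)
    except ValueError:  # c1 not in openers
        return False
    return c2 == closers[i] or c2 in quote_pairs.get(c1, u'')
```

The `quote_pairs` lookup is what lets a language-specific closer (e.g. a second ») match even though it is not the positional partner in `closers`.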
# Running this file as a standalone module checks the definitions against a
@@ -114,6 +134,10 @@
# Unicode punctuation character categories
# ----------------------------------------
+#
+# For details about Unicode categories, see
+# http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
unicode_punctuation_categories = {
# 'Pc': 'Connector', # not used in Docutils inline markup recognition
@@ -129,6 +153,8 @@
# generate character pattern strings
# ==================================
+#
"""Return dictionary of Unicode character lists.
@@ -157,6 +183,8 @@
# Character categories in Docutils
# --------------------------------
+#
@@ -203,18 +231,18 @@
# allowed before markup if there is a matching closer
openers = [u'"\'(<\\[{']
- openers.extend(ucharlists[cat])
+ openers.extend(ucharlists[category])
# allowed after markup if there is a matching opener
closers = [u'"\')>\\]}']
- closers.extend(ucharlists[cat])
+ closers.extend(ucharlists[category])
# non-matching, allowed on both sides
delimiters = [u'\\-/:']
- delimiters.extend(ucharlists[cat])
+ delimiters.extend(ucharlists[category])
# non-matching, after markup
closing_delimiters = [r'\\.,;!?']
@@ -260,13 +288,13 @@
return ''.join(l2)
- def wrap_string(s, startstring= "(",
+ def wrap_string(s, startstring= "(u'",
"""Line-wrap a unicode string literal definition."""
c = len(startstring)
- contstring = "'\n" + ' ' * len(startstring) + "u'"
+ contstring = "'\n" + ' ' * (len(startstring)-2) + "u'"
l = [startstring]
c += 1
c = len(startstring)
@@ -276,10 +304,25 @@
return ''.join(l)
+ """List characters missing in old/new."""
+ print('new %s:' % name)
+ print ' %04x'%ord(c), unicodedata.name(c)
+ print('removed %s:' % name)
+ print ' %04x'%ord(c), unicodedata.name(c)
+
+
# print results
# =============
-
+#
+#
(o, c, d, cd) = punctuation_samples()
o, o_wide = separate_wide_chars(o)
@@ -287,40 +330,44 @@
d, d_wide = separate_wide_chars(d)
d = d[:5] + mark_intervals(d[5:])
d_wide = mark_intervals(d_wide)
- if sys.maxunicode >= 0x10FFFF: # "wide" build
- d += d_wide
- print '- openers = ur"""%s"""' % openers.encode('utf8')
- print '+ openers = ur"""%s"""' % o.encode('utf8')
+
+ print_differences(openers, o, 'openers')
print '+ openers-wide = ur"""%s"""' % o_wide.encode('utf8')
- print '- closers = ur"""%s"""' % closers.encode('utf8')
- print '+ closers = ur"""%s"""' % c.encode('utf8')
+ print_differences(closers, c, 'closers')
print '+ closers-wide = ur"""%s"""' % c_wide.encode('utf8')
- print '- delimiters = ur"%s"' % delimiters.encode('utf8')
- print '+ delimiters = ur"%s"' % d.encode('utf8')
- print '- closing_delimiters = ur"%s"' % closing_delimiters.encode('utf8')
- print '+ closing_delimiters = ur"%s"' % cd.encode('utf8')
- # closing_delimiters are all ASCII characters
+ print_differences(delimiters, d + d_wide, 'delimiters')
+ print_differences(closing_delimiters, cd, 'closing_delimiters')
+
+#
+ print '# based on Unicode version', unicodedata.unidata_version
+
# `openers` and `closers` must be verbose and keep order because they are
# also used in `match_chars()`.
- print wrap_string(repr(o), startstring='openers = (')
- print wrap_string(repr(c), startstring='closers = (')
+ print wrap_string(o.encode('unicode-escape').decode(),
+ startstring="openers = (u'")
+ print wrap_string(c.encode('unicode-escape').decode(),
+ startstring="closers = (u'")
- print wrap_string(repr(d), startstring='delimiters = (')
+ print wrap_string(d.encode('unicode-escape').decode(),
+ startstring="delimiters = (u'")
print 'if sys.maxunicode >= 0x10FFFF: # "wide" build'
- print wrap_string(repr(d_wide), startstring=' delimiters += (')
- print 'closing_delimiters =', repr(cd)
+ print wrap_string(d_wide.encode('unicode-escape').decode(),
+ startstring=" delimiters += (u'")
+ print wrap_string(cd.encode('unicode-escape').decode(),
+ startstring="closing_delimiters = (u'")
+
# test prints
+#
# ucharlists = unicode_charlists(unicode_punctuation_categories)
@@ -340,6 +387,8 @@
# print (u":%s: %s" % (cat, u''.join(chars))).encode('utf8')
# verbose print
+#
# print 'openers:'
Index: docs/ref/rst/restructuredtext.txt
===================================================================
--- docs/ref/rst/restructuredtext.txt (Revision 8010)
+++ docs/ref/rst/restructuredtext.txt (Arbeitskopie)
@@ -2442,11 +2442,11 @@
See `Escaping Mechanism`_ above for details.
5. If an inline markup start-string is immediately preceded by one of the
- ASCII characters ``' " < ( [ {`` or a similar Unicode character\ [#]_, it
- must not be followed by the corresponding closing character from ``' " )
- ] } >`` or a similar Unicode character\ [#]_.
- (For quotes, corresponding characters can be any of the `quotation marks
- in international usage`_.)
+ ASCII characters ``' " < ( [ {`` or a similar
+ Unicode character\ [#openers]_, it must not be followed by the
+ corresponding closing character from ``' " ) ] } >`` or a similar Unicode
+ character\ [#closers]_. (For quotes, matching characters can be any of
+ the `quotation marks in international usage`_.)
If the configuration setting `simple-inline-markup`_ is False (default),
additional conditions apply to the characters "around" the
@@ -2457,7 +2457,7 @@
* whitespace,
* one of the ASCII characters ``- : / ' " < ( [ {``
- * or a similar Unicode punctuation character.\ [#]_
+ * or a similar Unicode punctuation character.\ [#pre-chars]_
7. Inline markup end-strings must end a text block or be immediately
followed by
@@ -2464,18 +2464,24 @@
* whitespace,
* one of the ASCII characters ``- . , : ; ! ? \ / ' " ) ] } >``
- * or a similar Unicode punctuation character.\ [#]_
+ * or a similar Unicode punctuation character.\ [#post-chars]_
-.. [#] `Unicode categories`_ `Ps`, `Pi`, or `Pf`
-.. [#] Unicode categories `Pe`, `Pf`, or `Pi`
-.. [#] Unicode categories `Pd` (Dash), `Po` (Other), `Pi` (Initial quote),
- `Pf` (Final quote), or `Ps` (Open)
-.. [#] Unicode categories `Pd` (Dash), `Po` (Other), `Pi` (Initial quote),
- `Pf` (Final quote), or `Pe` (Close)
+.. [#openers] `Unicode categories`_ `Ps` (Open), `Pi` (Initial quote),
+ or `Pf` (Final quote).\ [#uni-version]_
+.. [#closers] Unicode categories `Pe` (Close), `Pi` (Initial quote),
+ or `Pf` (Final quote).\ [#uni-version]_
+.. [#pre-chars] Unicode categories `Ps` (Open), `Pi` (Initial quote),
+ `Pf` (Final quote), `Pd` (Dash), or `Po` (Other).\ [#uni-version]_
+.. [#post-chars] Unicode categories `Pe` (Close), `Pi` (Initial quote),
+ `Pf` (Final quote), `Pd` (Dash), or `Po` (Other).\ [#uni-version]_
+.. [#uni-version] The category of some characters changed with the
+ development of the Unicode standard.
+ Docutils 0.13 uses `Unicode version 5.2.0`_.
+
http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
-
+.. _Unicode version 5.2.0: http://www.unicode.org/Public/5.2.0/
http://en.wikipedia.org/wiki/Quotation_mark,_non-English_usage
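As a sanity check of the updated rule 7, the condition can be expressed directly with the `unicodedata` module (a sketch of the rule as specified above, not the parser's actual code):

```python
# Sketch of inline markup recognition rule 7: the character following an
# inline markup end-string must be whitespace, one of the listed ASCII
# characters, or a Unicode punctuation character in Pe, Pi, Pf, Pd, or Po.
import unicodedata

def allowed_after_end_string(char):
    return (char.isspace()
            or char in u'-.,:;!?\\/\'")]}>'
            or unicodedata.category(char) in ('Pe', 'Pi', 'Pf', 'Pd', 'Po'))
```

With a modern Unicode database the pilcrow sign (U+00B6, category Po since Unicode 6.0) passes this check, which is exactly the behavior the bug report asks for.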
Guenter Milde
2017-01-04 23:19:03 UTC
Permalink
Post by David Goodger
1. In docs/ref/rst/restructuredtext.txt, why are there escaped spaces
("\ ") before each footnote reference? That makes the resulting HTML
look weird. I recommend removing the backslash-escapes.
Done and committed.
Post by David Goodger
2. In docutils/utils/punctuation_chars.py, there are a lot of
module-level definitions, function definitions, and raw executable
code under the "if __name__ == '__main__':" guard. That smells really
bad. I recommend pulling everything out to the module level, putting
the module-level executable code in well-named functions, and reducing
the contents of "if __name__ == '__main__':" to a simple function call
(or two or three). (This is not specific to this patch, but to the
module itself. As it is, the module *must* be run from the command
line. It cannot be tested from another script or from the interactive
interpreter, which is bad.)
Currently, only the definitions required in Docutils are at module level.

The rest is only required to manually check and, if necessary,
update/regenerate the samples.

If you want them separate, I propose to split the module into 2 files:

punctuation_chars.py # the part before if __name__ == '__main__'

generate_punctuation_chars.py # script for (re)generating the samples and
test for differences between re-generated
and stored samples.

The latter would be a tool for developers to check and, if necessary,
update the former. Probably a candidate for tools/dev.


Günter


David Goodger
2017-01-05 02:04:46 UTC
Permalink
Post by Guenter Milde
Post by David Goodger
1. In docs/ref/rst/restructuredtext.txt, why are there escaped spaces
("\ ") before each footnote reference? That makes the resulting HTML
look weird. I recommend removing the backslash-escapes.
Done and committed.
Post by David Goodger
2. In docutils/utils/punctuation_chars.py, there are a lot of
module-level definitions, function definitions, and raw executable
code under the "if __name__ == '__main__':" guard. That smells really
bad. I recommend pulling everything out to the module level, putting
the module-level executable code in well-named functions, and reducing
the contents of "if __name__ == '__main__':" to a simple function call
(or two or three). (This is not specific to this patch, but to the
module itself. As it is, the module *must* be run from the command
line. It cannot be tested from another script or from the interactive
interpreter, which is bad.)
Currently, only the definitions required in Docutils are at module level.
The rest is only required to manually check and eventually
update/regenerate the samples.
punctuation_chars.py # the part before if __name__ == '__main__'
generate_punctuation_chars.py # script for (re)generating the samples and
test for differences between re-generated
and stored samples.
The latter would be a tool for developers to check and eventually update
the first. Probably a candidate for tools/dev.
That sounds good. It would be great if generate_punctuation_chars.py
produces the entirety of punctuation_chars.py, including all code,
comments (even the $Id$ slug), and docstrings. Then regenerating
punctuation_chars.py would require no manual editing.
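A minimal sketch of that approach (the template text and function name here are hypothetical, chosen only to illustrate the idea):

```python
# Hypothetical sketch: the generator emits the complete module text,
# header comments included, so regeneration needs no manual editing.
MODULE_TEMPLATE = """\
# :Id: $Id$
# This file is auto-generated by generate_punctuation_chars.py.
# Do not edit it by hand; regenerate it instead.
import sys

openers = %(openers)r
closers = %(closers)r
"""

def render_module(samples):
    """Return the full source text of punctuation_chars.py."""
    return MODULE_TEMPLATE % samples
```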

David Goodger
<http://python.net/~goodger>

Guenter Milde
2017-01-05 10:19:37 UTC
Permalink
Dear David,
Post by David Goodger
Post by Guenter Milde
punctuation_chars.py # the part before if __name__ == '__main__'
generate_punctuation_chars.py # script for (re)generating the samples and
test for differences between re-generated
and stored samples.
The latter would be a tool for developers to check and eventually update
the first. Probably a candidate for tools/dev.
That sounds good.
Done and committed.
Post by David Goodger
It would be great if generate_punctuation_chars.py
produces the entirety of punctuation_chars.py, including all code,
comments (even the $Id$ slug), and docstrings. Then regenerating
punctuation_chars.py would require no manual editing.
Good idea. I will work on this later.


Currently, the tool has two functions:

a) diagnosis/test

(useful for diagnosis and for documenting changes,
cf. the appended output for Py 3.5)

b) generating the character-class pattern definition code,

which should be separated, either by

i. moving into two tool files, or
ii. a command line argument or option.


What would you prefer?

Günter



Output for Py 3.5 / Unicode version 8.0.0:

new openers:
2308 LEFT CEILING
230a LEFT FLOOR
2e42 DOUBLE LOW-REVERSED-9 QUOTATION MARK
fd3f ORNATE RIGHT PARENTHESIS
removed openers:
fd3e ORNATE LEFT PARENTHESIS
new closers:
2309 RIGHT CEILING
230b RIGHT FLOOR
fd3e ORNATE LEFT PARENTHESIS
removed closers:
fd3f ORNATE RIGHT PARENTHESIS
new delimiters:
00a7 SECTION SIGN
00b6 PILCROW SIGN
085e MANDAIC PUNCTUATION
0af0 GUJARATI ABBREVIATION SIGN
0f14 TIBETAN MARK GTER TSHEG
0fd9 TIBETAN MARK LEADING MCHAN RTAGS
0fda TIBETAN MARK TRAILING MCHAN RTAGS
1360 ETHIOPIC SECTION MARK
1bfc BATAK SYMBOL BINDU NA METEK
1bff BATAK SYMBOL BINDU PANGOLAT
1cc0 SUNDANESE PUNCTUATION BINDU SURYA
1cc7 SUNDANESE PUNCTUATION BINDU BA SATANGA
2d70 TIFINAGH SEPARATOR MARK
2e41 REVERSED COMMA
a8fc DEVANAGARI SIGN SIDDHAM
aaf0 MEETEI MAYEK CHEIKHAN
aaf1 MEETEI MAYEK AHANG KHUDAM
10102 AEGEAN CHECK MARK
1056f CAUCASIAN ALBANIAN CITATION MARK
10af0 MANICHAEAN PUNCTUATION STAR
10af6 MANICHAEAN PUNCTUATION LINE FILLER
10b99 PSALTER PAHLAVI SECTION MARK
10b9c PSALTER PAHLAVI FOUR DOTS WITH DOT
11047 BRAHMI DANDA
1104d BRAHMI PUNCTUATION LOTUS
11140 CHAKMA SECTION MARK
11143 CHAKMA QUESTION MARK
11174 MAHAJANI ABBREVIATION SIGN
11175 MAHAJANI SECTION MARK
111c5 SHARADA DANDA
111c9 SHARADA SANDHI MARK
111cd SHARADA SUTRA MARK
111db SHARADA SIGN SIDDHAM
111dd SHARADA CONTINUATION SIGN
111df SHARADA SECTION MARK-2
11238 KHOJKI DANDA
1123d KHOJKI ABBREVIATION SIGN
112a9 MULTANI SECTION MARK
114c6 TIRHUTA ABBREVIATION SIGN
115c1 SIDDHAM SIGN SIDDHAM
115d7 SIDDHAM SECTION MARK WITH CIRCLES AND FOUR ENCLOSURES
11641 MODI DANDA
11643 MODI ABBREVIATION SIGN
1173c AHOM SIGN SMALL SECTION
1173e AHOM SIGN RULAI
12474 CUNEIFORM PUNCTUATION SIGN DIAGONAL QUADCOLON
16a6e MRO DANDA
16a6f MRO DOUBLE DANDA
16af5 BASSA VAH FULL STOP
16b37 PAHAWH HMONG SIGN VOS THOM
16b3b PAHAWH HMONG SIGN VOS FEEM
16b44 PAHAWH HMONG SIGN XAUS
1bc9f DUPLOYAN PUNCTUATION CHINOOK FULL STOP
1da87 SIGNWRITING COMMA
1da8b SIGNWRITING PARENTHESIS
removed delimiters:
1361 ETHIOPIC WORDSPACE
19de NEW TAI LUE SIGN LAE
19df NEW TAI LUE SIGN LAEV
2e31 WORD SEPARATOR MIDDLE DOT
10101 AEGEAN WORD SEPARATOR DOT
12473 CUNEIFORM PUNCTUATION SIGN DIAGONAL TRICOLON

# based on Unicode version 8.0.0
openers = (u'"\'(<\\[{\u0f3a\u0f3c\u169b\u2045\u207d\u208d\u2308\u230a'
u'\u2329\u2768\u276a\u276c\u276e\u2770\u2772\u2774\u27c5'
u'\u27e6\u27e8\u27ea\u27ec\u27ee\u2983\u2985\u2987\u2989'
u'\u298b\u298d\u298f\u2991\u2993\u2995\u2997\u29d8\u29da'
u'\u29fc\u2e22\u2e24\u2e26\u2e28\u2e42\u3008\u300a\u300c'
u'\u300e\u3010\u3014\u3016\u3018\u301a\u301d\u301d\ufd3f'
u'\ufe17\ufe35\ufe37\ufe39\ufe3b\ufe3d\ufe3f\ufe41\ufe43'
u'\ufe47\ufe59\ufe5b\ufe5d\uff08\uff3b\uff5b\uff5f\uff62'
u'\xab\u2018\u201c\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c\u2e20'
u'\u201a\u201e\xbb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d'
u'\u2e1d\u2e21\u201b\u201f')
closers = (u'"\')>\\]}\u0f3b\u0f3d\u169c\u2046\u207e\u208e\u2309\u230b'
u'\u232a\u2769\u276b\u276d\u276f\u2771\u2773\u2775\u27c6'
u'\u27e7\u27e9\u27eb\u27ed\u27ef\u2984\u2986\u2988\u298a'
u'\u298c\u298e\u2990\u2992\u2994\u2996\u2998\u29d9\u29db'
u'\u29fd\u2e23\u2e25\u2e27\u2e29\u3009\u300b\u300d\u300f'
u'\u3011\u3015\u3017\u3019\u301b\u301e\u301f\ufd3e\ufe18'
u'\ufe36\ufe38\ufe3a\ufe3c\ufe3e\ufe40\ufe42\ufe44\ufe48'
u'\ufe5a\ufe5c\ufe5e\uff09\uff3d\uff5d\uff60\uff63\xbb\u2019'
u'\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d\u2e21\u201b'
u'\u201f\xab\u2018\u201c\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c'
u'\u2e20\u201a\u201e')
delimiters = (u'\\-/:\u058a\xa1\xa7\xb6\xb7\xbf\u037e\u0387\u055a-'
u'\u055f\u0589\u05be\u05c0\u05c3\u05c6\u05f3\u05f4\u0609'
u'\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4'
u'\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964'
u'\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12'
u'\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f'
u'\u10fb\u1360-\u1368\u1400\u166d\u166e\u16eb-\u16ed'
u'\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u180a'
u'\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad'
u'\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f'
u'\u1cc0-\u1cc7\u1cd3\u2010-\u2017\u2020-\u2027\u2030-'
u'\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053'
u'\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00'
u'\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e1b\u2e1e\u2e1f'
u'\u2e2a-\u2e2e\u2e30-\u2e41\u3001-\u3003\u301c\u3030'
u'\u303d\u30a0\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e'
u'\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa'
u'\ua8fc\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-'
u'\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19'
u'\ufe30-\ufe32\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52'
u'\ufe54-\ufe58\ufe5f-\ufe61\ufe63\ufe68\ufe6a\ufe6b'
u'\uff01-\uff03\uff05-\uff07\uff0a\uff0c-\uff0f\uff1a'
u'\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65')
if sys.maxunicode >= 0x10FFFF: # "wide" build
delimiters += (u'\U00010100-\U00010102\U0001039f\U000103d0\U0001056f'
u'\U00010857\U0001091f\U0001093f\U00010a50-\U00010a58'
u'\U00010a7f\U00010af0-\U00010af6\U00010b39-\U00010b3f'
u'\U00010b99-\U00010b9c\U00011047-\U0001104d\U000110bb'
u'\U000110bc\U000110be-\U000110c1\U00011140-\U00011143'
u'\U00011174\U00011175\U000111c5-\U000111c9\U000111cd'
u'\U000111db\U000111dd-\U000111df\U00011238-\U0001123d'
u'\U000112a9\U000114c6\U000115c1-\U000115d7\U00011641-'
u'\U00011643\U0001173c-\U0001173e\U00012470-\U00012474'
u'\U00016a6e\U00016a6f\U00016af5\U00016b37-\U00016b3b'
u'\U00016b44\U0001bc9f\U0001da87-\U0001da8b')
closing_delimiters = (u'\\\\.,;!?')


David Goodger
2017-01-16 21:51:38 UTC
Permalink
Post by Guenter Milde
Dear David,
Post by David Goodger
Post by Guenter Milde
punctuation_chars.py # the part before if _name_ == '__main__'
generate_punctuation_chars.py # script for (re)generating the samples and
test for differences between re-generated
and stored samples.
The latter would be a tool for developers to check and eventually update
the first. Probably a candidate for tools/dev.
That sounds good.
Done and committed.
Great, thanks!
Post by Guenter Milde
Post by David Goodger
It would be great if generate_punctuation_chars.py
produces the entirety of punctuation_chars.py, including all code,
comments (even the $Id$ slug), and docstrings. Then regenerating
punctuation_chars.py would require no manual editing.
Good idea. I will work in this later.
Currently, the tool has 2 functions,
a) diagnosis/test
(usefull for diagnosis and documenting changes,
cf appended output for Py 3.5)
b) generate character class pattern definition code,
which should be separated, by
i. moving into 2 tool files
ii. command line argument or option
What would you prefer?
I don't have a strong opinion. My gut feeling is that it would be
better to keep it in one tool. Since you already named it
generate_punctuation_chars.py, I'd add a -t/--test or -c/--changes
option to that tool.
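That single-tool interface could look roughly like this (the option names are from the suggestion above; everything else is hypothetical):

```python
# Hypothetical command-line interface for generate_punctuation_chars.py:
# the default action generates the definitions, while -t/--test only
# reports differences against the stored samples.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description='(Re)generate the punctuation_chars.py sample strings.')
    parser.add_argument('-t', '--test', action='store_true',
                        help='compare regenerated samples against the '
                             'stored ones instead of printing new code')
    return parser
```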

David Goodger
<http://python.net/~goodger>
Edward d'Auvergne
2017-01-02 10:01:49 UTC
Permalink
Post by Guenter Milde
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The pilcrow sign belongs to the [Unicode PO class]. According to
Inline markup recognition rule 7, this is an allowed character after
the inline markup end string.
...
Post by Edward d'Auvergne
Post by Guenter Milde
Post by Günter Milde
The character class of the pilcrow sign # changed,
...
Post by Edward d'Auvergne
Post by Guenter Milde
a) close the bug as invalid and document we are using
Unicode 5.2 for inline markup rules
b) use the Py-3 generated strings with
``# -*- coding: utf-8 -*-`` and prepended ``u``
c) leave this open until someone updates either Py2.7's "unicodedata" or the
code-generating part of punctuation_chars.py so that the original format
can be pertained?
[snip]
Post by Guenter Milde
Post by Edward d'Auvergne
For the reStructuredText standard, it would be
logical to pin it to the latest Unicode standard (rather than be tied
down based on the docutils implementation being restricted by the
underlying Python install).
However, rather than specifying "the latest", we should state a specific
version in the docs and update it together with the actual code.
Why? We don't track the Unicode standard development and auto-update our
patterns.
https://docs.python.org/3/library/codecs.html
My perspective is rather different. I'm looking at the
reStructuredText spec from the hypothetical viewpoint of someone
wanting to create a clean-room implementation. To make the
reStructuredText specification dependent on what code is in docutils
at any point in time defeats its purpose of being an independent
specification. Why should alternative implementations be bound to the
internal details of the docutils implementation?
Post by Guenter Milde
Post by Edward d'Auvergne
For the docutils implementation,
implementing a basic module, possibly auto-generated by a script, to
eliminate problems in the Unicode standard implementation in the local
Python installation so that docutils performs idenically on all
systems would make a lot of sense.
This is already the case: Docutils uses (auto-generated) strings with
characters of the several classes in punctuation_chars.py. These do not
depend on the Python installation (but are actually from Python 2.x and
hence match Unicode 5.2.1.).
I have deliberately not looked at the code, even though I have the
repository checked out.
Post by Guenter Milde
My question b/c was a Docutils-internal one: there is a style-guide saying
that literal Unicode should be avoided (except for language files) dating
from times when Unicode support in editors, fonts, ... was patchy.
As the proposal included an option to mandate an out-of-date Unicode
version number in the reStructuredText specification, based on the
current state of the docutils code, it affects the external
aspects in a significant way.
``name``¶ - which is used extensively in the Python HTML docs (see the
permalinks when hovering over text at
https://docs.python.org/3/library/codecs.html for example) - is solely
for docutils, as the reST specification does not mandate a Unicode
version and hence the latest version would be a reasonable assumption.

Regards,

Edward

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Docutils-develop mailing list
Docutils-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/docutils-develop

Please use "Reply All" to reply to the list.
Günter Milde
2016-12-30 21:04:30 UTC
Permalink
Running the self-test in the Python-3 version of utils/punctuation_chars.py shows that there are some more changes.
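Such differences come from the Unicode version baked into each interpreter's unicodedata tables. A quick, generic way to see which version a given Python provides (not part of punctuation_chars.py itself):

```python
import unicodedata

# Each CPython release ships unicodedata tables for one specific Unicode
# version; lists generated on different interpreters can therefore differ.
print(unicodedata.unidata_version)
```

Running the self-test under interpreters that report different versions here is exactly the situation that produces "some more changes".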


---

** [bugs:#305] Inline markup end strings and the Unicode PO character set (rule 7 breakage).**

**Status:** open
**Group:** Default
**Created:** Fri Dec 30, 2016 12:10 PM UTC by Edward d'Auvergne
**Last Updated:** Fri Dec 30, 2016 09:01 PM UTC
**Owner:** nobody


The following text reveals a bug in docutils::

``name``¶

The pilcrow sign belongs to the [Unicode PO class](http://www.fileformat.info/info/unicode/category/Po/list.htm). According to Inline markup recognition rule 7, this is an allowed character after the inline markup end string. However according to rstcheck, and the other rst converters, this is broken::

test.rst:1: (WARNING/2) Inline literal start-string without end-string.


---

Sent from sourceforge.net because docutils-***@lists.sourceforge.net is subscribed to https://sourceforge.net/p/docutils/bugs/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/docutils/admin/bugs/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
Edward d'Auvergne
2016-12-30 21:21:20 UTC
Permalink
Python's Unicode support is terrible - just try textwrap.wrap() to see how the zero-width space and double-width characters (Chinese, for example - 䞀二䞉四五六䞃八九十) are handled. Why not just hard-code the Unicode categories into lists within a docutils module? That gives the flexibility of preformatting the lists into the internal docutils text format for faster lookups.


Günter Milde
2016-12-31 09:40:05 UTC
Permalink
Did you have a look at docutils/utils/punctuation_chars.py? It already does what you propose.
However, the lists are generated using "unicodedata" (see the second part of punctuation_chars.py):

-1: currently, outdated data from Python 2.7's "unicodedata" is used.
+1: updating is simple and predictable when "unicodedata" is updated.
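For reference, the generation step described above can be sketched like this (a minimal standalone version, not the actual punctuation_chars.py code):

```python
import sys
import unicodedata

def chars_in_category(category):
    """Collect every code point whose Unicode general category matches,
    according to this interpreter's unicodedata tables."""
    return ''.join(chr(cp) for cp in range(sys.maxunicode + 1)
                   if unicodedata.category(chr(cp)) == category)

po_chars = chars_in_category('Po')
assert '\u00B6' in po_chars  # the pilcrow sign from the bug report
```

Because the output depends on the interpreter's unicodedata version, the generated strings are frozen into the module rather than computed at import time.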


Edward d'Auvergne
2016-12-30 21:32:17 UTC
Permalink
For example, this is what I use:

~~~
# Unicode category Pd (Dash).
UNICODE_PD = [
u'\u002D', u'\u058A', u'\u05BE', u'\u1400', u'\u1806', u'\u2010', u'\u2011', u'\u2012', u'\u2013', u'\u2014',
u'\u2015', u'\u2E17', u'\u2E1A', u'\u2E3A', u'\u2E3B', u'\u2E40', u'\u301C', u'\u3030', u'\u30A0', u'\uFE31',
u'\uFE32', u'\uFE58', u'\uFE63', u'\uFF0D'
]

# Unicode category Po (Other).
UNICODE_PO = [
u'\u0021', u'\u0022', u'\u0023', u'\u0025', u'\u0026', u'\u0027', u'\u002A', u'\u002C', u'\u002E', u'\u002F',
u'\u003A', u'\u003B', u'\u003F', u'\u0040', u'\u005C', u'\u00A1', u'\u00A7', u'\u00B6', u'\u00B7', u'\u00BF',
u'\u037E', u'\u0387', u'\u055A', u'\u055B', u'\u055C', u'\u055D', u'\u055E', u'\u055F', u'\u0589', u'\u05C0',
u'\u05C3', u'\u05C6', u'\u05F3', u'\u05F4', u'\u0609', u'\u060A', u'\u060C', u'\u060D', u'\u061B', u'\u061E',
u'\u061F', u'\u066A', u'\u066B', u'\u066C', u'\u066D', u'\u06D4', u'\u0700', u'\u0701', u'\u0702', u'\u0703',
u'\u0704', u'\u0705', u'\u0706', u'\u0707', u'\u0708', u'\u0709', u'\u070A', u'\u070B', u'\u070C', u'\u070D',
u'\u07F7', u'\u07F8', u'\u07F9', u'\u0830', u'\u0831', u'\u0832', u'\u0833', u'\u0834', u'\u0835', u'\u0836',
u'\u0837', u'\u0838', u'\u0839', u'\u083A', u'\u083B', u'\u083C', u'\u083D', u'\u083E', u'\u085E', u'\u0964',
u'\u0965', u'\u0970', u'\u0AF0', u'\u0DF4', u'\u0E4F', u'\u0E5A', u'\u0E5B', u'\u0F04', u'\u0F05', u'\u0F06',
u'\u0F07', u'\u0F08', u'\u0F09', u'\u0F0A', u'\u0F0B', u'\u0F0C', u'\u0F0D', u'\u0F0E', u'\u0F0F', u'\u0F10',
u'\u0F11', u'\u0F12', u'\u0F14', u'\u0F85', u'\u0FD0', u'\u0FD1', u'\u0FD2', u'\u0FD3', u'\u0FD4', u'\u0FD9',
u'\u0FDA', u'\u104A', u'\u104B', u'\u104C', u'\u104D', u'\u104E', u'\u104F', u'\u10FB', u'\u1360', u'\u1361',
u'\u1362', u'\u1363', u'\u1364', u'\u1365', u'\u1366', u'\u1367', u'\u1368', u'\u166D', u'\u166E', u'\u16EB',
u'\u16EC', u'\u16ED', u'\u1735', u'\u1736', u'\u17D4', u'\u17D5', u'\u17D6', u'\u17D8', u'\u17D9', u'\u17DA',
u'\u1800', u'\u1801', u'\u1802', u'\u1803', u'\u1804', u'\u1805', u'\u1807', u'\u1808', u'\u1809', u'\u180A',
u'\u1944', u'\u1945', u'\u1A1E', u'\u1A1F', u'\u1AA0', u'\u1AA1', u'\u1AA2', u'\u1AA3', u'\u1AA4', u'\u1AA5',
u'\u1AA6', u'\u1AA8', u'\u1AA9', u'\u1AAA', u'\u1AAB', u'\u1AAC', u'\u1AAD', u'\u1B5A', u'\u1B5B', u'\u1B5C',
u'\u1B5D', u'\u1B5E', u'\u1B5F', u'\u1B60', u'\u1BFC', u'\u1BFD', u'\u1BFE', u'\u1BFF', u'\u1C3B', u'\u1C3C',
u'\u1C3D', u'\u1C3E', u'\u1C3F', u'\u1C7E', u'\u1C7F', u'\u1CC0', u'\u1CC1', u'\u1CC2', u'\u1CC3', u'\u1CC4',
u'\u1CC5', u'\u1CC6', u'\u1CC7', u'\u1CD3', u'\u2016', u'\u2017', u'\u2020', u'\u2021', u'\u2022', u'\u2023',
u'\u2024', u'\u2025', u'\u2026', u'\u2027', u'\u2030', u'\u2031', u'\u2032', u'\u2033', u'\u2034', u'\u2035',
u'\u2036', u'\u2037', u'\u2038', u'\u203B', u'\u203C', u'\u203D', u'\u203E', u'\u2041', u'\u2042', u'\u2043',
u'\u2047', u'\u2048', u'\u2049', u'\u204A', u'\u204B', u'\u204C', u'\u204D', u'\u204E', u'\u204F', u'\u2050',
u'\u2051', u'\u2053', u'\u2055', u'\u2056', u'\u2057', u'\u2058', u'\u2059', u'\u205A', u'\u205B', u'\u205C',
u'\u205D', u'\u205E', u'\u2CF9', u'\u2CFA', u'\u2CFB', u'\u2CFC', u'\u2CFE', u'\u2CFF', u'\u2D70', u'\u2E00',
u'\u2E01', u'\u2E06', u'\u2E07', u'\u2E08', u'\u2E0B', u'\u2E0E', u'\u2E0F', u'\u2E10', u'\u2E11', u'\u2E12',
u'\u2E13', u'\u2E14', u'\u2E15', u'\u2E16', u'\u2E18', u'\u2E19', u'\u2E1B', u'\u2E1E', u'\u2E1F', u'\u2E2A',
u'\u2E2B', u'\u2E2C', u'\u2E2D', u'\u2E2E', u'\u2E30', u'\u2E31', u'\u2E32', u'\u2E33', u'\u2E34', u'\u2E35',
u'\u2E36', u'\u2E37', u'\u2E38', u'\u2E39', u'\u2E3C', u'\u2E3D', u'\u2E3E', u'\u2E3F', u'\u2E41', u'\u2E43',
u'\u2E44', u'\u3001', u'\u3002', u'\u3003', u'\u303D', u'\u30FB', u'\uA4FE', u'\uA4FF', u'\uA60D', u'\uA60E',
u'\uA60F', u'\uA673', u'\uA67E', u'\uA6F2', u'\uA6F3', u'\uA6F4', u'\uA6F5', u'\uA6F6', u'\uA6F7', u'\uA874',
u'\uA875', u'\uA876', u'\uA877', u'\uA8CE', u'\uA8CF', u'\uA8F8', u'\uA8F9', u'\uA8FA', u'\uA8FC', u'\uA92E',
u'\uA92F', u'\uA95F', u'\uA9C1', u'\uA9C2', u'\uA9C3', u'\uA9C4', u'\uA9C5', u'\uA9C6', u'\uA9C7', u'\uA9C8',
u'\uA9C9', u'\uA9CA', u'\uA9CB', u'\uA9CC', u'\uA9CD', u'\uA9DE', u'\uA9DF', u'\uAA5C', u'\uAA5D', u'\uAA5E',
u'\uAA5F', u'\uAADE', u'\uAADF', u'\uAAF0', u'\uAAF1', u'\uABEB', u'\uFE10', u'\uFE11', u'\uFE12', u'\uFE13',
u'\uFE14', u'\uFE15', u'\uFE16', u'\uFE19', u'\uFE30', u'\uFE45', u'\uFE46', u'\uFE49', u'\uFE4A', u'\uFE4B',
u'\uFE4C', u'\uFE50', u'\uFE51', u'\uFE52', u'\uFE54', u'\uFE55', u'\uFE56', u'\uFE57', u'\uFE5F', u'\uFE60',
u'\uFE61', u'\uFE68', u'\uFE6A', u'\uFE6B', u'\uFF01', u'\uFF02', u'\uFF03', u'\uFF05', u'\uFF06', u'\uFF07',
u'\uFF0A', u'\uFF0C', u'\uFF0E', u'\uFF0F', u'\uFF1A', u'\uFF1B', u'\uFF1F', u'\uFF20', u'\uFF3C', u'\uFF61',
# NB: code points above U+FFFF need 8-digit \U escapes; 4-digit \u escapes cannot encode them.
u'\uFF64', u'\uFF65', u'\U00010100', u'\U00010101', u'\U00010102', u'\U0001039F', u'\U000103D0', u'\U0001056F', u'\U00010857', u'\U0001091F',
u'\U0001093F', u'\U00010A50', u'\U00010A51', u'\U00010A52', u'\U00010A53', u'\U00010A54', u'\U00010A55', u'\U00010A56', u'\U00010A57', u'\U00010A58',
u'\U00010A7F', u'\U00010AF0', u'\U00010AF1', u'\U00010AF2', u'\U00010AF3', u'\U00010AF4', u'\U00010AF5', u'\U00010AF6', u'\U00010B39', u'\U00010B3A',
u'\U00010B3B', u'\U00010B3C', u'\U00010B3D', u'\U00010B3E', u'\U00010B3F', u'\U00010B99', u'\U00010B9A', u'\U00010B9B', u'\U00010B9C', u'\U00011047',
u'\U00011048', u'\U00011049', u'\U0001104A', u'\U0001104B', u'\U0001104C', u'\U0001104D', u'\U000110BB', u'\U000110BC', u'\U000110BE', u'\U000110BF',
u'\U000110C0', u'\U000110C1', u'\U00011140', u'\U00011141', u'\U00011142', u'\U00011143', u'\U00011174', u'\U00011175', u'\U000111C5', u'\U000111C6',
u'\U000111C7', u'\U000111C8', u'\U000111C9', u'\U000111CD', u'\U000111DB', u'\U000111DD', u'\U000111DE', u'\U000111DF', u'\U00011238', u'\U00011239',
u'\U0001123A', u'\U0001123B', u'\U0001123C', u'\U0001123D', u'\U000112A9', u'\U0001144B', u'\U0001144C', u'\U0001144D', u'\U0001144E', u'\U0001144F',
u'\U0001145B', u'\U0001145D', u'\U000114C6', u'\U000115C1', u'\U000115C2', u'\U000115C3', u'\U000115C4', u'\U000115C5', u'\U000115C6', u'\U000115C7',
u'\U000115C8', u'\U000115C9', u'\U000115CA', u'\U000115CB', u'\U000115CC', u'\U000115CD', u'\U000115CE', u'\U000115CF', u'\U000115D0', u'\U000115D1',
u'\U000115D2', u'\U000115D3', u'\U000115D4', u'\U000115D5', u'\U000115D6', u'\U000115D7', u'\U00011641', u'\U00011642', u'\U00011643', u'\U00011660',
u'\U00011661', u'\U00011662', u'\U00011663', u'\U00011664', u'\U00011665', u'\U00011666', u'\U00011667', u'\U00011668', u'\U00011669', u'\U0001166A',
u'\U0001166B', u'\U0001166C', u'\U0001173C', u'\U0001173D', u'\U0001173E', u'\U00011C41', u'\U00011C42', u'\U00011C43', u'\U00011C44', u'\U00011C45',
u'\U00011C70', u'\U00011C71', u'\U00012470', u'\U00012471', u'\U00012472', u'\U00012473', u'\U00012474', u'\U00016A6E', u'\U00016A6F', u'\U00016AF5',
u'\U00016B37', u'\U00016B38', u'\U00016B39', u'\U00016B3A', u'\U00016B3B', u'\U00016B44', u'\U0001BC9F', u'\U0001DA87', u'\U0001DA88', u'\U0001DA89',
u'\U0001DA8A', u'\U0001DA8B', u'\U0001E95E', u'\U0001E95F'
]

# Unicode category Pi (Initial quote).
UNICODE_PI = [
u'\u00AB', u'\u2018', u'\u201B', u'\u201C', u'\u201F', u'\u2039', u'\u2E02', u'\u2E04', u'\u2E09', u'\u2E0C',
u'\u2E1C', u'\u2E20'
]

# Unicode category Pf (Final quote).
UNICODE_PF = [
u'\u00BB', u'\u2019', u'\u201D', u'\u203A', u'\u2E03', u'\u2E05', u'\u2E0A', u'\u2E0D', u'\u2E1D', u'\u2E21'
]

# Unicode category Pe (Close).
UNICODE_PE = [
u'\u0029', u'\u005D', u'\u007D', u'\u0F3B', u'\u0F3D', u'\u169C', u'\u2046', u'\u207E', u'\u208E', u'\u2309',
u'\u230B', u'\u232A', u'\u2769', u'\u276B', u'\u276D', u'\u276F', u'\u2771', u'\u2773', u'\u2775', u'\u27C6',
u'\u27E7', u'\u27E9', u'\u27EB', u'\u27ED', u'\u27EF', u'\u2984', u'\u2986', u'\u2988', u'\u298A', u'\u298C',
u'\u298E', u'\u2990', u'\u2992', u'\u2994', u'\u2996', u'\u2998', u'\u29D9', u'\u29DB', u'\u29FD', u'\u2E23',
u'\u2E25', u'\u2E27', u'\u2E29', u'\u3009', u'\u300B', u'\u300D', u'\u300F', u'\u3011', u'\u3015', u'\u3017',
u'\u3019', u'\u301B', u'\u301E', u'\u301F', u'\uFD3E', u'\uFE18', u'\uFE36', u'\uFE38', u'\uFE3A', u'\uFE3C',
u'\uFE3E', u'\uFE40', u'\uFE42', u'\uFE44', u'\uFE48', u'\uFE5A', u'\uFE5C', u'\uFE5E', u'\uFF09', u'\uFF3D',
u'\uFF5D', u'\uFF60', u'\uFF63'
]

# Unicode category Ps (Open).
UNICODE_PS = [
u'\u0028', u'\u005B', u'\u007B', u'\u0F3A', u'\u0F3C', u'\u169B', u'\u201A', u'\u201E', u'\u2045', u'\u207D',
u'\u208D', u'\u2308', u'\u230A', u'\u2329', u'\u2768', u'\u276A', u'\u276C', u'\u276E', u'\u2770', u'\u2772',
u'\u2774', u'\u27C5', u'\u27E6', u'\u27E8', u'\u27EA', u'\u27EC', u'\u27EE', u'\u2983', u'\u2985', u'\u2987',
u'\u2989', u'\u298B', u'\u298D', u'\u298F', u'\u2991', u'\u2993', u'\u2995', u'\u2997', u'\u29D8', u'\u29DA',
u'\u29FC', u'\u2E22', u'\u2E24', u'\u2E26', u'\u2E28', u'\u2E42', u'\u3008', u'\u300A', u'\u300C', u'\u300E',
u'\u3010', u'\u3014', u'\u3016', u'\u3018', u'\u301A', u'\u301D', u'\uFD3F', u'\uFE17', u'\uFE35', u'\uFE37',
u'\uFE39', u'\uFE3B', u'\uFE3D', u'\uFE3F', u'\uFE41', u'\uFE43', u'\uFE47', u'\uFE59', u'\uFE5B', u'\uFE5D',
u'\uFF08', u'\uFF3B', u'\uFF5B', u'\uFF5F', u'\uFF62'
]
~~~
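To get the fast lookups mentioned above, such lists can be compiled once into a set or a regex character class. A minimal sketch with abbreviated sample data (the names and approach are illustrative, not docutils internals):

```python
import re

# Abbreviated samples of the hard-coded category lists shown above.
UNICODE_PO = ['\u0021', '\u002C', '\u002E', '\u00B6']   # Po (Other)
UNICODE_PE = ['\u0029', '\u005D', '\u007D']             # Pe (Close)

# Option 1: O(1) membership tests via a frozenset.
END_FOLLOWERS = frozenset(UNICODE_PO + UNICODE_PE)

# Option 2: one precompiled character class for use inside larger patterns.
END_FOLLOWER_RE = re.compile('[%s]' % re.escape(''.join(END_FOLLOWERS)))

assert '\u00B6' in END_FOLLOWERS       # pilcrow allowed after an end string
assert END_FOLLOWER_RE.match('\u00B6')
```

Either form avoids rescanning the raw lists on every inline-markup check, which is what makes the precomputation worthwhile.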


Edward d'Auvergne
2016-12-31 08:53:47 UTC
Permalink
Being a list from a published standard, that code would be non-copyrightable (i.e. public domain), so anyone can use it.


Günter Milde
2017-01-05 14:14:18 UTC
Permalink
- **status**: open --> closed-fixed
- **Comment**:

After intensive discussion on the docutils-devel list, this bug is fixed by referencing the Unicode standard version used in the reference application.

* For stability, no rst parser should "silently" update the Unicode standard version used to parse inline markup, as this may break existing rst documents.

* For practicability and stability, Docutils uses pre-generated patterns based on Python's "unicodedata" stdlib module.

File an enhancement request if you want an updated version of the pre-generated patterns in one of the next "feature releases". (However, for efficiency we will not support Unicode standard versions newer than what Python's unicodedata supports.)


