Tag Pattern Issues

Revisiting the page tag.
{{toc numerate=1}}

===Issues===
====Underscore====
Currently it is possible to create a page tag with an underscore, but the page then cannot be found via the ##run()## function, because ##_## is stripped before processing as a consequence of the ##urls_underscores## option for WikiWords.
%%(php)
<?php

// normalize & sanitize tag name
$this->sanitize_page_tag($tag);
%%
====Dot and Hyphen====
--PageTag.........
--PageTag........./--PageTag........./--PageTag.........
--Page.Tag.........
--Page-Tag.........
--Page-.-Tag.........
--Page--...--Tag.........

The **full stop** (##.## dot) and **hyphen-minus** (##-## hyphen) are problematic because they are also part of existing syntax, so their incorrect use may conflict with the WikiSyntax, for example:

##""--PageTag--""## strike-through
##""__PageTag__""## underlined
##""---PagePage""## hard linebreak
##""../PagePage""## relative path

Therefore only alphanumeric characters should be allowed at the start and end of the tag and of each subtag. Furthermore, more than one dot or hyphen in consecutive order should not be allowed.

A page tag may consist of a tag and parent tags, e.g. **Cluster/SubCluster/PageTag**, and the same rules apply to the parent tags.
##""--/jj---jj---//---kk-k/---""## should be sanitized to ##jj-jj/kk-k## to be valid.
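
The rules above can be sketched as a small stand-alone function. This is only an illustration, not the WackoWiki implementation: the name ##sanitize_subtags_sketch## is hypothetical, and collapsing a run of dots and hyphens into its first character is an assumption chosen so that the example above yields ##jj-jj/kk-k##.
%%(php)
<?php

// hypothetical helper, not part of WackoWiki: applies the dot/hyphen rules to each subtag
function sanitize_subtags_sketch(string $tag): string
{
	$parts = [];

	foreach (explode('/', $tag) as $part)
	{
		// assumption: a run of dots and hyphens collapses into its first character
		$part = preg_replace('/([.\-])[.\-]+/u', '$1', $part);

		// a subtag must start and end with an alphanumeric character
		$part = trim($part, '.-');

		// drop subtags that end up empty (e.g. from "//" or "---")
		if ($part !== '')
		{
			$parts[] = $part;
		}
	}

	return implode('/', $parts);
}

echo sanitize_subtags_sketch('--/jj---jj---//---kk-k/---'); // jj-jj/kk-k
%%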

===Suggested Solution===
====Underscore====
  1. filter out possible underscores in page tag during page creation by introducing a new ##$wacko_language['TAG']## pattern without ##_##
    1. ((commit:c62b47172363d52194cc202de9e762c7b124a0e1 patch 1)), ((commit:e71ab5bbd84a755be265a9af19658a961f0e7610 patch 2))
    2. %%$wacko_language['TAG']		= '[\p{L}\p{M}\p{Nd}\.\-\/]';
$wacko_language['TAG_P']	= '\p{L}\p{M}\p{Nd}\.\-\/';%% 
  3. %%(info type="example" title="HOTFIX")
If you've created an inaccessible page with an underscore, just remove the underscore from the ##tag## in the page table.
%%
 
===Considerations===
The only reason the underscore is not allowed in the page tag is the ##urls_underscores## option for WikiWords. In other words, if we removed this option, we could also allow page tags with an underscore, but then ##Wiki_Name## and ##WikiName## would no longer be interchangeable.

#|
*| page_id | tag | URI |*
|| 1 | WikiName | ~https://example.com/WikiName ||
|| 2 | Wiki_Name | ~https://example.com/Wiki_Name ||
|#
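
Why the two rows cannot coexist with ##urls_underscores## can be shown with a minimal sketch. It assumes the option simply drops every underscore from the requested tag before lookup; the function name is hypothetical and the real handling in WackoWiki may differ.
%%(php)
<?php

// assumption: with urls_underscores enabled, underscores are dropped from the requested tag
function strip_underscores_sketch(string $tag): string
{
	return str_replace('_', '', $tag);
}

// both requests resolve to the same tag, so a page stored as "Wiki_Name" is never reached
var_dump(strip_underscores_sketch('Wiki_Name') === strip_underscores_sketch('WikiName')); // bool(true)
%%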

If you want to run a WackoWiki that allows underscores in ##tag##, you can no longer use the ##urls_underscores## option. We can add such an option, but then we must also set a flag in the config that disables the ##urls_underscores## option once and for all, because the change is not backward compatible.

The user name pattern is a subset of the page tag pattern: it excludes the slash ##/## and may be subject to additional mandated naming conventions.

In the upload handler, spaces in the file name are replaced by underscores.

The underscore character is used to create visual spacing within a sequence of characters, where a whitespace character is not permitted (e.g., in filenames, email addresses, and in Internet URLs).

A page tag cannot exceed 255 bytes in length. Be aware that non-ASCII characters may take up to four bytes in UTF-8 encoding, so the total number of characters that fit into a tag may be less than 255.
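
The difference between bytes and characters can be checked as follows. This is a sketch only: ##tag_fits_byte_limit## is a hypothetical helper, and ##mb_strlen()## assumes the mbstring extension is available and the source file is saved as UTF-8.
%%(php)
<?php

// hypothetical helper: strlen() counts bytes, which is what the 255-byte limit refers to
function tag_fits_byte_limit(string $tag, int $max_bytes = 255): bool
{
	return strlen($tag) <= $max_bytes;
}

$tag = 'Страница';

echo mb_strlen($tag, 'UTF-8'); // 8 characters
echo "\n";
echo strlen($tag);             // 16 bytes, each Cyrillic letter takes two bytes in UTF-8
%%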

For database versions that do not support index key prefixes longer than 767 bytes, the ##tag## column is limited to 191 characters (191 × 4 bytes = 764 bytes):
%%
+------------------+---------------------+------+-----+---------+----------------+
| Field            | Type                | Null | Key | Default | Extra          |
+------------------+---------------------+------+-----+---------+----------------+
| tag              | varchar(191)        | NO   | UNI |         |                |
+------------------+---------------------+------+-----+---------+----------------+
%%
Newer database versions support index key prefixes up to 3072 bytes by default.


((https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html Data Type Storage Requirements))

===Regex pattern===

%%
$wacko_language['USER_NAME']	= '[\p{L}\p{Nd}\.\-]+';
$wacko_language['USER_NAME_P']	= '\p{L}\p{Nd}\.\-';

$wacko_language['TAG']		= '[\p{L}\p{M}\p{Nd}\.\-\/]';
$wacko_language['TAG_P']	= '\p{L}\p{M}\p{Nd}\.\-\/';

$wacko_language['UPPER']	= '[\p{Lu}]';
$wacko_language['UPPERNUM']	= '[\p{Lu}\p{Nd}]';
$wacko_language['LOWER']	= '[\p{Ll}\/]';
$wacko_language['ALPHA']	= '[\p{L}\_\-\/]';
$wacko_language['ALPHANUM']	= '[\p{L}\p{M}\p{Nd}\_\-\/]';
$wacko_language['ALPHANUM_P']	= '\p{L}\p{M}\p{Nd}\_\-\/';
%%

====User name====
%%\p{L}\p{Nd}\-\.%% 
====Page tag====
%%\p{L}\p{M}\p{Nd}\-\.\/%%
====URL underscore====
%%\p{L}\p{M}\p{Nd}\_\-\.\/%%

===Processing===
====Validation====
=====JS Client =====
%%(php)
<?php

$tpl->pattern	= $this->language['TAG'];
%% 
%%
<input type="text" id="new_tag" name="tag" value="[ ' tag | e attr ' ]" pattern="[ ' pattern | e attr ' ]" title="[ ' only | e attr ' ]" size="60" maxlength="255">
%%
=====PHP Server =====
%%(php)
<?php

if (!preg_match('/^([' . $this->language['TAG_P'] . '\.]+)$/u', $new_tag))
{
	$this->set_message($this->_t('InvalidWikiName'));
}
%% 
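The server-side check can be tried in isolation with the sketch below. The character class is copied from ##TAG_P## above; the helper name ##is_valid_tag_sketch## is hypothetical. Note that the last example passes, which illustrates that a character-class check alone does not enforce the start/end rules from the Issues section.
%%(php)
<?php

// character class body copied from $wacko_language['TAG_P']
$tag_p = '\p{L}\p{M}\p{Nd}\.\-\/';

// hypothetical stand-alone helper mirroring the server-side check
function is_valid_tag_sketch(string $tag, string $tag_p): bool
{
	return preg_match('/^[' . $tag_p . ']+$/u', $tag) === 1;
}

var_dump(is_valid_tag_sketch('Cluster/Page-Tag.v2', $tag_p)); // bool(true)
var_dump(is_valid_tag_sketch('Wiki_Name', $tag_p));           // bool(false), underscore is not in the class
var_dump(is_valid_tag_sketch('--PageTag--', $tag_p));         // bool(true), position rules are not checked
%%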
====Sanitization====
%%(php)
<?php

function sanitize_page_tag(&$tag, $normalize = false)
{
	// normalizing tag name
	$tag = Ut::normalize($tag);

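	// remove leading/trailing slashes and all whitespace, and collapse runs of slashes into one
	$tag = preg_replace_callback('#^/+|/+$|(/{2,})|\s+#u',
		function ($x)
		{
			// group 1 is set only for an inner run of slashes; every other match is dropped
			return !empty($x[1]) ? '/' : '';
		}, $tag);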

	$tag = preg_replace('/[^' . $this->language['TAG_P'] . '\.]/u', '', $tag);

	// strip full stop, hyphen-minus and underscore from the beginning and end of the string
	$tag = utf8_trim($tag, '.-_');
}
%%

===Message sets===
Validation error messages are used for the info box as well as for the ##title## attribute of form inputs that carry a ##pattern##.

A message set is missing that reflects the fact that, along with alphanumeric characters, the **full stop** (##.## dot), **hyphen-minus** (##-## hyphen) and **slash** (##/##) are allowed in the page tag, but the underscore (##_##) is not.

%%
	'InvalidWikiName'			=> 'Chosen name is invalid',
	'InvalidUserName'			=> 'Chosen user name is invalid',

	'NameAlphanumOnly'			=> 'Username must be between %1 and %2 chars long and use only alphanumeric characters. Upper case characters are OK.',
	'NameCamelCaseOnly'			=> 'Username must be between %1 and %2 chars long and WikiName formatted.',
%%

%%
Ut::perc_replace($this->_t($this->db->disable_wikiname? 'NameAlphanumOnly' : 'NameCamelCaseOnly'),
			$this->db->username_chars_min,
			$this->db->username_chars_max);
%%