Tag Pattern Issues

Revisiting the page tag.


1. Issues

1.1. Underscore

Currently it is possible to create a page tag with an underscore, but the page then cannot be found via the run() function, because the underscore is stripped before processing due to the urls_underscores option for WikiWords.

<?php

// normalize & sanitize tag name
$this->sanitize_page_tag($tag);
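
A minimal sketch of the failure, assuming the underscore handling amounts to removing every _ from the requested tag. The function strip_underscores() and the $pages array are hypothetical stand-ins, not WackoWiki code:

```php
<?php

// Hypothetical stand-in for the underscore stripping that happens
// before the page lookup (assumption: every "_" is simply removed).
function strip_underscores(string $tag): string
{
    return str_replace('_', '', $tag);
}

// Hypothetical page table keyed by tag; the page was saved with an underscore.
$pages = ['Wiki_Name' => 'page body'];

// The requested tag loses its underscore before the lookup ...
$requested = strip_underscores('Wiki_Name');

// ... so the lookup key no longer matches the stored tag.
$found = array_key_exists($requested, $pages);

echo $requested, ' => ', $found ? 'found' : 'not found', "\n";
```

The stored tag keeps its underscore while the lookup key does not, which is why such a page becomes unreachable.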

1.2. Dot and Hyphen

PageTag.........
--PageTag........./
PageTag........./PageTag.........
--Page.Tag.........
--Page-Tag.........
--Page-.-Tag.........
--Page
...--Tag.........


What makes the full stop (. dot) and hyphen-minus (- hyphen) problematic is that they are also part of the existing syntax, so their incorrect use may conflict with the WikiSyntax, for example:


--PageTag-- strike-through
__PageTag__ underlined
---PagePage hard linebreak
../PagePage relative path


Therefore only alphanumeric characters should be allowed at the start and end of the tag and of each subtag. Furthermore, more than one dot or hyphen in consecutive order should not be allowed.


Now a page tag may consist of a tag and parent tags, e.g. Cluster/SubCluster/PageTag, and the same rules apply to the parent tags.
--/jj---jj---//---kk-k/--- should be sanitized to jj-jj/kk-k to be valid.
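
The rules above can be sketched as a per-segment clean-up. This is only an illustration under stated assumptions: sanitize_tag_segments() is a hypothetical helper, not the existing sanitize_page_tag(), it assumes valid UTF-8 input, and collapsing a run of dots and hyphens to its first character is one possible reading of the rule.

```php
<?php

// Hypothetical helper, not part of WackoWiki: applies the rules above
// to every segment of a page tag.
function sanitize_tag_segments(string $tag): string
{
    $segments = [];

    foreach (explode('/', $tag) as $segment)
    {
        // drop everything except letters, marks, digits, dot and hyphen
        $segment = preg_replace('/[^\p{L}\p{M}\p{Nd}.\-]/u', '', $segment);

        // collapse any run of dots and hyphens to its first character
        $segment = preg_replace('/([.\-])[.\-]+/u', '$1', $segment);

        // a segment must not start or end with a dot or hyphen
        $segment = trim($segment, '.-');

        // skip segments that end up empty, e.g. from "//" or "---"
        if ($segment !== '')
        {
            $segments[] = $segment;
        }
    }

    return implode('/', $segments);
}

echo sanitize_tag_segments('--/jj---jj---//---kk-k/---'), "\n";
```

Splitting on the slash first keeps the rules identical for the tag and all of its parent tags, and empty segments disappear without a separate multi-slash step.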

2. Suggested Solution

2.1. Underscore

  1. filter out possible underscores in page tag during page creation by introducing a new $wacko_language['TAG'] pattern without _
    1. patch 1, patch 2
    2. $wacko_language['TAG']		= '[\p{L}\p{M}\p{Nd}\.\-\/]';
      $wacko_language['TAG_P']	= '\p{L}\p{M}\p{Nd}\.\-\/';	
  2. HOTFIX

    If you've created an inaccessible page with an underscore, just remove the underscore from the tag in the page table.

3. Considerations

The only reason we don't allow the underscore in the page tag is the urls_underscores option for WikiWords. In other words, if we removed this option, we could also allow page tags with underscores, but then Wiki_Name and WikiName would no longer be interchangeable.


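+---------+-----------+-------------------------------+
| page_id | tag       | URI                           |
+---------+-----------+-------------------------------+
| 1       | WikiName  | https://example.com/WikiName  |
| 2       | Wiki_Name | https://example.com/Wiki_Name |
+---------+-----------+-------------------------------+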

If you want to use a WackoWiki that allows underscores in tags, you can no longer use the urls_underscores option. We can add such an option, but then we must also set a flag in the config that disables the urls_underscores option once and for all, because it is not backward compatible.


The user name pattern is a subset of the page tag pattern, without the slash / and possibly with additional mandated naming conventions.


In the upload handler, spaces in the file name are replaced by underscores.


The underscore character is used to create visual spacing within a sequence of characters, where a whitespace character is not permitted (e.g., in filenames, email addresses, and in Internet URLs).


A page tag cannot exceed 255 bytes in length. Be aware that non-ASCII characters may take up to four bytes in UTF-8 encoding, so the total number of characters that can fit into a tag may be less than 255.
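
The difference between bytes and characters can be made concrete with a short sketch. Both utf8_char_count() and tag_fits() are hypothetical helpers written for this illustration; the sample strings are arbitrary.

```php
<?php

// Count UTF-8 code points. mb_strlen($s, 'UTF-8') is the usual choice
// where the mbstring extension is available.
function utf8_char_count(string $s): int
{
    return preg_match_all('/./su', $s);
}

// Hypothetical check against the 255-byte limit described above.
function tag_fits(string $tag, int $max_bytes = 255): bool
{
    // strlen() counts bytes, not characters
    return strlen($tag) <= $max_bytes;
}

$ascii    = str_repeat('a', 255);         // 1 byte per character
$cyrillic = str_repeat("\u{044F}", 255);  // 2 bytes per character
$deseret  = str_repeat("\u{10400}", 64);  // 4 bytes per character

foreach ([$ascii, $cyrillic, $deseret] as $tag)
{
    printf("%d chars, %d bytes, fits: %s\n",
        utf8_char_count($tag), strlen($tag), tag_fits($tag) ? 'yes' : 'no');
}
```

With four-byte characters, 64 characters already exceed the limit, while 255 ASCII characters still fit.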


To work with database versions that do not support index key prefixes longer than 767 bytes, the tag column is defined as varchar(191):

+------------------+---------------------+------+-----+---------+----------------+
| Field            | Type                | Null | Key | Default | Extra          |
+------------------+---------------------+------+-----+---------+----------------+
| tag              | varchar(191)        | NO   | UNI |         |                |
+------------------+---------------------+------+-----+---------+----------------+	

Newer database versions support index key prefixes up to 3072 bytes by default.
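
The column length follows from the key prefix limit by simple arithmetic. A small sketch, assuming a worst case of four bytes per character as mentioned above; max_indexed_chars() is a hypothetical helper:

```php
<?php

// Largest varchar length whose worst-case byte size still fits the
// index key prefix limit. Four bytes per character is an assumption
// (worst case for UTF-8 as described above).
function max_indexed_chars(int $prefix_limit_bytes, int $bytes_per_char = 4): int
{
    return intdiv($prefix_limit_bytes, $bytes_per_char);
}

echo max_indexed_chars(767), "\n";   // limit of older database versions
echo max_indexed_chars(3072), "\n";  // default limit of newer versions
```

Under that assumption, 191 characters need at most 764 bytes, which stays below the 767-byte limit.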


Data Type Storage Requirements

4. Regex pattern


$wacko_language['USER_NAME']	= '[\p{L}\p{Nd}\.\-]+';
$wacko_language['USER_NAME_P']	= '\p{L}\p{Nd}\.\-';

$wacko_language['TAG']		= '[\p{L}\p{M}\p{Nd}\.\-\/]';
$wacko_language['TAG_P']	= '\p{L}\p{M}\p{Nd}\.\-\/';

$wacko_language['UPPER']	= '[\p{Lu}]';
$wacko_language['UPPERNUM']	= '[\p{Lu}\p{Nd}]';
$wacko_language['LOWER']	= '[\p{Ll}\/]';
$wacko_language['ALPHA']	= '[\p{L}\_\-\/]';
$wacko_language['ALPHANUM']	= '[\p{L}\p{M}\p{Nd}\_\-\/]';
$wacko_language['ALPHANUM_P']	= '\p{L}\p{M}\p{Nd}\_\-\/';	

4.1. User name

\p{L}\p{Nd}\-\.	

4.2. Page tag

\p{L}\p{M}\p{Nd}\-\.\/	

4.3. URL underscore

\p{L}\p{M}\p{Nd}\_\-\.\/	
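
A short sketch of how the three character sets differ in practice. The sets are copied from above; only_chars() is a hypothetical helper that wraps a set in a character class, similar in spirit to how the _P variants are embedded in larger expressions.

```php
<?php

// Character sets copied from the sections above.
$user_name_p = '\p{L}\p{Nd}\-\.';
$tag_p       = '\p{L}\p{M}\p{Nd}\-\.\/';
$url_p       = '\p{L}\p{M}\p{Nd}\_\-\.\/';

// Hypothetical helper: true if $value consists only of characters
// from $set. \z anchors at the very end, so a trailing newline fails.
function only_chars(string $set, string $value): bool
{
    return preg_match('/^[' . $set . ']+\z/u', $value) === 1;
}

var_dump(only_chars($tag_p, 'Cluster/SubCluster/PageTag'));
var_dump(only_chars($user_name_p, 'Cluster/PageTag'));
var_dump(only_chars($tag_p, 'Wiki_Name'));
var_dump(only_chars($url_p, 'Wiki_Name'));
```

The slash is only accepted by the page tag and URL sets, and the underscore only by the URL set.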

5. Processing

5.1. Validation

5.1.1. JS Client

<?php

$tpl->pattern    = $this->language['TAG'];

<input type="text" id="new_tag" name="tag" value="[ ' tag | e attr ' ]" pattern="[ ' pattern | e attr ' ]" title="[ ' only | e attr ' ]" size="60" maxlength="255">	

5.1.2. PHP Server

<?php

if (!preg_match('/^([' . $this->language['TAG_P'] . '\.]+)$/u', $new_tag))
{
    $this->set_message($this->_t('InvalidWikiName'));
}
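
The check above only tests the allowed characters. A stricter validation that also enforces the rules from section 1.2 could look like the following sketch. is_valid_page_tag() is a hypothetical function, not the current WackoWiki implementation, and treating "alphanumeric" as a letter or digit followed by optional combining marks is an assumption.

```php
<?php

// Hypothetical stricter check, not the WackoWiki implementation.
function is_valid_page_tag(string $tag): bool
{
    // one letter or digit, optionally followed by combining marks
    $unit    = '[\p{L}\p{Nd}]\p{M}*';

    // units separated by single dots or hyphens
    $segment = '(?:' . $unit . ')+(?:[.\-](?:' . $unit . ')+)*';

    // segments separated by single slashes
    $pattern = '#\A' . $segment . '(?:/' . $segment . ')*\z#u';

    return preg_match($pattern, $tag) === 1;
}

foreach (['Cluster/SubCluster/PageTag', 'Page.Tag', '--PageTag', 'Page-.-Tag', 'PageTag.'] as $tag)
{
    echo $tag, ' => ', is_valid_page_tag($tag) ? 'valid' : 'invalid', "\n";
}
```

Because every separator must be followed by at least one alphanumeric unit, leading, trailing and consecutive dots, hyphens or slashes are rejected by construction.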

5.2. Sanitization

<?php

function sanitize_page_tag(&$tag, $normalize = false)
{
    // normalizing tag name
    $tag = Ut::normalize($tag);

    // remove starting/trailing slashes, spaces, and minimize multi-slashes
    $tag = preg_replace_callback('#^/+|/+$|(/{2,})|\s+#u',
        function ($x)
        {
            // a captured group means a run of slashes inside the tag: keep one
            return !empty($x[1]) ? '/' : '';
        }, $tag);

    $tag = preg_replace('/[^' . $this->language['TAG_P'] . '\.]/u', '', $tag);

    // strip full stop, hyphen-minus and underline from the beginning and end of the string
    $tag = utf8_trim($tag, '.-_');
}

6. Message sets

Validation error messages for the info box, as well as the title attribute provided for form input patterns.


A message set is missing that reflects the fact that, along with alphanumeric characters, the full stop (. dot), hyphen-minus (- hyphen) and slash (/) are allowed in the page tag, but the underscore (_) is not.


'InvalidWikiName'			=> 'Chosen name is invalid',
	'InvalidUserName'			=> 'Chosen user name is invalid',

	'NameAlphanumOnly'			=> 'Username must be between %1 and %2 chars long and use only alphanumeric characters. Upper case characters are OK.',
	'NameCamelCaseOnly'			=> 'Username must be between %1 and %2 chars long and WikiName formatted.',	

Ut::perc_replace($this->_t($this->db->disable_wikiname? 'NameAlphanumOnly' : 'NameCamelCaseOnly'),
			$this->db->username_chars_min,
			$this->db->username_chars_max);
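
How the %1 and %2 placeholders in such a message get filled can be sketched as follows. fill_placeholders() is a hypothetical stand-in that assumes positional replacement; it is not the actual Ut::perc_replace() code, and the limits 3 and 25 are made-up example values.

```php
<?php

// Hypothetical stand-in for Ut::perc_replace(): replaces %1, %2, ...
// with the given arguments in order (assumed behaviour).
function fill_placeholders(string $message, ...$args): string
{
    $map = [];

    foreach ($args as $i => $arg)
    {
        $map['%' . ($i + 1)] = (string) $arg;
    }

    // strtr() tries the longest keys first, so %10 is not mistaken for %1
    return strtr($message, $map);
}

$message = 'Username must be between %1 and %2 chars long and WikiName formatted.';

echo fill_placeholders($message, 3, 25), "\n";
```

A message for the page tag rules could use the same mechanism once a suitable message set exists.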