Tag Pattern Issues
Revisiting the page tag.
1. Issues
1.1. Underscore
Currently it is possible to create a page tag with an underscore, but the page then cannot be found via the run() function, because the _ is stripped before processing on account of the urls_underscores option for WikiWords.
<?php
// normalize & sanitize tag name
$this->sanitize_page_tag($tag);
1.2. Dot and Hyphen
PageTag.........PageTag........./
--PageTag........./PageTag............--Tag.........
--Page.Tag.........
--Page-Tag.........
--Page-.-Tag.........
--Page
The full stop (., dot) and the hyphen-minus (-, hyphen) are problematic because they are also part of existing syntax, so incorrect use of them may conflict with the WikiSyntax, for example:
Syntax | Meaning |
---|---|
--PageTag-- | strike-through |
__PageTag__ | underlined |
---PagePage | hard line break |
../PagePage | relative path |
Therefore only alphanumeric characters should be allowed at the start and end of the tag and of each subtag. Furthermore, more than one dot or hyphen in a row should not be allowed.
A page tag may consist of a tag and parent tags, e.g. Cluster/SubCluster/PageTag, and the same rules apply to the parent tags.
For example, --/jj---jj---//---kk-k/--- should be sanitized to jj-jj/kk-k to be valid.
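The rules above can be sketched as a per-segment clean-up. This is only an illustration, not WackoWiki code: the function name sketch_sanitize_tag is hypothetical, the input is assumed to be valid UTF-8, and collapsing a mixed run such as -.- to its first character is one possible reading of "no consecutive dots or hyphens". Characters that are disallowed inside a segment (spaces, underscores) are deliberately left to the existing character filter.

```php
<?php
// Hypothetical helper, not part of WackoWiki: applies the start/end and
// "no consecutive dot or hyphen" rules to every slash-separated segment.
function sketch_sanitize_tag(string $tag): string
{
    $segments = [];

    foreach (explode('/', $tag) as $segment)
    {
        // collapse a run of dots and/or hyphens into its first character
        $segment = preg_replace('/([.\-])[.\-]+/u', '$1', $segment);

        // strip everything up to the first letter or digit, and everything
        // after the last letter, combining mark, or digit
        $segment = preg_replace('/^[^\p{L}\p{Nd}]+|[^\p{L}\p{M}\p{Nd}]+$/u', '', $segment);

        // drop segments that became empty (leading, trailing, or doubled slashes)
        if ($segment !== '')
        {
            $segments[] = $segment;
        }
    }

    return implode('/', $segments);
}

echo sketch_sanitize_tag('--/jj---jj---//---kk-k/---'), "\n"; // jj-jj/kk-k
```

Working segment by segment keeps the rule for parent tags identical to the rule for the tag itself, which is what the text above asks for.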
2. Suggested Solution
2.1. Underscore
- Filter out possible underscores in the page tag during page creation by introducing a new $wacko_language['TAG'] pattern without _.
- HOTFIX: If you have created an inaccessible page with an underscore, just remove the underscore from the tag in the page table.
3. Considerations
The only reason we do not allow the underscore in the page tag is the urls_underscores option for WikiWords. In other words, if we removed this option, we could also allow page tags with an underscore, but then Wiki_Name and WikiName would no longer be interchangeable.
page_id | tag | URI |
---|---|---|
1 | WikiName | https://example.com/WikiName |
2 | Wiki_Name | https://example.com/Wiki_Name |
If you want to run WackoWiki with underscores allowed in the tag, you can no longer use the urls_underscores option. We can add such an option, but then we must also set a flag in the config that disables the urls_underscores option once and for all, because the change is not backward compatible.
The user name pattern is a subset of the page tag pattern: it excludes the slash (/) and may be subject to additional mandated naming conventions.
In the upload handler, spaces in the file name are replaced by underscores.
The underscore character is used to create visual spacing within a sequence of characters, where a whitespace character is not permitted (e.g., in filenames, email addresses, and in Internet URLs).
A page tag cannot exceed 255 bytes in length. Be aware that non-ASCII characters may take up to four bytes in UTF-8 encoding, so the total number of characters that fit into a tag may be less than 255.
To work with database versions that do not support index key prefixes longer than 767 bytes, the tag column is limited to 191 characters:
+------------------+---------------------+------+-----+---------+----------------+
| Field            | Type                | Null | Key | Default | Extra          |
+------------------+---------------------+------+-----+---------+----------------+
| tag              | varchar(191)        | NO   | UNI |         |                |
+------------------+---------------------+------+-----+---------+----------------+
Newer database versions support index key prefixes up to 3072 bytes by default.
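The varchar(191) width follows from simple arithmetic on the key prefix limit. The sketch below assumes a charset that needs up to four bytes per character (such as utf8mb4); the constant and function names are made up for illustration.

```php
<?php
// Worst-case bytes per character for a four-byte UTF-8 charset (assumption).
const MAX_BYTES_PER_CHAR = 4;

// Largest number of characters whose worst-case encoding still fits
// into an index key prefix of the given size.
function max_indexable_chars(int $key_limit_bytes): int
{
    return intdiv($key_limit_bytes, MAX_BYTES_PER_CHAR);
}

echo max_indexable_chars(767), "\n";  // 191
echo max_indexable_chars(3072), "\n"; // 768
```

With the 3072-byte limit a full 255-character unique column would fit, which is why the restriction only matters for older database versions.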
Data Type Storage Requirements
4. Regex pattern
$wacko_language['USER_NAME']   = '[\p{L}\p{Nd}\.\-]+';
$wacko_language['USER_NAME_P'] = '\p{L}\p{Nd}\.\-';
$wacko_language['TAG']         = '[\p{L}\p{M}\p{Nd}\.\-\/]';
$wacko_language['TAG_P']       = '\p{L}\p{M}\p{Nd}\.\-\/';
$wacko_language['UPPER']       = '[\p{Lu}]';
$wacko_language['UPPERNUM']    = '[\p{Lu}\p{Nd}]';
$wacko_language['LOWER']       = '[\p{Ll}\/]';
$wacko_language['ALPHA']       = '[\p{L}\_\-\/]';
$wacko_language['ALPHANUM']    = '[\p{L}\p{M}\p{Nd}\_\-\/]';
$wacko_language['ALPHANUM_P']  = '\p{L}\p{M}\p{Nd}\_\-\/';
4.1. User name
\p{L}\p{Nd}\-\.
4.2. Page tag
\p{L}\p{M}\p{Nd}\-\.\/
4.3. URL underscore
\p{L}\p{M}\p{Nd}\_\-\.\/
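To see how the three character sets differ, they can be tried against a few sample strings. This is a minimal sketch: the variable names and the helper matches_class are hypothetical, and the sample values are invented for illustration.

```php
<?php
// The three character sets from sections 4.1 to 4.3, as class bodies.
$user_name_p      = '\p{L}\p{Nd}\-\.';
$tag_p            = '\p{L}\p{M}\p{Nd}\-\.\/';
$url_underscore_p = '\p{L}\p{M}\p{Nd}\_\-\.\/';

// Hypothetical helper: true if $value consists only of characters
// from the given class body (the u modifier enables \p{..} on UTF-8).
function matches_class(string $class, string $value): bool
{
    return preg_match('/^[' . $class . ']+$/u', $value) === 1;
}

var_dump(matches_class($user_name_p, 'Cluster/Page'));  // bool(false): no slash in user names
var_dump(matches_class($tag_p, 'Cluster/Page.Tag'));    // bool(true)
var_dump(matches_class($tag_p, 'Wiki_Name'));           // bool(false): no underscore in tags
var_dump(matches_class($url_underscore_p, 'Wiki_Name')); // bool(true)
```

Note that these classes only restrict which characters may appear; they say nothing about where a dot, hyphen, or slash may be placed.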
5. Processing
5.1. Validation
5.1.1. JS Client
<?php
$tpl->pattern = $this->language['TAG'];
<input type="text" id="new_tag" name="tag" value="[ ' tag | e attr ' ]" pattern="[ ' pattern | e attr ' ]" title="[ ' only | e attr ' ]" size="60" maxlength="255">
5.1.2. PHP Server
<?php
if (!preg_match('/^([' . $this->language['TAG_P'] . '\.]+)$/u', $new_tag))
{
    $this->set_message($this->_t('InvalidWikiName'));
}
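The check above only restricts the character set. A stricter validation that also enforces the structural rules from section 1.2 and the 255-byte limit could look like the following sketch. It is not WackoWiki's validator: the function name is hypothetical, and the pattern is one possible encoding of "alphanumeric at the start and end of each subtag, no consecutive dots or hyphens".

```php
<?php
// Hypothetical stricter check, not WackoWiki's actual validator.
function is_valid_page_tag_sketch(string $tag): bool
{
    // a letter or digit, followed by any number of letters, marks, or digits
    $word    = '[\p{L}\p{Nd}][\p{L}\p{M}\p{Nd}]*';
    // words joined by exactly one dot or hyphen
    $segment = $word . '(?:[.\-]' . $word . ')*';
    // segments joined by exactly one slash; D keeps $ from matching before a trailing newline
    $pattern = '/^' . $segment . '(?:\/' . $segment . ')*$/Du';

    // strlen() counts bytes, not characters, which is what the 255-byte limit needs
    return strlen($tag) <= 255 && preg_match($pattern, $tag) === 1;
}

var_dump(is_valid_page_tag_sketch('Cluster/SubCluster/PageTag')); // bool(true)
var_dump(is_valid_page_tag_sketch('--PageTag--'));                // bool(false)
var_dump(is_valid_page_tag_sketch('Page-.-Tag'));                 // bool(false)
```

Rejecting such input at validation time and repairing it at sanitization time are complementary; which of the two a form should do is a design decision this sketch does not settle.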
5.2. Sanitization
<?php
function sanitize_page_tag(&$tag, $normalize = false)
{
    // normalizing tag name
    $tag = Ut::normalize($tag);

    // remove starting/trailing slashes, spaces, and minimize multi-slashes
    $tag = preg_replace_callback('#^/+|/+$|(/{2,})|\s+#u',
        function ($x)
        {
            // group 1 is only set when a run of slashes inside the tag matched
            return !empty($x[1]) ? '/' : '';
        }, $tag);

    // remove every character outside the allowed tag character set
    $tag = preg_replace('/[^' . $this->language['TAG_P'] . '\.]/u', '', $tag);

    // strip full stop, hyphen-minus and underscore from the beginning and end of the string
    $tag = utf8_trim($tag, '.-_');
}
6. Message sets
Validation error messages for the info box, as well as the title attribute provided for form input patterns.
A message set is missing that reflects the fact that, along with alphanumeric characters, the full stop (., dot), hyphen-minus (-, hyphen), and slash (/) are allowed in the page tag, but the underscore (_) is not.
'InvalidWikiName'   => 'Chosen name is invalid',
'InvalidUserName'   => 'Chosen user name is invalid',
'NameAlphanumOnly'  => 'Username must be between %1 and %2 chars long and use only alphanumeric characters. Upper case characters are OK.',
'NameCamelCaseOnly' => 'Username must be between %1 and %2 chars long and WikiName formatted.',
Ut::perc_replace($this->_t($this->db->disable_wikiname? 'NameAlphanumOnly' : 'NameCamelCaseOnly'), $this->db->username_chars_min, $this->db->username_chars_max);
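The call above suggests that the %1 and %2 placeholders are filled from the remaining arguments in order. The following stand-in illustrates that reading only; perc_replace_sketch is a hypothetical function, the real Ut::perc_replace() may behave differently, and the limits 3 and 20 are invented sample values.

```php
<?php
// Hypothetical stand-in for Ut::perc_replace(); the real helper may differ.
function perc_replace_sketch(string $message, ...$args): string
{
    $map = [];

    foreach ($args as $i => $value)
    {
        // %1 refers to the first argument, %2 to the second, and so on
        $map['%' . ($i + 1)] = (string) $value;
    }

    // strtr() tries the longest keys first, so %10 is not mistaken for %1
    return strtr($message, $map);
}

$message = 'Username must be between %1 and %2 chars long and WikiName formatted.';
echo perc_replace_sketch($message, 3, 20), "\n";
// Username must be between 3 and 20 chars long and WikiName formatted.
```

A message for the page tag pattern could be wired up the same way once its wording has been agreed on.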