mirror of
git://git.sv.gnu.org/coreutils.git
synced 2026-02-16 04:12:26 +02:00
doc: add "version sort ordering" chapter
* doc/sort-version.texi: New file.
* doc/local.mk (doc_coreutils_TEXINFOS): Add new file.
* doc/coreutils.texi: @include new file, replace previous
"Details about version sort" section.
@@ -212,6 +212,7 @@ Free Documentation License''.
|
||||
* File permissions:: Access modes
|
||||
* File timestamps:: File timestamp issues
|
||||
* Date input formats:: Specifying date strings
|
||||
* Version sort ordering:: Details on version-sort algorithm
|
||||
* Opening the software toolbox:: The software tools philosophy
|
||||
* GNU Free Documentation License:: Copying and sharing this manual
|
||||
* Concept index:: General index
|
||||
@@ -315,7 +316,6 @@ Directory listing
|
||||
* Which files are listed:: Which files are listed
|
||||
* What information is listed:: What information is listed
|
||||
* Sorting the output:: Sorting the output
|
||||
* Details about version sort:: More details about version sort
|
||||
* General output formatting:: General output formatting
|
||||
* Formatting the file names:: Formatting the file names
|
||||
|
||||
@@ -495,6 +495,13 @@ Date input formats
|
||||
* Specifying time zone rules:: TZ="America/New_York", TZ="UTC0"
|
||||
* Authors of parse_datetime:: Bellovin, Eggert, Salz, Berets, et al.
|
||||
|
||||
Version sorting order
|
||||
|
||||
* Version sort overview::
|
||||
* Implementation Details::
|
||||
* Differences from the official Debian Algorithm::
|
||||
* Advanced Topics::
|
||||
|
||||
Opening the software toolbox
|
||||
|
||||
* Toolbox introduction:: Toolbox introduction
|
||||
@@ -4470,7 +4477,7 @@ To compare such strings numerically, use the
|
||||
@cindex version number sort
|
||||
Sort by version name and number. It behaves like a standard sort,
|
||||
except that each sequence of decimal digits is treated numerically
|
||||
as an index/version number. (@xref{Details about version sort}.)
|
||||
as an index/version number. (@xref{Version sort ordering}.)
|
||||
|
||||
@item -r
|
||||
@itemx --reverse
|
||||
@@ -7370,7 +7377,6 @@ Also see @ref{Common options}.
|
||||
* Which files are listed::
|
||||
* What information is listed::
|
||||
* Sorting the output::
|
||||
* Details about version sort::
|
||||
* General output formatting::
|
||||
* Formatting file timestamps::
|
||||
* Formatting the file names::
|
||||
@@ -7906,7 +7912,7 @@ directories, since not doing any sorting can be noticeably faster.
|
||||
@opindex version@r{, sorting option for @command{ls}}
|
||||
Sort by version name and number, lowest first. It behaves like a default
|
||||
sort, except that each sequence of decimal digits is treated numerically
|
||||
as an index/version number. (@xref{Details about version sort}.)
|
||||
as an index/version number. (@xref{Version sort ordering}.)
|
||||
|
||||
@item -X
|
||||
@itemx --sort=extension
|
||||
@@ -7919,60 +7925,6 @@ after the last @samp{.}); files with no extension are sorted first.
|
||||
@end table
|
||||
|
||||
|
||||
@node Details about version sort
|
||||
@subsection Details about version sort
|
||||
|
||||
Version sorting handles the fact that file names frequently include indices or
|
||||
version numbers. Standard sorting usually does not produce the order that one
|
||||
expects because comparisons are made on a character-by-character basis.
|
||||
Version sorting is especially useful when browsing directories that contain
|
||||
many files with indices/version numbers in their names:
|
||||
|
||||
@example
|
||||
$ ls -1 $ ls -1v
|
||||
abc.zml-1.gz abc.zml-1.gz
|
||||
abc.zml-12.gz abc.zml-2.gz
|
||||
abc.zml-2.gz abc.zml-12.gz
|
||||
@end example
|
||||
|
||||
Version-sorted strings are compared such that if @var{ver1} and @var{ver2}
|
||||
are version numbers and @var{prefix} and @var{suffix} (@var{suffix} matching
|
||||
the regular expression @samp{(\.[A-Za-z~][A-Za-z0-9~]*)*}) are strings then
|
||||
@var{ver1} < @var{ver2} implies that the name composed of
|
||||
``@var{prefix} @var{ver1} @var{suffix}'' sorts before
|
||||
``@var{prefix} @var{ver2} @var{suffix}''.
|
||||
|
||||
Note also that leading zeros of numeric parts are ignored:
|
||||
|
||||
@example
|
||||
$ ls -1 $ ls -1v
|
||||
abc-1.007.tgz abc-1.01a.tgz
|
||||
abc-1.012b.tgz abc-1.007.tgz
|
||||
abc-1.01a.tgz abc-1.012b.tgz
|
||||
@end example
|
||||
|
||||
This functionality is implemented using gnulib's @code{filevercmp} function,
|
||||
which has some caveats worth noting.
|
||||
|
||||
@itemize @bullet
|
||||
@item @env{LC_COLLATE} is ignored, which means @samp{ls -v} and @samp{sort -V}
|
||||
will sort non-numeric prefixes as if the @env{LC_COLLATE} locale category
|
||||
was set to @samp{C}@.
|
||||
@item Some suffixes will not be matched by the regular
|
||||
expression mentioned above. Consequently these examples may
|
||||
not sort as you expect:
|
||||
|
||||
@example
|
||||
abc-1.2.3.4.7z
|
||||
abc-1.2.3.7z
|
||||
@end example
|
||||
|
||||
@example
|
||||
abc-1.2.3.4.x86_64.rpm
|
||||
abc-1.2.3.x86_64.rpm
|
||||
@end example
|
||||
@end itemize
|
||||
|
||||
@node General output formatting
|
||||
@subsection General output formatting
|
||||
|
||||
@@ -18903,6 +18855,7 @@ to set a file's timestamp to an arbitrary value.
|
||||
|
||||
@include parse-datetime.texi
|
||||
|
||||
@include sort-version.texi
|
||||
|
||||
@c What's GNU?
|
||||
@c Arnold Robbins
|
||||
|
||||
@@ -22,7 +22,8 @@ doc_coreutils_TEXINFOS = \
|
||||
doc/perm.texi \
|
||||
doc/parse-datetime.texi \
|
||||
doc/constants.texi \
|
||||
doc/fdl.texi
|
||||
doc/fdl.texi \
|
||||
doc/sort-version.texi
|
||||
|
||||
# The following is necessary if the package name is 8 characters or longer.
|
||||
# If the info documentation would be split into 10 or more separate files,
|
||||
|
||||
902
doc/sort-version.texi
Normal file
@@ -0,0 +1,902 @@
|
||||
@c GNU Version-sort ordering documentation
|
||||
|
||||
@c Copyright (C) 2019 Free Software Foundation, Inc.
|
||||
|
||||
@c Permission is granted to copy, distribute and/or modify this document
|
||||
@c under the terms of the GNU Free Documentation License, Version 1.3 or
|
||||
@c any later version published by the Free Software Foundation; with no
|
||||
@c Invariant Sections, no Front-Cover Texts, and no Back-Cover
|
||||
@c Texts. A copy of the license is included in the ``GNU Free
|
||||
@c Documentation License'' file as part of this distribution.
|
||||
|
||||
@c Written by Assaf Gordon
|
||||
|
||||
@node Version sort ordering
|
||||
@chapter Version sort ordering
|
||||
|
||||
|
||||
|
||||
@node Version sort overview
|
||||
@section Version sort overview
|
||||
|
||||
@dfn{Version sort} ordering (and similarly, @dfn{natural sort}
|
||||
ordering) is a method to sort items such as file names and lines of
|
||||
text in an order that feels more natural to people, when the text
|
||||
contains a mixture of letters and digits.
|
||||
|
||||
Standard sorting usually does not produce the order that one expects
|
||||
because comparisons are made on a character-by-character basis.
|
||||
|
||||
Compare the sorting of the following items:
|
||||
|
||||
@example
|
||||
Alphabetical sort: Version Sort:
|
||||
|
||||
a1 a1
|
||||
a120 a2
|
||||
a13 a13
|
||||
a2 a120
|
||||
@end example
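
As a rough illustration of the two columns above (a Python sketch, not
the coreutils implementation; the helper name @code{digit_aware_key} is
made up for this example), plain string sorting compares character by
character, while a key that turns each run of ASCII digits into an
integer produces the second ordering:

```python
import re

def digit_aware_key(text):
    # Split into alternating non-digit / digit runs. The capture group
    # keeps the digit runs in the result: "a120" -> ["a", "120", ""].
    parts = re.split(r"([0-9]+)", text)
    # Odd positions hold the digit runs; compare those as integers.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

items = ["a1", "a120", "a13", "a2"]
alphabetical = sorted(items)
version_like = sorted(items, key=digit_aware_key)
print(alphabetical)
print(version_like)
```

This key only models the digit handling; the character ordering rules
described later in this chapter are not covered by it.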
|
||||
|
||||
Version sort functionality in GNU coreutils is available in the @samp{ls -v},
@samp{ls --sort=version}, @samp{sort -V}, and @samp{sort --version-sort} commands.
|
||||
|
||||
|
||||
|
||||
@node Using version sort in GNU coreutils
|
||||
@subsection Using version sort in GNU coreutils
|
||||
|
||||
Two GNU coreutils programs use version sort: @command{ls} and @command{sort}.
|
||||
|
||||
To list files in version sort order, use @command{ls}
|
||||
with @option{-v} or @option{--sort=version} options:
|
||||
|
||||
@example
|
||||
default sort: version sort:
|
||||
|
||||
$ ls -1 $ ls -1 -v
|
||||
a1 a1
|
||||
a100 a1.4
|
||||
a1.13 a1.13
|
||||
a1.4 a1.40
|
||||
a1.40 a2
|
||||
a2 a100
|
||||
@end example
|
||||
|
||||
To sort text files in version sort order, use @command{sort} with
|
||||
the @option{-V} option:
|
||||
|
||||
@example
|
||||
$ cat input
|
||||
b3
|
||||
b11
|
||||
b1
|
||||
b20
|
||||
|
||||
|
||||
alphabetical order: version sort order:
|
||||
|
||||
$ sort input $ sort -V input
|
||||
b1 b1
|
||||
b11 b3
|
||||
b20 b11
|
||||
b3 b20
|
||||
@end example
|
||||
|
||||
To sort by a specific column in a file, use @option{-k}/@option{--key} with
the @samp{V} ordering option:
|
||||
|
||||
@example
|
||||
$ cat input2
1000 b3 apples
2000 b11 oranges
3000 b1 potatoes
4000 b20 bananas

$ sort -k2V,2 input2
3000 b1 potatoes
1000 b3 apples
2000 b11 oranges
4000 b20 bananas
|
||||
@end example
|
||||
|
||||
@node Origin of version sort and differences from natural sort
|
||||
@subsection Origin of version sort and differences from natural sort
|
||||
|
||||
In GNU coreutils, the name @dfn{version sort} was chosen because it is based
|
||||
on Debian GNU/Linux's algorithm of sorting packages' versions.
|
||||
|
||||
Its goal is to answer the question
|
||||
``which package is newer, @file{firefox-60.7.2} or @file{firefox-60.12.3}?''
|
||||
|
||||
In coreutils this algorithm was slightly modified to work on more
|
||||
general input such as textual strings and file names
|
||||
(see @ref{Differences from the official Debian Algorithm}).
|
||||
|
||||
In other contexts, such as other programs and other programming
|
||||
languages, a similar sorting functionality is called
|
||||
@uref{https://en.wikipedia.org/wiki/Natural_sort_order,natural sort}.
|
||||
|
||||
|
||||
@node Correct/Incorrect ordering and Expected/Unexpected results
|
||||
@subsection Correct/Incorrect ordering and Expected/Unexpected results
|
||||
|
||||
Currently there is no standard for version/natural sort ordering.
|
||||
|
||||
That is: there is no one correct way or universally agreed-upon way to
|
||||
order items. Each program and each programming language can decide its
own ordering algorithm and call it ``natural sort'' (or various other
names).
|
||||
|
||||
See @ref{Other version/natural sort implementations} for many examples of
|
||||
differing sorting possibilities, each with its own rules and variations.
|
||||
|
||||
If you do suspect a bug in coreutils' implementation of version-sort,
|
||||
see @ref{Reporting bugs or incorrect results} on how to report them.
|
||||
|
||||
|
||||
@node Implementation Details
|
||||
@section Implementation Details
|
||||
|
||||
GNU coreutils' version sort algorithm is based on
|
||||
@uref{https://www.debian.org/doc/debian-policy/ch-controlfields.html#version,
|
||||
Debian's versioning scheme}, specifically on the ``upstream version''
|
||||
part.
|
||||
|
||||
This section describes the ordering rules.
|
||||
|
||||
The next section (@ref{Differences from the official Debian
|
||||
Algorithm}) describes some differences between the GNU coreutils
implementation and Debian's official algorithm.
|
||||
|
||||
|
||||
@node Version-sort ordering rules
|
||||
@subsection Version-sort ordering rules
|
||||
|
||||
The version sort ordering rules are:
|
||||
|
||||
@enumerate
|
||||
@item
|
||||
The strings are compared from left to right.
|
||||
|
||||
@item
|
||||
First the initial part of each string consisting entirely of non-digit
|
||||
characters is determined.
|
||||
|
||||
@enumerate
|
||||
@item
|
||||
These two parts (one of which may be empty) are compared lexically.
|
||||
If a difference is found it is returned.
|
||||
|
||||
@item
|
||||
The lexical comparison is a comparison of ASCII values modified so that:
|
||||
|
||||
@enumerate
|
||||
@item
|
||||
all the letters sort earlier than all the non-letters and
|
||||
@item
|
||||
a tilde sorts before anything, even the end of a part.
|
||||
@end enumerate
|
||||
@end enumerate
|
||||
|
||||
@item
|
||||
Then the initial part of the remainder of each string which consists
|
||||
entirely of digit characters is determined. The numerical values of
|
||||
these two parts are compared, and any difference found is returned as
|
||||
the result of the comparison.
|
||||
@enumerate
|
||||
@item
|
||||
For these purposes an empty string (which can only occur at the end of
|
||||
one or both version strings being compared) counts as zero.
|
||||
@end enumerate
|
||||
|
||||
@item
|
||||
These two steps (comparing and removing initial non-digit strings and
|
||||
initial digit strings) are repeated until a difference is found or
|
||||
both strings are exhausted.
|
||||
@end enumerate
|
||||
|
||||
Consider the version-sort comparison of two file names:
|
||||
@file{foo07.7z} and @file{foo7a.7z}. The two strings will be broken
|
||||
down to the following parts, and the parts compared respectively from
|
||||
each string:
|
||||
|
||||
@example
|
||||
foo @r{vs} foo @r{(rule 2, non-digit characters)}
07 @r{vs} 7 @r{(rule 3, digit characters)}
|
||||
. @r{vs} a. @r{(rule 2)}
|
||||
7 @r{vs} 7 @r{(rule 3)}
|
||||
z @r{vs} z @r{(rule 2)}
|
||||
@end example
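
The breakdown above can be reproduced with a short Python sketch. This
is an illustration under stated assumptions, not the gnulib code: the
function name @code{split_parts} is invented here, and only ASCII
digits are treated as digits.

```python
from itertools import groupby

def is_digit(c):
    # Only the ASCII digits 0-9 count as digits in this sketch.
    return "0" <= c <= "9"

def split_parts(name):
    # Group consecutive characters by "digit or not", giving
    # alternating runs of non-digits and digits.
    runs = ["".join(chars) for _, chars in groupby(name, key=is_digit)]
    # Rule 2 is applied first, so a name that starts with a digit
    # gets an empty leading non-digit part.
    if runs and is_digit(runs[0][0]):
        runs.insert(0, "")
    return runs

left = split_parts("foo07.7z")
right = split_parts("foo7a.7z")
print(left)
print(right)
```

Parts at even positions are then compared by rule 2 and parts at odd
positions by rule 3.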
|
||||
|
||||
Comparison flow based on above algorithm:
|
||||
|
||||
@enumerate
|
||||
@item
|
||||
The first parts (@code{foo}) are identical in both strings.
|
||||
|
||||
@item
|
||||
The second parts (@code{07} and @code{7}) are compared numerically,
|
||||
and are identical.
|
||||
|
||||
@item
|
||||
The third parts (@samp{@code{.}} vs @samp{@code{a.}}) are compared
|
||||
lexically by ASCII value (rule 2.2).
|
||||
|
||||
@item
|
||||
The first character of the first string (@samp{@code{.}}) is compared
|
||||
to the first character of the second string (@samp{@code{a}}).
|
||||
|
||||
@item
|
||||
Rule 2.2.1 dictates that ``all letters sort earlier than all non-letters''.
|
||||
Hence, @samp{@code{a}} comes before @samp{@code{.}}.
|
||||
|
||||
@item
|
||||
The returned result is that @file{foo7a.7z} comes before @file{foo07.7z}.
|
||||
@end enumerate
|
||||
|
||||
Result when using sort:
|
||||
|
||||
@example
|
||||
$ cat input3
|
||||
foo07.7z
|
||||
foo7a.7z
|
||||
|
||||
$ sort -V input3
|
||||
foo7a.7z
|
||||
foo07.7z
|
||||
@end example
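
Rules 1 to 4 can be sketched in Python as follows. This is a minimal
sketch, assuming ASCII input; it is not the gnulib @code{filevercmp}
source, it leaves out the coreutils-specific extensions described in
the next section, and the names @code{char_order},
@code{compare_nondigits}, @code{take} and @code{version_compare} are
invented for the example.

```python
import re
from functools import cmp_to_key

def char_order(c):
    # Rule 2.2: tilde first, then end of part, then letters, then the rest.
    if c == "~":
        return -1
    if c == "":
        return 0                      # end of the non-digit part
    if c.isascii() and c.isalpha():
        return ord(c)
    return ord(c) + 256               # non-letters sort after all letters

def compare_nondigits(a, b):
    # Walk both parts in step; a missing character means "end of part".
    for i in range(max(len(a), len(b))):
        diff = char_order(a[i:i + 1]) - char_order(b[i:i + 1])
        if diff:
            return diff
    return 0

def take(pattern, text):
    # Return (matched prefix, remainder); the pattern may match nothing.
    end = re.match(pattern, text).end()
    return text[:end], text[end:]

def version_compare(a, b):
    # Rule 4: repeat rule 2 and rule 3 until a difference is found
    # or both strings are exhausted.
    while a or b:
        part_a, a = take(r"[^0-9]*", a)          # rule 2
        part_b, b = take(r"[^0-9]*", b)
        diff = compare_nondigits(part_a, part_b)
        if diff:
            return diff
        num_a, a = take(r"[0-9]*", a)            # rule 3
        num_b, b = take(r"[0-9]*", b)
        diff = int(num_a or "0") - int(num_b or "0")   # empty counts as zero
        if diff:
            return diff
    return 0

ordered = sorted(["foo07.7z", "foo7a.7z"], key=cmp_to_key(version_compare))
print(ordered)
```

A negative result means the first string sorts earlier, a positive
result means it sorts later, and zero means the rules found no
difference.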
|
||||
|
||||
See @ref{Differences from the official Debian Algorithm} for
|
||||
additional rules that extend the Debian algorithm in coreutils.
|
||||
|
||||
|
||||
@node Version sort is not the same as numeric sort
|
||||
@subsection Version sort is not the same as numeric sort
|
||||
|
||||
Consider the following text file:
|
||||
|
||||
@example
|
||||
$ cat input4
|
||||
8.10
|
||||
8.5
|
||||
8.1
|
||||
8.01
|
||||
8.010
|
||||
8.100
|
||||
8.49
|
||||
|
||||
|
||||
|
||||
Numerical Sort: Version Sort:
|
||||
|
||||
$ sort -n input4 $ sort -V input4
|
||||
8.01 8.01
|
||||
8.010 8.1
|
||||
8.1 8.5
|
||||
8.10 8.010
|
||||
8.100 8.10
|
||||
8.49 8.49
|
||||
8.5 8.100
|
||||
@end example
|
||||
|
||||
Numeric sort (@samp{sort -n}) treats the entire string as a single numeric
|
||||
value, and compares it to other values. For example, @code{8.1}, @code{8.10} and
|
||||
@code{8.100} are numerically equivalent, and are ordered together. Similarly,
|
||||
@code{8.49} is numerically smaller than @code{8.5}, and appears first.
|
||||
|
||||
Version sort (@samp{sort -V}) first breaks down the string into digit and
non-digit parts, and only then compares each part (see the annotated
example in @ref{Version-sort ordering rules}).
|
||||
|
||||
Comparing the string @code{8.1} to @code{8.01}, first the
|
||||
@samp{@code{8}} characters are compared (and are identical), then the
|
||||
dots (@samp{@code{.}}) are compared and are identical, and lastly the
|
||||
remaining digits are compared numerically (@code{1} and @code{01}),
which are numerically equivalent. Hence, @code{8.01} and @code{8.1}
|
||||
are grouped together.
|
||||
|
||||
Similarly, when comparing @code{8.5} to @code{8.49}, the @samp{@code{8}}
and @samp{@code{.}} parts are identical, then the numeric values @code{5} and
@code{49} are compared. As a result, @code{8.5} appears before @code{8.49}.
|
||||
|
||||
This sorting order (where @code{8.5} comes before @code{8.49}) is common when
|
||||
assigning versions to computer programs (while perhaps not intuitive
|
||||
or 'natural' for people).
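
The contrast between the two columns can be approximated in Python.
This is a sketch for dotted numbers only, not what @command{sort} does
internally; the key function names are made up, and breaking ties on
the raw text is an assumption meant to roughly imitate the last-resort
comparison @command{sort} applies to lines whose keys compare equal.

```python
lines = ["8.10", "8.5", "8.1", "8.01", "8.010", "8.100", "8.49"]

def numeric_key(text):
    # The whole string as one number; the text itself breaks ties.
    return (float(text), text)

def version_key(text):
    # Each dot-separated field as its own integer; ties broken by the text.
    return ([int(field) for field in text.split(".")], text)

numeric_order = sorted(lines, key=numeric_key)
version_order = sorted(lines, key=version_key)
print(numeric_order)
print(version_order)
```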
|
||||
|
||||
@node Punctuation Characters
|
||||
@subsection Punctuation Characters
|
||||
|
||||
Punctuation characters are sorted by ASCII order (rule 2.2).
|
||||
|
||||
@example
|
||||
$ touch 1.0.5_src.tar.gz 1.0_src.tar.gz
|
||||
|
||||
$ ls -v -1
|
||||
1.0.5_src.tar.gz
|
||||
1.0_src.tar.gz
|
||||
@end example
|
||||
|
||||
Why is @file{1.0.5_src.tar.gz} listed before @file{1.0_src.tar.gz} ?
|
||||
|
||||
Based on the @ref{Version-sort ordering rules,algorithm,algorithm}
|
||||
above, the strings are broken down into the following parts:
|
||||
|
||||
@example
|
||||
1 @r{vs} 1 @r{(rule 3, all digit characters)}
|
||||
. @r{vs} . @r{(rule 2, all non-digit characters)}
|
||||
0 @r{vs} 0 @r{(rule 3)}
|
||||
. @r{vs} _src.tar.gz @r{(rule 2)}
|
||||
5 @r{vs} empty string @r{(no more characters in the file name)}
|
||||
_src.tar.gz @r{vs} empty string
|
||||
@end example
|
||||
|
||||
The fourth parts (@samp{@code{.}} and @code{_src.tar.gz}) are compared
|
||||
lexically by ASCII order. The character @samp{@code{.}} (ASCII value 46) is
|
||||
smaller than @samp{@code{_}} (ASCII value 95), and should be listed before it.
|
||||
|
||||
Hence, @file{1.0.5_src.tar.gz} is listed first.
|
||||
|
||||
If a different character appears instead of the underscore (for
example, the percent sign @samp{@code{%}}, ASCII value 37, which is smaller
than the dot's ASCII value of 46), that file will be listed first:
|
||||
|
||||
@example
|
||||
$ touch 1.0.5_src.tar.gz 1.0%zzzzz.gz

$ ls -v -1
|
||||
1.0%zzzzz.gz
|
||||
1.0.5_src.tar.gz
|
||||
@end example
|
||||
|
||||
The same reasoning applies to the following example: the character
@samp{@code{.}} has ASCII value 46, and is smaller than the slash
character @samp{@code{/}} (ASCII value 47):
|
||||
|
||||
@example
|
||||
$ cat input5
|
||||
3.0/
|
||||
3.0.5
|
||||
|
||||
$ sort -V input5
|
||||
3.0.5
|
||||
3.0/
|
||||
@end example
|
||||
|
||||
|
||||
@node Punctuation Characters vs letters
|
||||
@subsection Punctuation Characters vs letters
|
||||
|
||||
Rule 2.2.1 dictates that letters sort earlier than all non-letters
(after breaking down a string into digit and non-digit parts).
|
||||
|
||||
@example
|
||||
$ cat input6
|
||||
a%
|
||||
az
|
||||
|
||||
$ sort -V input6
|
||||
az
|
||||
a%
|
||||
@end example
|
||||
|
||||
The input strings consist entirely of non-digits, and based on the
|
||||
above algorithm have only one part, all non-digit characters
|
||||
(@samp{@code{a%}} vs @samp{@code{az}}).
|
||||
|
||||
Each part is then compared lexically,
|
||||
character-by-character. @samp{@code{a}} compares identically in both
|
||||
strings.
|
||||
|
||||
Rule 2.2.1 dictates that letters (@samp{@code{z}}) sort earlier than all
non-letters (@samp{@code{%}}), hence @code{az} appears first (despite
@samp{@code{z}} having ASCII value 122, much bigger than @samp{@code{%}}
with ASCII value 37).
|
||||
|
||||
@node Tilde @samp{~} character
|
||||
@subsection Tilde @samp{~} character
|
||||
|
||||
Rule 2.2.2 dictates that the tilde character @samp{~} (ASCII 126) sorts
before all other non-digit characters, including an empty part.
|
||||
|
||||
@example
|
||||
$ cat input7
|
||||
1
|
||||
1%
|
||||
1.2
|
||||
1~
|
||||
~
|
||||
|
||||
$ sort -V input7
|
||||
~
|
||||
1~
|
||||
1
|
||||
1%
|
||||
1.2
|
||||
@end example
|
||||
|
||||
The sorting algorithm starts by breaking down the string into
non-digit (rule 2) and digit (rule 3) parts.
|
||||
|
||||
In the above input file, only the last line starts
with a non-digit (@code{~}). This is the first part. All other lines
start with a digit, so their first non-digit part is
empty.
|
||||
|
||||
Based on rule 2.2.2, tilde @code{~} sorts before all other non-digits
|
||||
including the empty part - hence it comes before all other strings,
|
||||
and is listed first in the sorted output.
|
||||
|
||||
The remaining lines (@code{1}, @code{1%}, @code{1.2}, @code{1~})
|
||||
follow similar logic: the digit part is extracted (@code{1} for all strings)
and compares as identical. The next extracted parts for the remaining
input lines are: empty part, @code{%}, @code{.}, @code{~}.
|
||||
|
||||
Tilde sorts before all others, hence the line @code{1~} appears next.
|
||||
|
||||
The remaining lines (@code{1}, @code{1%}, @code{1.2}) are sorted based
|
||||
on previously explained rules.
|
||||
|
||||
@node Version sort ignores locale
|
||||
@subsection Version sort uses ASCII order, ignores locale, Unicode characters
|
||||
|
||||
In version sort, Unicode characters are compared byte-by-byte according
to their binary representation, ignoring their Unicode value or the
current locale.
|
||||
|
||||
Most commonly, Unicode characters (e.g. Greek Small Letter Alpha
|
||||
U+03B1 @samp{α}) are encoded as UTF-8 bytes (e.g. @samp{α} is encoded as UTF-8
|
||||
sequence @code{0xCE 0xB1}). The encoding will be compared byte-by-byte,
|
||||
e.g. first @code{0xCE} (decimal value 206) then @code{0xB1} (decimal value 177).
|
||||
|
||||
@example
|
||||
$ touch aa az "a%" "aα"
|
||||
|
||||
$ ls -1 -v
|
||||
aa
|
||||
az
|
||||
a%
|
||||
aα
|
||||
@end example
|
||||
|
||||
Ignoring the first letter (@code{a}) which is identical in all
|
||||
strings, the compared values are:
|
||||
|
||||
@samp{@code{a}} and @samp{@code{z}} are letters, and sort earlier than
|
||||
all other non-digit characters.
|
||||
|
||||
Then, the percent sign @samp{@code{%}} (ASCII value 37) is compared to the
first byte of the UTF-8 sequence of @samp{@code{α}} (which is @code{0xCE}, or 206). The
value 37 is smaller, hence @samp{@code{a%}} is listed before @samp{@code{aα}}.
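
The byte values involved can be inspected with a few lines of Python.
This sketch only shows the raw UTF-8 bytes and their plain byte-wise
comparison; it does not model the ``letters first'' rule that places
@code{aa} and @code{az} ahead of the other two names.

```python
alpha = "\u03b1"                       # Greek small letter alpha
alpha_bytes = list(alpha.encode("utf-8"))
print([hex(b) for b in alpha_bytes])   # the two bytes of the UTF-8 sequence

names = ["aa", "az", "a%", "a" + alpha]
for name in names:
    # bytes objects print in ASCII-safe form
    print(name.encode("utf-8"))

# Byte-wise comparison: 37 (percent sign) against 206 (first byte of alpha).
percent_first = "a%".encode("utf-8") < ("a" + alpha).encode("utf-8")
print(percent_first)
```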
|
||||
|
||||
@node Differences from the official Debian Algorithm
|
||||
@section Differences from the official Debian Algorithm
|
||||
|
||||
The GNU coreutils' version sort algorithm differs slightly from the
|
||||
official Debian algorithm, in order to accommodate more general usage
|
||||
and file name listing.
|
||||
|
||||
|
||||
@node Minus/Hyphen @samp{-} and Colons @samp{:} characters
|
||||
@subsection Minus/Hyphen @samp{-} and Colons @samp{:} characters
|
||||
|
||||
In Debian's version string syntax the version consists of three parts:
|
||||
@code{[epoch:]upstream_version[-debian_revision]} (@code{epoch} and
|
||||
@code{debian_revision} are optional).
|
||||
|
||||
Example of such version strings:
|
||||
|
||||
@example
|
||||
60.7.2esr-1~deb9u1
|
||||
52.9.0esr-1~deb9u1
|
||||
1:2.3.4-1+b2
|
||||
327-2
|
||||
1:1.0.13-3
|
||||
2:1.19.2-1+deb9u5
|
||||
@end example
|
||||
|
||||
If the @code{debian_revision} part is not present,
|
||||
hyphen characters @samp{-} are not allowed.
|
||||
If epoch is not present, colons @samp{:} are not allowed.
|
||||
|
||||
If these parts are present, hyphen and/or colons can appear only once
|
||||
in valid Debian version strings.
|
||||
|
||||
In GNU coreutils, such restrictions are not reasonable (a file name can
|
||||
have many hyphens, a line of text can have many colons).
|
||||
|
||||
As a result, in GNU coreutils hyphens and colons are treated exactly
like all other punctuation characters (i.e., they are sorted after
letters; see @ref{Punctuation Characters}).
|
||||
|
||||
In Debian, these characters are treated differently than in coreutils:
a version string with a hyphen will sort before similar strings without
hyphens.
|
||||
|
||||
Compare:
|
||||
|
||||
@example
|
||||
$ touch abb ab-cd
|
||||
|
||||
$ ls -v -1
|
||||
abb
|
||||
ab-cd
|
||||
@end example
|
||||
|
||||
With Debian's @command{dpkg} they will be listed as @code{ab-cd} first and
|
||||
@code{abb} second.
|
||||
|
||||
For further technical details see @uref{https://bugs.gnu.org/35939,bug35939}.
|
||||
|
||||
@node Additional hard-coded priorities In GNU coreutils' version sort
|
||||
@subsection Additional hard-coded priorities in GNU coreutils' version sort
|
||||
|
||||
In GNU coreutils' version sort algorithm, the following items have
|
||||
special priority and sort earlier than all other characters (listed in
|
||||
order):
|
||||
|
||||
@enumerate
|
||||
@item The empty string
|
||||
|
||||
@item The string @samp{@code{.}} (a single dot character, ASCII 46)
|
||||
|
||||
@item The string @samp{@code{..}} (two dot characters)
|
||||
|
||||
@item Strings starting with a dot (@samp{@code{.}}) sort earlier than
strings starting with any other character.
|
||||
@end enumerate
|
||||
|
||||
Example:
|
||||
|
||||
@example
|
||||
$ printf "%s\n" a "" b "." c ".." ".d20" ".d3" | sort -V
|
||||
|
||||
.
|
||||
..
|
||||
.d3
|
||||
.d20
|
||||
a
|
||||
b
|
||||
c
|
||||
@end example
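
The four priorities can be sketched in Python as a sort key. This is an
illustration only: the function names are invented, and
@code{digit_runs_as_numbers} is a simplified stand-in for the full
version comparison, used here just to order strings of equal priority.

```python
import re

def priority(text):
    # Lower value sorts earlier: "", ".", "..", other dot-names, the rest.
    if text == "":
        return 0
    if text == ".":
        return 1
    if text == "..":
        return 2
    if text.startswith("."):
        return 3
    return 4

def digit_runs_as_numbers(text):
    # Simplified stand-in: runs of ASCII digits (odd positions after
    # the split) are compared as integers, everything else as text.
    parts = re.split(r"([0-9]+)", text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def sort_key(text):
    return (priority(text), digit_runs_as_numbers(text))

items = ["a", "", "b", ".", "c", "..", ".d20", ".d3"]
ordered = sorted(items, key=sort_key)
print(ordered)
```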
|
||||
|
||||
These priorities make perfect sense for @samp{ls -v}: The special
|
||||
files dot @samp{@code{.}} and dot-dot @samp{@code{..}} will be listed
|
||||
first, followed by any hidden files (files starting with a dot),
|
||||
followed by non-hidden files.
|
||||
|
||||
For @samp{sort -V} these priorities might seem arbitrary. However,
|
||||
because the sorting code is shared between the @command{ls} and @command{sort} programs,
|
||||
the ordering rules are the same.
|
||||
|
||||
@node Special handling of file extensions
|
||||
@subsection Special handling of file extensions
|
||||
|
||||
GNU coreutils' version sort algorithm implements specialized handling
|
||||
of file extensions (or strings that look like file names with
|
||||
extensions).
|
||||
|
||||
This nuanced implementation enables slightly more natural ordering of files.
|
||||
|
||||
The additional rules are:
|
||||
|
||||
@enumerate
|
||||
@item
|
||||
A suffix (i.e., a file extension) is defined as: a dot, followed by a
|
||||
letter or tilde, followed by zero or more letters, digits, or tildes
|
||||
(possibly repeated more than once), until the end of the string
|
||||
(technically, matching the regular expression
|
||||
@code{(\.[A-Za-z~][A-Za-z0-9~]*)*}).
|
||||
|
||||
@item
|
||||
If the strings contain suffixes, the suffixes are temporarily
|
||||
removed, and the strings are compared without them (using the
|
||||
@ref{Version-sort ordering rules,algorithm,algorithm} above).
|
||||
|
||||
@item
|
||||
If the suffix-less strings are identical, the suffix is restored and
|
||||
the entire strings are compared.
|
||||
|
||||
@item
|
||||
If the non-suffixed strings differ, the result is returned and the
|
||||
suffix is effectively ignored.
|
||||
@end enumerate
|
||||
|
||||
Examples for rule 1:
|
||||
|
||||
@itemize
|
||||
@item
|
||||
@code{hello-8.txt}: the suffix is @code{.txt}
|
||||
|
||||
@item
|
||||
@code{hello-8.2.txt}: the suffix is @code{.txt}
|
||||
(@samp{@code{.2}} is not included because the dot is not followed by a letter)
|
||||
|
||||
@item
|
||||
@code{hello-8.0.12.tar.gz}: the suffix is @code{.tar.gz} (@samp{@code{.0.12}}
|
||||
is not included)
|
||||
|
||||
@item
|
||||
@code{hello-8.2}: no suffix (suffix is an empty string)
|
||||
|
||||
@item
|
||||
@code{hello.foobar65}: the suffix is @code{.foobar65}
|
||||
|
||||
@item
|
||||
@code{gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2}: the suffix is
|
||||
@code{.fc9.tar.bz2} (@code{.7rc2} is not included as it begins with a digit)
|
||||
@end itemize
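
The suffix definition from rule 1 can be tried out with Python's
@code{re} module. This is a sketch under the assumption that anchoring
the quoted regular expression at the end of the string captures the
intended suffix; the names @code{SUFFIX} and @code{split_suffix} are
made up for the example.

```python
import re

# The pattern quoted in rule 1, anchored at the end of the string.
SUFFIX = re.compile(r"(?:\.[A-Za-z~][A-Za-z0-9~]*)*\Z")

def split_suffix(name):
    # search() returns the leftmost match, which is the longest chain
    # of ".ext" components that reaches the end of the name. The
    # pattern can match the empty string, so a match always exists.
    start = SUFFIX.search(name).start()
    return name[:start], name[start:]

examples = ["hello-8.txt", "hello-8.2.txt", "hello-8.0.12.tar.gz",
            "hello-8.2", "hello.foobar65",
            "gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2"]
for name in examples:
    print(split_suffix(name))
```

Each result is a pair: the string that takes part in the first
comparison, and the suffix that is set aside.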
|
||||
|
||||
Examples for rule 2:
|
||||
|
||||
@itemize
|
||||
@item
|
||||
Comparing @code{hello-8.txt} to @code{hello-8.2.12.txt}, the
|
||||
@code{.txt} suffix is temporarily removed from both strings.
|
||||
|
||||
@item
|
||||
Comparing @code{foo-10.3.tar.gz} to @code{foo-10.tar.xz}, the suffixes
|
||||
@code{.tar.gz} and @code{.tar.xz} are temporarily removed from the
|
||||
strings.
|
||||
@end itemize
|
||||
|
||||
Example for rule 3:
|
||||
|
||||
@itemize
|
||||
@item
|
||||
Comparing @code{hello.foobar65} to @code{hello.foobar4}, the suffixes
|
||||
(@code{.foobar65} and @code{.foobar4}) are temporarily removed. The
|
||||
remaining strings are identical (@code{hello}). The suffixes are then
|
||||
restored, and the entire strings are compared (@code{hello.foobar4} comes
|
||||
first).
|
||||
@end itemize
|
||||
|
||||
Examples for rule 4:
|
||||
|
||||
@itemize
|
||||
@item
|
||||
When comparing the strings @code{hello-8.2.txt} and @code{hello-8.10.txt}, the
|
||||
suffixes (@code{.txt}) are temporarily removed. The remaining strings
|
||||
(@code{hello-8.2} and @code{hello-8.10}) are compared as previously described
|
||||
(@code{hello-8.2} comes first).
|
||||
@slanted{(In this case the suffix removal algorithm
|
||||
does not have a noticeable effect on the resulting order.)}
|
||||
@end itemize
|
||||
|
||||
@b{How does the suffix-removal algorithm affect ordering results?}
|
||||
|
||||
Consider the comparison of @file{hello-8.txt} and @file{hello-8.2.txt}.
|
||||
|
||||
Without the suffix-removal algorithm, the strings will be broken down
|
||||
to the following parts:
|
||||
|
||||
@example
|
||||
hello- @r{vs} hello- @r{(rule 2, all non-digit characters)}
|
||||
8 @r{vs} 8 @r{(rule 3, all digit characters)}
|
||||
.txt @r{vs} . @r{(rule 2)}
|
||||
empty @r{vs} 2
|
||||
empty @r{vs} .txt
|
||||
@end example
|
||||
|
||||
The comparison of the third parts (@samp{@code{.}} vs
@samp{@code{.txt}}) will determine that the shorter string comes first,
resulting in @file{hello-8.2.txt} appearing first.
|
||||
|
||||
Indeed this is the order in which Debian's @command{dpkg} compares the strings.
|
||||
|
||||
A more natural result is that @file{hello-8.txt} should come before
|
||||
@file{hello-8.2.txt}, and this is where the suffix-removal comes into play:
|
||||
|
||||
The suffixes (@code{.txt}) are removed, and the remaining strings are
|
||||
broken down into the following parts:
|
||||
|
||||
@example
|
||||
hello- @r{vs} hello- @r{(rule 2, all non-digit characters)}
|
||||
8 @r{vs} 8 @r{(rule 3, all digit characters)}
|
||||
empty @r{vs} . @r{(rule 2)}
|
||||
empty @r{vs} 2
|
||||
@end example
|
||||
|
||||
As empty strings sort before non-empty strings, the result is @code{hello-8}
|
||||
being first.
|
||||
|
||||
A real-world example would be listing files such as:
|
||||
@file{gcc_10.fc9.tar.gz}
|
||||
and @file{gcc_10.8.12.7rc2.fc9.tar.bz2}: Debian's algorithm would list
|
||||
@file{gcc_10.8.12.7rc2.fc9.tar.bz2} first, while @samp{ls -v} will list
|
||||
@file{gcc_10.fc9.tar.gz} first.
|
||||
|
||||
These priorities make sense for @samp{ls -v}:
|
||||
Versioned files will be listed in a more natural order.
|
||||
|
||||
For @samp{sort -V} these priorities might seem arbitrary. However,
|
||||
because the sorting code is shared between the @command{ls} and @command{sort} programs,
|
||||
the ordering rules are the same.
|
||||
|
||||
|
||||
@node Advanced Topics
|
||||
@section Advanced Topics
|
||||
|
||||
|
||||
@node Comparing two strings using Debian's algorithm
|
||||
@subsection Comparing two strings using Debian's algorithm
|
||||
|
||||
The Debian program @command{dpkg} (available on all Debian and Ubuntu
|
||||
installations) can compare two strings using the @option{--compare-versions}
|
||||
option.
|
||||
|
||||
To use it, create a helper shell function (simply copy & paste the
|
||||
following snippet to your shell command-prompt):
|
||||
|
||||
@example
|
||||
compver() @{
|
||||
dpkg --compare-versions "$1" lt "$2" \
|
||||
&& printf "%s\n" "$1" "$2" \
|
||||
|| printf "%s\n" "$2" "$1" ; \
|
||||
@}
|
||||
@end example
|
||||
|
||||
Then compare two strings by calling compver:
|
||||
|
||||
@example
|
||||
$ compver 8.49 8.5
|
||||
8.5
|
||||
8.49
|
||||
@end example
|
||||
|
||||
Note that @command{dpkg} will warn if the strings have invalid syntax:

@example
$ compver "foo07.7z" "foo7a.7z"
dpkg: warning: version 'foo07.7z' has bad syntax:
               version number does not start with digit
dpkg: warning: version 'foo7a.7z' has bad syntax:
               version number does not start with digit
foo7a.7z
foo07.7z

$ compver "3.0/" "3.0.5"
dpkg: warning: version '3.0/' has bad syntax:
               invalid character in version number
3.0.5
3.0/
@end example

To illustrate the different handling of hyphens between Debian and
coreutils' algorithms (see
@ref{Minus/Hyphen @samp{-} and Colons @samp{:} characters}):

@example
$ compver abb ab-cd 2>/dev/null
ab-cd
abb

$ printf "abb\nab-cd\n" | sort -V
abb
ab-cd
@end example

To illustrate the different handling of file extensions (see @ref{Special
handling of file extensions}):

@example
$ compver hello-8.txt hello-8.2.txt 2>/dev/null
hello-8.2.txt
hello-8.txt

$ printf "%s\n" hello-8.txt hello-8.2.txt | sort -V
hello-8.txt
hello-8.2.txt
@end example



@node Reporting bugs or incorrect results
@subsection Reporting bugs or incorrect results

If you suspect a bug in GNU coreutils' version sort (i.e., in the
output of @samp{ls -v} or @samp{sort -V}), please first check the following:

@enumerate
@item
Is the result consistent with Debian's own ordering (using @command{dpkg}, see
@ref{Comparing two strings using Debian's algorithm})? If it is, then this
is not a bug; please do not report it.

@item
If the result differs from Debian's, is it explained by one of the
sections in @ref{Differences from the official Debian Algorithm}? If it is,
then this is not a bug; please do not report it.

@item
If you have a question about a specific ordering which is not explained
here, please write to @email{coreutils@@gnu.org}, and provide a
concise example that will help us diagnose the issue.

@item
If you still suspect a bug which is not explained by the above, please
write to @email{bug-coreutils@@gnu.org} with a concrete example of the
suspected incorrect output, with details on why you think it is
incorrect.

@end enumerate

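A concise example for such a message might be gathered as follows.  This
is only a sketch: @code{verreport} is a hypothetical helper, not part of
coreutils, and the layout of its output is merely a suggestion.  It
prints the first line of @samp{sort --version}, the input strings, and
the order in which @samp{sort -V} returns them.

```shell
# Illustrative helper: collect the sort version, the input strings and
# the observed version-sort order into one block of text.
verreport() {
  sort --version | head -n 1
  printf 'input:\n'
  printf '  %s\n' "$@"
  printf 'sort -V output:\n'
  printf '%s\n' "$@" | LC_ALL=C sort -V | sed 's/^/  /'
}

verreport hello-8.txt hello-8.2.txt
```

Adding the ordering you expected, and why, completes the report.
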
@node Other version/natural sort implementations
@subsection Other version/natural sort implementations

As previously mentioned, there are multiple variations on
version/natural sort, each with its own rules. Some examples are:

@itemize

@item
Natural sorting variants on
@uref{https://rosettacode.org/wiki/Natural_sorting,Rosetta Code}.

@item
Python's @uref{https://pypi.org/project/natsort/,natsort package}
(includes a detailed description of its sorting rules:
@uref{https://natsort.readthedocs.io/en/master/howitworks.html,
natsort - how it works}).

@item
Ruby's @uref{https://github.com/github/version_sorter,version_sorter}.

@item
Perl has multiple packages for natural and version sorts
(each likely with its own rules and nuances):
@uref{https://metacpan.org/pod/Sort::Naturally,Sort::Naturally},
@uref{https://metacpan.org/pod/Sort::Versions,Sort::Versions},
@uref{https://metacpan.org/pod/CPAN::Version,CPAN::Version}.

@item
PHP has a built-in function
@uref{https://www.php.net/manual/en/function.natsort.php,natsort}.

@item
Node.js's @uref{https://www.npmjs.com/package/natural-sort,natural-sort package}.

@item
In zsh, the
@uref{http://zsh.sourceforge.net/Doc/Release/Expansion.html#Glob-Qualifiers,
glob qualifier} @code{*(n)} will expand to files in natural sort order.

@item
When writing @code{C} programs, the GNU C Library (@code{glibc})
provides the
@uref{http://man7.org/linux/man-pages/man3/strverscmp.3.html,
strverscmp(3)} function to compare two strings, and the
@uref{http://man7.org/linux/man-pages/man3/versionsort.3.html,versionsort(3)}
function to compare two directory entries (despite the names, they are
not identical to GNU coreutils' version sort ordering).

@item
Using Debian's sorting algorithm in:

@itemize
@item
Python: @uref{https://stackoverflow.com/a/4957741,
Stack Overflow Example #4957741}.

@item
Node.js: @uref{https://www.npmjs.com/package/deb-version-compare,
deb-version-compare}.
@end itemize

@end itemize


@node Related Source code
@subsection Related Source code

@itemize

@item
Debian's code which splits a version string into
@code{epoch/upstream_version/debian_revision} parts:
@uref{https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/parsehelp.c#n191,
parsehelp.c:parseversion()}.

@item
Debian's code which performs the @code{upstream_version} comparison:
@uref{https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/version.c#n140,
version.c}.

@item
Gnulib code (used by GNU coreutils) which performs the version comparison:
@uref{https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/filevercmp.c,
filevercmp.c}.
@end itemize