mirror of git://git.sv.gnu.org/coreutils.git synced 2026-04-10 14:13:31 +02:00

31429 Commits

Author SHA1 Message Date
Bruno Haible
e2e8d9f389 tests: Avoid accidental matching of the vendor field of $host
* tests/chgrp/basic.sh: Test $host_os, not $host_triplet.
* tests/chown/separator.sh: Likewise.
* tests/rm/r-root.sh: Likewise.
* tests/tail/pipe-f.sh: Likewise.
* tests/tail/tail-c.sh: Likewise.
* tests/tee/tee.sh: Likewise.
* tests/touch/dangling-symlink.sh: Likewise.
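
The failure mode can be sketched as follows. This is an illustration only: the os_of_triplet helper and the sample triplet are made up for the example, and the tests themselves simply consult $host_os.

```shell
# Hypothetical helper: strip the cpu and vendor fields from a
# cpu-vendor-os triplet, leaving only the operating system part.
os_of_triplet() {
  rest=${1#*-}                  # drop "cpu-"
  printf '%s\n' "${rest#*-}"    # drop "vendor-"
}

# Made-up triplet whose vendor field happens to contain "linux".
host_triplet='x86_64-mylinuxbox-freebsd14.0'
host_os=$(os_of_triplet "$host_triplet")

# Matching the whole triplet is fooled by the vendor field ...
case $host_triplet in
  *linux*) echo "triplet match: wrongly treated as Linux" ;;
esac

# ... whereas matching only the OS part is not.
case $host_os in
  *linux*) echo "os match: treated as Linux" ;;
  *)       echo "os match: not Linux ($host_os)" ;;
esac
```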
2026-04-10 11:09:10 +01:00
Collin Funk
f634724b8f env: avoid locking standard output for each printed variable
* src/env.c (main): Use fputs and putchar instead of printf.
2026-04-09 21:59:12 -07:00
Collin Funk
77ba5a9815 printenv: avoid locking standard output for each printed variable
* src/printenv.c (main): Use fputs and putchar instead of printf.
2026-04-09 21:53:28 -07:00
Pádraig Brady
d0fa892c16 maint: remove last remaining assert()
* src/split.c (bytes_chunk_extract): Prefer affirm to assert,
as it allows for better static checking when compiling with -DNDEBUG.
2026-04-09 21:36:43 +01:00
Pádraig Brady
eef9ce5071 maint: move tty-eof.pl to misc directory
* tests/tty/tty-eof.pl: Rename to ...
* tests/misc/tty-eof.pl: ... this more general directory.
* tests/local.mk: Adjust accordingly.
2026-04-09 21:25:27 +01:00
Pádraig Brady
e5912211de tests: tty-eof.pl: address FIXME re hardcoded Ctrl-d
* tests/tty/tty-eof.pl: Try to explicitly set EOF char to Ctrl-d
in case it's different.
2026-04-09 21:25:27 +01:00
Pádraig Brady
574d8660c7 tests: tty-eof.pl: make fully table driven
* tests/tty/tty-eof.pl: Remove command specific logic,
and adjust commands to support general input.
Also add cut -b, as cut_bytes has its own read loop.
2026-04-09 21:25:27 +01:00
Pádraig Brady
f312af49a2 tests: all: check empty tty input is handled appropriately
* tests/tty/tty-eof.pl: Test all commands twice.
Once with input and once with empty input.
2026-04-09 21:25:27 +01:00
Pádraig Brady
d27e48c122 maint: cat: avoid coverity NULL dereference warning
* src/cat.c (ensure_buf_size): Affirm we won't return NULL.
2026-04-09 21:25:27 +01:00
Pádraig Brady
3131fbf0e1 cat: avoid memory allocation per file
* src/cat.c (main): Only resize the allocated buffer when needed,
which avoids per file heap manipulation and mmap/munmap syscalls.
2026-04-09 16:39:12 +01:00
Pádraig Brady
e0a641674a cat: fix splice() from empty input
* src/cat.c (splice_cat): Ensure we don't retry a read() after
splice() completes, as this is significant on a tty.
2026-04-09 16:39:06 +01:00
oech3
00ba75a02a tests: tee: ensure intermittent data is handled
* tests/tee/tee.sh: Add test case for input from pipe containing sleep.
https://github.com/coreutils/coreutils/pull/247
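
A rough illustration of the kind of input involved, not the test itself: a writer that pauses between two chunks feeds tee through a pipe, and both chunks should reach standard output and the file. The file name and the one second pause are arbitrary choices.

```shell
tmp=$(mktemp -d) || exit 1

# The writer pauses between chunks, so tee receives the data in pieces.
got=$( { printf 'first\n'; sleep 1; printf 'second\n'; } | tee "$tmp/copy" )

printf '%s\n' "$got"          # what tee wrote to standard output
saved=$(cat "$tmp/copy")      # what tee wrote to the file
[ "$got" = "$saved" ] && echo "stdout and file copy agree"

rm -rf "$tmp"
```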
2026-04-09 12:08:57 +01:00
Collin Funk
bd5d98e8a7 maint: touch: prefer timespec_cmp
* src/touch.c (main): Use timespec_cmp instead of comparing each member
of the timespec.
2026-04-08 22:34:13 -07:00
Pádraig Brady
745f4000fe tests: date: fix false failure on OpenBSD 7.8
* tests/date/date.pl: Set the max supported year to INT_MAX.
Most systems support INT_MAX+1900, but mktime() on OpenBSD 7.8
limits the passed tm_year to INT_MAX.
Reported by Bruno Haible.
2026-04-07 23:17:36 +01:00
Pádraig Brady
5b0cb4379a tests: numfmt: avoid false failure on systems without long double
* tests/numfmt/numfmt.pl: Move recently added test that depends
on long double support to the appropriately guarded set.
Also reduce the value to be in the definitely safe long double range.
Reported by Bruno Haible.
2026-04-07 21:47:21 +01:00
Pádraig Brady
2e7fc39ec1 maint: cut: avoid discarded-qualifiers warnings
Seen with GCC 15.2.1 and glibc 2.43 on Arch.
Not seen with GCC 15.2.1 and glibc 2.42 on Fedora.

* src/cut.c (search_bytes): Cast the return from memchr()
to avoid const propagation.
(find_field_delim): Adjust the return from strstr() similarly.
https://github.com/coreutils/coreutils/issues/244
2026-04-07 14:57:18 +01:00
Pádraig Brady
aaae9d0db6 tests: cat: avoid false failure on systems without splice
* tests/cat/splice.sh: Ensure splice is called multiple times
before we check specific invocation counts.
On Linux kernel 5.10 for example, splice from /dev/zero
returns EINVAL.
2026-04-07 12:20:10 +01:00
Collin Funk
457f88513a cat: use splice if operating on pipes or if copy_file_range fails
On an AMD Ryzen 7 3700X system:

    $ timeout 10 taskset 1 ./src/cat-prev /dev/zero \
        | taskset 2 pv -r > /dev/null
    [1.67GiB/s]
    $ timeout 10 taskset 1 ./src/cat /dev/zero \
        | taskset 2 pv -r > /dev/null
    [9.03GiB/s]

On a Power10 system:

    $ taskset 1 ./src/yes | timeout 10 taskset 2 ./src/cat-prev \
        | taskset 3 pv -r > /dev/null
    [12.9GiB/s]
    $ taskset 1 ./src/yes | timeout 10 taskset 2 ./src/cat \
            | taskset 3 pv -r > /dev/null
    [81.8GiB/s]

* NEWS: Mention the improvement.
* src/cat.c: Include isapipe.h, splice.h, and unistd--.h.
(splice_cat): New function.
(main): Use it.
* src/local.mk (noinst_HEADERS): Add src/splice.h.
* src/splice.h: New file, based on definitions from src/yes.c.
* src/yes.c: Include splice.h.
(pipe_splice_size): Use increase_pipe_size from src/splice.h.
(SPLICE_PIPE_SIZE): Remove definition, moved to src/splice.h.
* tests/cat/splice.sh: New file, based on some tests in
tests/misc/yes.sh.
* tests/local.mk (all_tests): Add the new test.
2026-04-06 21:57:07 -07:00
Collin Funk
77a0bac87d build: update gnulib submodule to latest
For the Gnulib commit 2c480fa522 (mbrtowc, mbrtoc32: Silence -Wshadow
warnings (regr. 2026-04-02)., 2026-04-06).
2026-04-06 17:07:33 -07:00
Pádraig Brady
7064be3061 build: cut: fix compilation error on non C23 compilers
* src/cut.c (main): Add curly brackets around variable
declaration in case label.
Reported by Bruno Haible.
2026-04-06 22:37:15 +01:00
Sylvestre Ledru
374a14e841 tests: date: add large year test
* tests/date/date.pl: Add the test case.
Add test case for https://github.com/uutils/coreutils/issues/9774
to verify with large dates.
https://github.com/coreutils/coreutils/pull/237
2026-04-06 19:22:24 +01:00
Paul Eggert
f48cc50f7a maint: revert “avoid pthread_sigmask lock”
* configure.ac (GNULIB_SIGACTION_SINGLE_THREAD): Remove.
This never worked (it was a misspelling) and the properly-spelled
identifier (whose spelling has since been renamed) is useful
mostly for programs like gzip that do not need Gnulib’s ‘lock’ module.
For coreutils, which needs ‘lock’ for other reasons, it’s overkill.

maint: avoid pthread_sigmask lock overhead
This matters only for MS-Windows.
* configure.ac (GNULIB_PTHREAD_SIGMASK_SINGLE_THREAD):
Define this instead of defining GNULIB_SIGACTION_SINGLE_THREAD.
The latter was a typo, and Gnulib has evolved anyway.
2026-04-06 11:15:10 -07:00
Paul Eggert
cff1fa2239 maint: simplify c32issep
* src/system.h (c32issep): Avoid unnecessary ‘!!’.
2026-04-06 11:15:10 -07:00
Sylvestre Ledru
793f45e916 tests: expr: add short-circuit tests with parenthesized branches
* tests/expr/expr.pl: Add tests to verify that short-circuit
evaluation of | and & correctly skips parenthesized dead branches,
including nested parenthesized expressions containing division by zero.
https://github.com/uutils/coreutils/pull/11395
https://github.com/coreutils/coreutils/pull/238
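
The behavior being tested can be seen from the shell. This sketch assumes an expr that implements the short-circuit behavior these tests check for, and the operands are chosen for illustration rather than copied from tests/expr/expr.pl.

```shell
# The right operand of | is dead once the left operand is non-zero,
# so the division by zero inside the parentheses must not be evaluated.
or_result=$(expr 1 \| \( 1 / 0 \))
echo "or: $or_result"

# Likewise for & once the left operand is zero.  expr exits with
# status 1 when its result is 0, hence the "|| true".
and_result=$(expr 0 \& \( 1 / 0 \)) || true
echo "and: $and_result"

# Nested parentheses in the dead branch are skipped as well.
nested=$(expr 5 \| \( 2 \* \( 3 / 0 \) \))
echo "nested: $nested"
```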
2026-04-06 18:22:56 +01:00
Sylvestre Ledru
829593317d tests: split: verify non-UTF-8 bytes are preserved in filenames
* tests/split/non-utf8.sh: New test to ensure that non-UTF-8 bytes
in the prefix and --additional-suffix are preserved as-is in output
filenames, rather than being replaced by UTF-8 replacement characters.
* tests/local.mk: Register new test.
https://github.com/uutils/coreutils/pull/11397
https://github.com/coreutils/coreutils/pull/239
2026-04-06 17:53:45 +01:00
Sylvestre Ledru
2625209807 tests: ln: add test for non-UTF-8 source names in target-dir mode
* tests/ln/non-utf8-src.sh: New test ensuring ln handles source
filenames containing non-UTF-8 bytes when linking into a target
directory, for both hard links and symbolic links with -t.
* tests/local.mk: Register the new test.
https://github.com/uutils/coreutils/pull/11403
https://github.com/coreutils/coreutils/pull/240
2026-04-06 17:29:24 +01:00
Sylvestre Ledru
fac454616b test: od: verify -t f defaults to double precision
* tests/od/od-float.sh: Add cases to ensure -t f = -t fD,
and also verify the resulting number.
https://github.com/uutils/coreutils/pull/11396
https://github.com/coreutils/coreutils/pull/241
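
One way to check the equivalence by hand, as a sketch rather than the test's actual contents: feed the same eight bytes to od with -t f and with -t fD, then compare. The byte values are arbitrary, and the printed number depends on the host's byte order and floating point format, so only the two outputs are compared with each other.

```shell
# Eight arbitrary bytes (the IEEE 754 double 1.0 on little-endian hosts).
bytes='\000\000\000\000\000\000\360\077'

plain=$(printf "$bytes" | od -An -t f)
explicit=$(printf "$bytes" | od -An -t fD)

echo "-t f : $plain"
echo "-t fD: $explicit"
[ "$plain" = "$explicit" ] && echo "same output"
```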
2026-04-06 17:16:05 +01:00
Sylvestre Ledru
3a9901daad tests: ls: add quoting-utf8 test for Unicode quotes in UTF-8 locales
* tests/ls/quoting-utf8.sh: New test verifying that
--quoting-style=locale and --quoting-style=clocale use Unicode
left/right single quotation marks in UTF-8 locales, and that
embedded apostrophes and double quotes are not escaped when the
delimiters are different characters.
Also check C locale fallback to ASCII quotes.
* tests/local.mk: Reference the new test.
https://github.com/coreutils/coreutils/pull/243
2026-04-06 16:29:15 +01:00
Sylvestre Ledru
262bc9f49e tests: numfmt: cover GNU/uutils compatibility edge cases
* tests/numfmt/numfmt.pl: Add tests exercising corner cases around
negative-argument rejection, large integer precision, scientific
notation rejection, '--from-unit' fractional precision, zero-padded
format sign ordering, '--to-unit' prefix selection, and
'--format=%.0f' with '--to=<scale>'.
https://github.com/uutils/coreutils/pull/11668
2026-04-06 15:57:51 +01:00
Pádraig Brady
a325c99781 doc: document cut(1) multi-byte and interface consolidation
This patch set updates cut(1) to be multi-byte aware.
It also reduces interface divergence across implementations.

Multi-byte awareness was added to the existing -c, -n, and -d options.
Also considered for compatibility are the -w, -F, and -O options,
as these are present on at least two other common implementations.

= Interface / New functionality =

    macOS,  i18n, uutils, Toybox, Busybox, GNU
-c    x      x       x      x        x      x
-n    x      x                              x
-w    x              x                      x
-F                          x        x      x
-O                          x        x      x

-c is needed anyway as specified by all, including POSIX.
-n is needed also as specified by i18n/macOS/POSIX
-w is somewhat less important, but seeing as it's
on two other common platforms (and its functionality is
provided on two more), providing it is worthwhile for compat.
-F and -O are really just aliases to other options
so trivial to add, and probably worthwhile for compatibility.

Interface / functionality notes:

There is a slight divergence between -n implementations.
There was already a difference between FreeBSD and i18n, and
we've aligned with the more sensible FreeBSD implementation.
Note the i18n -n implementation is otherwise buggy in any case,
so I doubt this will be a practical compatibility concern.
Actually -n is specified by POSIX, and it matches FreeBSD.
Specifically our -n will not output a character unless the
byte range encompasses _the end_ of the multi-byte character.
I.e. the -b is a limit that is not passed, and thus ensures
we don't output overlapping characters for separate cut
invocations that do not have overlapping byte ranges.
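
The rule can be modelled without cut itself. The following sketch is a simplified model, not the code in src/cut.c: the hypothetical select_chars helper takes a byte range and the byte length of each character on a line, and keeps a character only when its last byte falls inside the range.

```shell
# Model only: print the indexes of the characters that a byte range
# LO-HI would keep under -n, given each character's length in bytes.
select_chars() {
  lo=$1 hi=$2
  shift 2
  end=0 idx=0 out=
  for len in "$@"; do
    idx=$((idx + 1))
    end=$((end + len))      # offset of this character's last byte
    if [ "$end" -ge "$lo" ] && [ "$end" -le "$hi" ]; then
      out="$out${out:+ }$idx"
    fi
  done
  printf '%s\n' "$out"
}

# "aééb" in UTF-8 has character lengths 1 2 2 1 (6 bytes in total).
select_chars 1 2  1 2 2 1   # prints "1"
select_chars 3 4  1 2 2 1   # prints "2"
select_chars 5 6  1 2 2 1   # prints "3 4"
```

The three adjacent byte ranges select characters 1, 2, and 3 4 respectively, so every character is output exactly once across the separate invocations.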

-d <regex> from toybox is not implemented.
That's edge case functionality IMHO and not well suited to cut(1).
This functionality is supported by awk, and regex functionality
is best restricted to awk I think.

cut is a significant part of the i18n patch, so it will be good
to avoid that downstream divergence.  Unfortunately there were
no tests with the cut i18n implementation.
Note the i18n cut implementation used fread() and so was
not responsive to new data < BUFSIZ, whereas this implementation
uses read() and thus is responsive to data as it becomes available.

= Performance =

General performance notes:

We prefer byte searching (with -d) as that can be much faster
than character by character processing, and it's supported
on single byte and UTF-8 charsets.  We also use byte searching
with -w on uni-byte locales.
This was seen to give up to 100x perf increase over the i18n patch.
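
Byte searching is safe for an ASCII delimiter in UTF-8 because every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte such as ',' can only ever be the delimiter itself. A small demonstration, using octal escapes for the UTF-8 encodings of é and á, and forcing the C locale so that the split is done purely on bytes:

```shell
# "é,á" encoded as UTF-8: é = \303\251, á = \303\241.
line='\303\251,\303\241\n'

# In the C locale, cut splits on the single byte 0x2C (',').
second=$(printf "$line" | LC_ALL=C cut -d, -f2)

# Dump the second field: it should be the two bytes of "á", intact.
printf '%s' "$second" | od -An -t o1
```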

Where we do use per character processing, we avoid conversion to
wide char when processing ASCII data (mcel provides this optimization).
This was seen to give a 14x performance increase over the i18n patch.

We prefer memchr() and strstr() as these are tuned for specific
platforms on glibc, even if memchr2() or memmem()
are algorithmically better.

We maintain the important memory behavior
of only buffering when necessary.

Performance testing:

There are _lots_ of combinations and optimization opportunities.
I performance tested this patch set with the following setup:

$ yes | head -n10M > sl.in
$ yes $(yes eeeaae | head -n10K | paste -s -d,) | head -n10K > ll.in
$ yes $(yes eeeaae | head -n9 | paste -s -d,) | head -n1M > as.in
$ yes $(yes éééááé | head -n9 | paste -s -d,) | head -n1M \
  > mb.in

$ for type in sl ll as mb; do
    cat $type.in >/dev/null;
    for imp in '' src/; do  # '' maps to the system i18n ver on Fedora
      echo ============ "${imp:-i18n}" $type ==============;
      for d in -d, -dc -d, -dç -w -b -c; do
        fields='-f1 -f10 -f100'
        test "$d" = "-b" && { fields='-b1 -b10 -b100'; d=''; }
        test "$d" = "-c" && { fields='-c1 -c10 -c100'; d=''; }
        for f in $fields; do
          for loc in C C.UTF-8; do
            # Skip -b for UTF-8 as it is no different
            test "$loc" = C.UTF-8 && echo "$f" | grep -q -- -b \
             && continue
            # Skip multi-byte delimiters for C as they are not allowed
            test "$loc" = C && test $(echo -n "$d" | wc -c) -ge 4 \
             && continue
            LC_ALL=$loc ${imp}cut $f $d /dev/null 2>/dev/null &&
            hyperfine -m2 -M4 \
             "LC_ALL=$loc ${imp}cut $f $d $type.in >/dev/null" ||
            printf 'Benchmark 1: %s\n  unsupported\n\n' \
             "LC_ALL=$loc ${imp}cut $f $d $type.in >/dev/null"
          done;
        done;
      done;
    done;
  done

After a little post-processing of the results, we get:

-- cut-i18n

| command         |       sl |       ll |       as |       mb |
| --------------- | -------- | -------- | -------- | -------- |
| C -f1 -d,       |  66.3 ms |  1.605 s | 145.9 ms | 366.4 ms |
| UTF8 -f1 -d,    |  65.8 ms |  1.593 s | 145.8 ms | 370.0 ms |
| C -f10 -d,      | 301.4 ms |  1.590 s | 161.8 ms | 126.7 ms |
| UTF8 -f10 -d,   | 303.5 ms |  1.599 s | 161.8 ms | 124.6 ms |
| C -f100 -d,     | 300.6 ms |  1.596 s | 162.1 ms | 126.7 ms |
| UTF8 -f100 -d,  | 301.3 ms |  1.595 s | 162.0 ms | 124.9 ms |
| C -f1 -dc       |  66.6 ms |  1.845 s | 179.1 ms | 365.7 ms |
| UTF8 -f1 -dc    |  73.8 ms |  1.878 s | 179.1 ms | 363.1 ms |
| C -f10 -dc      | 300.7 ms | 349.8 ms |  76.0 ms | 125.3 ms |
| UTF8 -f10 -dc   | 300.4 ms | 347.2 ms |  75.7 ms | 124.8 ms |
| C -f100 -dc     | 300.1 ms | 348.1 ms |  76.5 ms | 125.5 ms |
| UTF8 -f100 -dc  | 300.8 ms | 348.7 ms |  76.4 ms | 125.8 ms |
| UTF8 -f1 -d,   | 563.5 ms | 21.775 s |  1.963 s |  1.665 s |
| UTF8 -f10 -d,  | 833.6 ms | 20.504 s |  2.022 s |  1.612 s |
| UTF8 -f100 -d, | 825.2 ms | 20.448 s |  2.009 s |  1.616 s |
| UTF8 -f1 -dç    | 563.7 ms | 21.827 s |  1.964 s |  2.319 s |
| UTF8 -f10 -dç   | 825.3 ms | 21.713 s |  2.011 s |  2.248 s |
| UTF8 -f100 -dç  | 831.6 ms | 20.505 s |  2.019 s |  2.276 s |
| C -f1 -w        |        - |        - |        - |        - |
| UTF8 -f1 -w     |        - |        - |        - |        - |
| C -f10 -w       |        - |        - |        - |        - |
| UTF8 -f10 -w    |        - |        - |        - |        - |
| C -f100 -w      |        - |        - |        - |        - |
| UTF8 -f100 -w   |        - |        - |        - |        - |
| C -b1           |  60.8 ms |  1.596 s | 154.8 ms | 313.7 ms |
| C -b10          |  51.6 ms |  1.594 s | 154.3 ms | 310.8 ms |
| C -b100         |  51.4 ms |  1.594 s | 153.0 ms | 312.2 ms |
| C -c1           |  60.7 ms |  1.597 s | 153.8 ms | 313.0 ms |
| UTF8 -c1        | 526.5 ms | 14.662 s |  1.362 s |  1.573 s |
| C -c10          |  51.8 ms |  1.591 s | 153.3 ms | 311.4 ms |
| UTF8 -c10       | 436.9 ms | 14.450 s |  1.336 s |  1.563 s |
| C -c100         |  51.0 ms |  1.593 s | 152.7 ms | 313.2 ms |
| UTF8 -c100      | 426.7 ms | 14.429 s |  1.344 s |  1.551 s |

-- src/cut

| command         |       sl |       ll |       as |       mb |
| --------------- | -------- | -------- | -------- | -------- |
| C -f1 -d,       |   4.6 ms | 108.2 ms |  45.4 ms |  24.2 ms |
| UTF8 -f1 -d,    |   4.8 ms | 108.4 ms |  45.4 ms |  24.5 ms |
| C -f10 -d,      |   4.5 ms | 109.3 ms | 123.7 ms |  24.3 ms |
| UTF8 -f10 -d,   |   4.9 ms | 114.1 ms | 124.1 ms |  24.5 ms |
| C -f100 -d,     |   4.7 ms | 119.2 ms | 124.1 ms |  24.5 ms |
| UTF8 -f100 -d,  |   4.8 ms | 120.0 ms | 125.1 ms |  24.5 ms |
| C -f1 -dc       |   4.4 ms | 120.5 ms |  11.9 ms |  24.1 ms |
| UTF8 -f1 -dc    |   4.9 ms | 120.5 ms |  12.1 ms |  24.6 ms |
| C -f10 -dc      |   4.7 ms | 125.3 ms |  11.8 ms |  24.1 ms |
| UTF8 -f10 -dc   |   4.8 ms | 126.7 ms |  12.0 ms |  24.4 ms |
| C -f100 -dc     |   4.6 ms | 127.0 ms |  11.9 ms |  24.3 ms |
| UTF8 -f100 -dc  |   4.7 ms | 126.4 ms |  12.0 ms |  24.4 ms |
| UTF8 -f1 -d,   |   6.0 ms | 169.4 ms |  15.6 ms |  67.4 ms |
| UTF8 -f10 -d,  |   6.1 ms | 173.9 ms |  15.6 ms | 237.2 ms |
| UTF8 -f100 -d, |   6.1 ms | 174.0 ms |  15.6 ms | 237.8 ms |
| UTF8 -f1 -dç    |   6.3 ms | 170.8 ms |  15.7 ms |  32.2 ms |
| UTF8 -f10 -dç   |   6.0 ms | 172.9 ms |  15.9 ms |  32.1 ms |
| UTF8 -f100 -dç  |   6.7 ms | 173.1 ms |  15.5 ms |  32.3 ms |
| C -f1 -w        | 159.6 ms | 170.1 ms |  69.1 ms |  98.9 ms |
| UTF8 -f1 -w     | 128.1 ms |  2.525 s | 246.5 ms |  1.086 s |
| C -f10 -w       | 183.3 ms | 199.2 ms |  74.6 ms | 105.0 ms |
| UTF8 -f10 -w    | 130.3 ms |  2.659 s | 276.5 ms |  1.099 s |
| C -f100 -w      | 183.8 ms | 202.5 ms |  74.1 ms | 103.6 ms |
| UTF8 -f100 -w   | 130.1 ms |  2.663 s | 276.6 ms |  1.097 s |
| C -b1           |  65.0 ms | 110.2 ms |  22.4 ms |  35.6 ms |
| C -b10          |  48.7 ms | 109.6 ms |  24.2 ms |  36.7 ms |
| C -b100         |  48.7 ms | 110.6 ms |  19.0 ms |  36.6 ms |
| C -c1           |  65.8 ms | 109.5 ms |  22.4 ms |  35.6 ms |
| UTF8 -c1        |  63.2 ms |  1.130 s | 116.9 ms | 610.2 ms |
| C -c10          |  48.7 ms | 109.8 ms |  24.3 ms |  36.8 ms |
| UTF8 -c10       |  39.7 ms |  1.133 s | 118.7 ms | 610.0 ms |
| C -c100         |  48.3 ms | 110.7 ms |  18.9 ms |  36.7 ms |
| UTF8 -c100      |  39.4 ms |  1.141 s | 115.0 ms | 598.8 ms |

In summary, compared to the i18n patch we're now as fast in all cases,
and much faster in most cases.

We can see the -f byte searching performing well,
being 120x faster in the no matching delimiter case,
to at least 3x faster in the matching delimiter case.

When we resort to per character processing we also compare well,
being 14x faster in the ASCII processing case
(due to mcel short-circuiting the wide char conversion).
Note the mb.in processing results above also show a 2x win
in per character processing cases, but the i18n patch would have
also picked that win up as it's achieved separately to this patch set:
https://lists.gnu.org/r/coreutils/2026-03/msg00117.html
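
The ratios can be recomputed from the two tables. The sketch below uses a hypothetical speedup helper and three rows picked as examples, with times converted to milliseconds. The summary does not say which cells its headline figures come from, so these are not necessarily the same ones.

```shell
# Hypothetical helper.  Usage: speedup OLD_MS NEW_MS
speedup() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1fx\n", old / new }'
}

speedup 21827 170.8   # UTF8 -f1 -dç on ll.in  -> 127.8x
speedup 14662 1130    # UTF8 -c1 on ll.in      -> 13.0x
speedup 301.4 4.5     # C -f10 -d, on sl.in    -> 67.0x
```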
2026-04-06 15:52:58 +01:00
Pádraig Brady
65ee5e6d0a tests: cut: add remaining tests to ensure 100% coverage
* tests/cut/cut.pl: Add new tests to ensure
`make coverage` shows 100% coverage for cut.c.
2026-04-06 15:52:58 +01:00
Pádraig Brady
e4258ef02f tests: cut: expand GB18030 tests
* tests/cut/mb-non-utf8.sh: Add more test cases.
2026-04-06 15:52:58 +01:00
Paul Eggert
4c6cf6043a maint: cut: refactor delimiter handling
* src/cut.c: Use mcel_scanz() to parse in all cases,
and avoid redundant storage of delimiter_length and
the single byte delim.
2026-04-06 15:52:58 +01:00
Pádraig Brady
1a44a25808 cut: -f: fix handling of multi-byte delimiters that span buffers
* src/cut.c (cut_fields_bytesearch): Ensure up to delim_bytes - 1
bytes are left for the next refill.
* tests/cut/cut.pl: Add a test case.
2026-04-06 15:52:58 +01:00
Pádraig Brady
57c87043f6 cut,fold,expand,unexpand: ensure we process all available characters
* gl/lib/mbbuf.h: Adjust mbbuf_fill() to process full characters
in the slop at the end of a read().  Previously valid characters
in the last MCEL_LEN_MAX bytes were ignored until the next read().
* src/cut.c (cut_fields_bytesearch): Adjust to the new naming.
* NEWS: Mention the fold(1) responsiveness fix, which was
improved with the change from fread() to read(),
and completed with this patch.
2026-04-06 15:52:56 +01:00
Pádraig Brady
1b4b8104d3 cut: -b: avoid function calls in hot loop
  $ time LC_ALL=C src/cut-before -b1 sl.in >/dev/null
  real	0m0.115s

  $ time LC_ALL=C src/cut-after -b1 sl.in >/dev/null
  real	0m0.076s

* src/cut.c (cut_bytes): Hoist the fileno() invariant outside the loop.
Avoid memchr for very short lines.
(search_bytes): Similar to copy_bytes() and write_bytes() helpers.
Note adding code to probe 3 or 4 bytes resulted in worse register
allocation.  I.e. slower operation even if the input was only 2 bytes.
2026-04-05 13:15:56 +01:00
Pádraig Brady
374ff00c36 cut: fix logic issue with field delim in last byte of buffer
With field delimiter = line delimiter we need to know
if there is any more data to be read, as field delimiter
in the last byte of the file is treated differently.
So reiterate the loop to ensure enough read()s to make
the appropriate determination.
2026-04-05 13:15:56 +01:00
Pádraig Brady
d757d32f86 cut: ensure responsive input processing
* gl/lib/mbbuf.h (fill_buf): Switch from fread() to read()
as the former retries read() internally to fill the buffer.
* src/cut.c: Adjust accordingly, and avoid getc() interface entirely.
* bootstrap.h: Depend explicitly on fseterr.  This is already depended
on transitively, so should not introduce new build portability issues.
2026-04-05 13:15:56 +01:00
Pádraig Brady
8daf2a91be maint: cut: rename line_in to bytes_in
* src/cut.c: We're not reading a line, rather a buffer of bytes.
Suggested by Collin Funk.
2026-04-05 13:15:56 +01:00
Pádraig Brady
1c9e223298 tests: cut: add more multi-byte tests
* tests/cut/cut.pl: Add more multi-byte combinations.
2026-04-05 13:15:56 +01:00
Pádraig Brady
76edef14d8 cut: make the dependency on memchr2 explicit
* bootstrap.conf: Remove now unused getndelim2, add memchr2.
* src/cut.c: Remove now unused getndelim2.h.
2026-04-05 13:15:56 +01:00
Pádraig Brady
890cf82593 cut: combine cut_bytes_no_split and cut_characters
Both are per character based, so merge them.
2026-04-05 13:15:56 +01:00
Pádraig Brady
fe00823330 doc: cut: clarify that combining characters are not treated specially
This is for consistency with other implementations and since the
interface separates -b and -c it might in future support -g (graphemes).
Normalizing content with a filter seems like the most appropriate
approach anyway, as there are various normalizations possible including
case etc., rather than baking that into every tool.
2026-04-05 13:15:56 +01:00
Pádraig Brady
25f0702eaa maint: cut: various code cleanups and comments
* src/cut.c: Document some functions, and remove extraneous
 abstractions.
2026-04-05 13:15:56 +01:00
Pádraig Brady
5d339d583d cut: support no delimiter match fast path with -s
* src/cut.c (cut_fields_bytesearch): Just skip the data with -s.
2026-04-05 13:15:56 +01:00
Pádraig Brady
c3e819fadc doc: cut: reinstate and expand -d info
* doc/coreutils.texi (cut invocation): Add back the -d description,
and adjust for multi-byte support, and expand on specifying a NUL
delimiter, and detail the behavior when the delimiter matches
the line delimiter.
2026-04-05 13:15:56 +01:00
Pádraig Brady
032ecdee9b maint: cut: cleanup context management for byte search
* src/cut.c: Hoist at_eof into context so we're not
querying it multiple times.  Also add a helper
to explicitly init bytesearch_context.
2026-04-05 13:15:56 +01:00
Pádraig Brady
24571c41f3 cut: optimize UTF-8 input with 0xF5-0xFF delimiters
* src/cut.c (bytesearch_field_delim_ok): Expand the range
of bytes that can be simply searched for. 0xF5-0xFF can't
appear in valid UTF-8 characters, and so may be used as
delimiters in UTF-8 input, so it's worth optimizing for.
* tests/cut/cut.pl: Add a test case (mainly as documentation).
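
The reasoning follows from the layout of UTF-8. A rough classification of byte values, written as a hypothetical utf8_byte_class helper (not code from src/cut.c), shows where 0xF5-0xFF falls:

```shell
# Classify a byte value (given in decimal) by the role it can play
# in well-formed UTF-8.
utf8_byte_class() {
  if   [ "$1" -le 127 ]; then echo ascii          # 0x00-0x7F
  elif [ "$1" -le 191 ]; then echo continuation   # 0x80-0xBF
  elif [ "$1" -le 193 ]; then echo invalid        # 0xC0-0xC1 (overlong)
  elif [ "$1" -le 244 ]; then echo lead           # 0xC2-0xF4
  else                        echo invalid        # 0xF5-0xFF
  fi
}

utf8_byte_class 44    # ','
utf8_byte_class 195   # 0xC3, the first byte of "é"
utf8_byte_class 255   # 0xFF
```

Since a byte classified as invalid can never be part of a character, finding it with a plain byte search cannot split one.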
2026-04-05 13:15:56 +01:00
Pádraig Brady
03a686a456 doc: cut: clarify that -s suppressed lines with only trimmed spaces
* doc/coreutils.texi (cut invocation): State explicitly that
-s --whitespace-delimited=trimmed will suppress lines that
do not have field separating blanks.
2026-04-05 13:15:56 +01:00
Pádraig Brady
2c1ea231ca doc: cut: mention the default -O used with -w
* doc/coreutils.texi (cut invocation): Mention the default
--output-delimiter is a TAB when matching runs of blanks in the input.
2026-04-05 13:15:56 +01:00