mirror of git://git.sv.gnu.org/coreutils.git synced 2026-04-12 06:57:33 +02:00
Commit Graph

31368 Commits

Author SHA1 Message Date
Pádraig Brady
a16d56d60c cut: optimize -b by avoiding per byte iteration
Always memchr(line_delim) which is fast and allows:

- skipping whole segments when the next selected byte is beyond them
- skipping unselected prefixes in bulk
- writing contiguous selected spans in bulk

This wins for lines >= 4 characters,
but is slower for lines <= 3 characters, especially when selecting bytes 1-3.
That case is unusual though.
2026-04-05 13:15:56 +01:00
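The bulk approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from src/cut.c: the function name select_byte_range and its single-range interface are assumptions made for the example, and unlike cut it does not add a missing final line delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from src/cut.c.  Copy bytes LO..HI
   (1-based, inclusive) of each LINE_DELIM-terminated line in
   BUF[0..LEN) to OUT, keeping the line delimiters.  OUT must have
   room for LEN bytes.  Return the number of bytes written.  */
static size_t
select_byte_range (char const *buf, size_t len, char line_delim,
                   size_t lo, size_t hi, char *out)
{
  assert (1 <= lo && lo <= hi);
  size_t n_out = 0;
  char const *p = buf;
  char const *end = buf + len;

  while (p < end)
    {
      /* One memchr per line locates the line end; no per-byte loop.  */
      char const *eol = memchr (p, line_delim, end - p);
      size_t line_len = (eol ? eol : end) - p;

      /* Skip the unselected prefix, then copy the selected span in bulk.  */
      if (lo <= line_len)
        {
          size_t span = (hi < line_len ? hi : line_len) - (lo - 1);
          memcpy (out + n_out, p + (lo - 1), span);
          n_out += span;
        }

      if (!eol)
        break;
      out[n_out++] = line_delim;
      p = eol + 1;
    }
  return n_out;
}
```

For very short lines the fixed cost of the memchr call can exceed the cost of a plain byte loop, which is consistent with the slowdown the commit message notes for lines of 3 characters or fewer.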
Pádraig Brady
ea6a7ba547 cut: optimize when no delimiter in input
This is about 20x faster.
Note we only do the delimiter search once per chunk,
and it's usually quick as a delimiter, if present, wouldn't be
far into a chunk, so we don't bother
to cache the found delimiter.
2026-04-05 13:15:56 +01:00
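A minimal sketch of the per-chunk check, assuming the chunk ends on a line boundary and that -s is not in effect (so lines without a delimiter are printed whole). The name chunk_is_passthrough is hypothetical and does not come from src/cut.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper.  Return true if CHUNK[0..LEN) contains no DELIM
   byte.  Under the assumptions stated above, such a chunk could be
   written out unchanged, with no per-line field parsing.  */
static bool
chunk_is_passthrough (char const *chunk, size_t len, char delim)
{
  assert (chunk || len == 0);
  /* A single memchr over the whole chunk; it returns early at the first
     delimiter, so the cost is small when delimiters are present.  */
  return len == 0 || memchr (chunk, delim, len) == NULL;
}
```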
Pádraig Brady
7d017f83bc tests: cut: ensure multi-byte delimiter is rejected in uni-byte locales
tests/cut/cut.pl: Check the appropriate diagnostic is presented.
2026-04-05 13:15:56 +01:00
Pádraig Brady
700ffc51a1 cut: optimize -w for uni-byte case
* src/cut.c: Limit search to SPACE and TAB
2026-04-05 13:15:56 +01:00
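In a uni-byte locale the blank search can be reduced to a two-byte scan, along these lines. The helper name find_blank is an assumption for illustration; it stands in for whatever routine src/cut.c actually uses (an earlier commit message in this series mentions memchr2 as a candidate).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper.  Return a pointer to the first SPACE or TAB in
   BUF[0..LEN), or NULL if there is none.  */
static char const *
find_blank (char const *buf, size_t len)
{
  assert (buf || len == 0);
  for (size_t i = 0; i < len; i++)
    if (buf[i] == ' ' || buf[i] == '\t')
      return buf + i;
  return NULL;
}
```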
Pádraig Brady
f5b7d38d13 doc: cut: reorder -s in texi
Keep in alphabetical order.
2026-04-05 13:15:56 +01:00
Pádraig Brady
c1d7b492c6 doc: cut: document the -w option
* src/cut.c (usage): Mention blank characters are used to separate.
* doc/coreutils.texi (cut invocation): Likewise.  Also describe
the 'trimmed' argument and the relation to -F.
2026-04-05 13:15:56 +01:00
Pádraig Brady
cf25ef286a cut: refactor find_bytesearch_field_terminator to be stateful
Allows better/simpler avoidance of repeated line/delim scans

TODO: speed up our really slow cut_fields_mb_any.
Compare for example:
  time src/cut -w  -f1 ll.in >/dev/null #14s
  time src/cut -d, -f1 ll.in >/dev/null #.1s
Could adjust so that LC_ALL=C does memchr2(space,tab) ?
2026-04-05 13:15:56 +01:00
Pádraig Brady
0adb7c6edd cut: avoid repeated searches for line_delim in the multi-byte delim case
TODO: Refactor all this into find_bytesearch_field_terminator.
Also handle in the delim_length==1 case.
2026-04-05 13:15:56 +01:00
Pádraig Brady
d9825aa9b1 cut: refactor all byte search to find_bytesearch_field_terminator
TODO: Perhaps also add search only fields mode
to avoid rescans of very long lines
2026-04-05 13:15:56 +01:00
Pádraig Brady
6250b59ef9 cut: optimize -f when finished processing fields for a line
TODO: simplify and compare perf
2026-04-05 13:15:56 +01:00
Pádraig Brady
352a396a16 cut: optimize -f for the common case of single byte delimiters
* TODO: perf comparison
2026-04-05 13:15:56 +01:00
Pádraig Brady
2a6b36ff5b cut: optimize -d '?' in UTF-8 case
ensure all ascii delims are processed with byte search in UTF-8
2026-04-05 13:15:56 +01:00
Pádraig Brady
f6b3055f74 cut: merge cut_fields and cut_fields_bytesearch
TODO: See why this is much slower:
time LC_ALL=C.UTF-8 src/cut -f1 -dc as.in > /dev/null
2026-04-05 13:15:56 +01:00
Pádraig Brady
b3ef6231bd cut: refactor -f to byte search and character processing
Not sure about this at all.
Only worthwhile if can also remove cut_fields_line_delim.

Refactored src/cut.c so the old UTF-8 byte-search path is now the
general byte-search field engine for all safe byte-search cases:
ordinary single-byte delimiters and valid UTF-8 delimiters. The old
cut_fields path did not go away completely; it is now
cut_fields_line_delim and is used only when the field delimiter equals
the record delimiter, because -d $'\n' and -z -d '' have different
semantics from normal line-based field splitting.

As part of that, I also folded the duplicated “start selected field”
logic into a shared helper, and renamed the byte-search helpers to match
their broader use. The current dispatcher in src/cut.c is now:
whitespace parser, then line-delimiter field mode, then byte-search
field mode, then the decoded multibyte parser.
2026-04-05 13:15:56 +01:00
Pádraig Brady
16b1ff40ae cut: fix 25% perf regression mentioned in previous change
* src/cut.c: Prefer strstr() over memmem() as the former is
optimized per platform on GLIBC, whereas the latter is only
optimized for S390 as of glibc-2.42.

TODO: look at merging cut_fields and cut_fields_mb_utf8
as they're both byte search
2026-04-05 13:15:56 +01:00
Pádraig Brady
26028fb2c6 cut: use bounded memory in utf8 mode when possible
TODO: See why a bit slower than old code

$ time src/cut.old -f1 -dç mb.in >/dev/null
real	0m0.136s
user	0m0.096s
sys	0m0.039s

$ time src/cut.new -f1 -dç mb.in >/dev/null
real	0m0.170s
user	0m0.139s
sys	0m0.030s
2026-04-05 13:15:56 +01:00
Pádraig Brady
ba7c1fbadb cut: add utf8 helper to mbbuf
* gl/lib/mbbuf.h: To safely search a bounded buffer,
without needing to have unbounded memory with getndelim2.

TODO: rename mbbuf_utf8_safe_prefix to mbbuf_fill_utf8
2026-04-05 13:15:56 +01:00
Pádraig Brady
801686242e cut: faster utf8 processing
TODO: improve to use bounded memory where possible
2026-04-05 13:15:56 +01:00
Pádraig Brady
a14ac29629 cut: support -F as an alias for -f -w -O ' '
To improve compatibility with toybox/busybox scripts.
2026-04-05 13:15:56 +01:00
Pádraig Brady
74f6692aa0 maint: cut: refactor buffered and ordinary field scanning
* src/cut.c: Merge scan_mb_field and read_mb_field_to_buffer
2026-04-05 13:15:56 +01:00
Pádraig Brady
eb1f057746 cut: support --whitespace-delimited=trimmed
Support ignoring leading and trailing whitespace.
E.g. this matches awk's default field splitting mode.

* src/cut.c
* tests/cut/cut.pl: Add test cases.
2026-04-05 13:15:56 +01:00
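The trimmed splitting mode can be pictured with the following sketch, which treats SPACE and TAB as blanks and ignores leading and trailing runs, as awk's default splitting does for these characters. The names is_blank and nth_trimmed_field are assumptions for the example and are not taken from src/cut.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool
is_blank (char c)
{
  return c == ' ' || c == '\t';
}

/* Hypothetical helper.  Find field N (1-based) of LINE[0..LEN), where
   fields are separated by runs of blanks and leading and trailing
   blanks are ignored.  On success store the field's offset and length
   in *START and *FIELD_LEN and return true; return false if the line
   has fewer than N fields.  */
static bool
nth_trimmed_field (char const *line, size_t len, size_t n,
                   size_t *start, size_t *field_len)
{
  assert (n >= 1);
  size_t i = 0;
  for (size_t field = 1; ; field++)
    {
      while (i < len && is_blank (line[i]))    /* skip a separator run */
        i++;
      if (i == len)
        return false;                          /* fewer than N fields */
      size_t begin = i;
      while (i < len && !is_blank (line[i]))   /* scan the field body */
        i++;
      if (field == n)
        {
          *start = begin;
          *field_len = i - begin;
          return true;
        }
    }
}
```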
Pádraig Brady
77ccacb9a7 cut: support -O as an alias for --output-delimiter
To improve compatibility with toybox/busybox scripts.

* doc/coreutils.texi (cut invocation): Add -O description.
* src/cut.c: Support -O as well as --output-delimiter
* tests/cut/cut.pl: Adjust one case to use -O.
2026-04-05 13:15:56 +01:00
Pádraig Brady
0ae17ffd99 doc: cut: adjust for multi-byte support
* doc/coreutils.texi (cut invocation): Remove the note about
-c being the same as -b.
2026-04-05 13:15:56 +01:00
Pádraig Brady
f644b4ca53 cut: refactor multi-byte updates
* src/cut.c: 160 fewer lines

Helpers extracted (replacing repeated inline patterns):
- write_line_delim(), write_pending_line_delim(), reset_item_line()
  - line boundary code used by cut_bytes{,_no_split}, cut_characters
- write_selected_item()
  - output-delimiter + write logic used by all three byte/char functions
- reset_field_line()
  - field line reset used by cut_fields_mb_any

Field functions unified via cut_fields_mb_any(stream, whitespace_mode):
- struct mbfield_parser encapsulates the whitespace vs.
  fixed-delimiter state (saved char, mode flag)
- mbfield_get_char() - dispatches to saved-char or direct read
- mbfield_terminator()
  - returns FIELD_{DATA,DELIMETER,LINE_DELIMITER} based on mode
- read_mb_field_to_buffer()
  - replaces the two duplicated first-field buffering loops
- scan_mb_field(mbbuf, parser, pending, write_field)
  - replaces the four duplicated field scan loops
  (print+skip × two modes) with a single function and a write_field bool
- cut_fields_mb and cut_fields_ws are now trivial wrappers
2026-04-05 13:15:56 +01:00
Pádraig Brady
57110d8bae cut: implement -n to avoid outputting partial characters
Both the i18n patch and FreeBSD/macOS support this option.
They do differ in behavior somewhat as the i18n patch
may output more bytes than requested.

  $ printf '\xc3\xa9b\n' | i18n-cut -n -b1
  é

There is also a bug in the i18n patch with multi-byte
at the start of a line:

  $ printf '\xc3\xa9b\n' | i18n-cut -n -b1-2
  éb

We follow the FreeBSD behavior since it seems more
useful to have -b be a hard limit, rather than a soft limit.
This also reduces the possibility of duplicate character output
with separate cut invocations with non-overlapping byte ranges.

* src/cut.c (cut_bytes_no_split): A new function
similar to cut_characters, to handle multi-byte characters
with byte limit semantics.
* tests/cut/cut.pl: Add test cases.
2026-04-05 13:15:56 +01:00
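The hard-limit semantics can be illustrated for the simple case of a range starting at byte 1. The sketch below assumes UTF-8 input and uses hypothetical names (utf8_seq_len, whole_char_prefix); it is not the cut_bytes_no_split implementation and does not cover ranges that start mid-line.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte C.
   Bytes that are not valid lead bytes are treated as single bytes.  */
static size_t
utf8_seq_len (unsigned char c)
{
  if (c < 0x80)
    return 1;
  if ((c & 0xE0) == 0xC0)
    return 2;
  if ((c & 0xF0) == 0xE0)
    return 3;
  if ((c & 0xF8) == 0xF0)
    return 4;
  return 1;
}

/* Hypothetical helper.  Return the number of leading bytes of
   BUF[0..LEN) that may be output for the byte range 1-LIMIT without
   splitting a character and without exceeding LIMIT bytes.  */
static size_t
whole_char_prefix (char const *buf, size_t len, size_t limit)
{
  assert (buf || len == 0);
  size_t used = 0;
  while (used < len)
    {
      size_t n = utf8_seq_len ((unsigned char) buf[used]);
      if (n > len - used)
        n = len - used;       /* truncated sequence at end of input */
      if (used + n > limit)
        break;                /* character would cross the byte limit */
      used += n;
    }
  return used;
}
```

With the input from the commit message, the bytes \xc3\xa9 followed by b, a limit of 1 selects nothing and a limit of 2 selects only the two bytes of é, so the output never exceeds the requested byte count.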
Pádraig Brady
caf1e91266 tests: cut: add a test for divergence from i18n patch
* tests/cut/cut.pl: We don't fall back to byte mode
upon invalid uni-byte delimiter.
2026-04-05 13:15:56 +01:00
Pádraig Brady
bed93d46f8 tests: cut: add case currently failing for coreutils-i18n patch
* tests/cut/cut.pl: Test for extraneous character output with:
printf 'aéb\n' | cut -s -d 'é' -f1 | od -tx1
2026-04-05 13:15:56 +01:00
Pádraig Brady
1cdc079860 tests: cut: check multi-byte output delimiter
* tests/cut/cut.pl: Add a test case.
2026-04-05 13:15:56 +01:00
Pádraig Brady
74047ec55e cut: adjust error message to be less specific
* src/cut.c (main): Cater for both misplaced -w and -d.
2026-04-05 13:15:56 +01:00
Pádraig Brady
19aa72b4ea cut: implement -w,--whitespace-delimited
* src/cut.c (cut_fields_ws): A new function handling both
uni-byte and multi-byte cases.
* tests/cut/cut.pl: Add test cases.
2026-04-05 13:15:56 +01:00
Pádraig Brady
32f1de5b4f cut: support single byte -d with GB18030 input
* src/cut.c
* tests/cut/mb-non-utf8.sh
* tests/local.mk
2026-04-05 13:15:55 +01:00
Pádraig Brady
94ddf45a60 cut: support single byte -d that may be part of multi-byte
Note this is a slight divergence from the i18n patch
as that switched to uni-byte for any single byte delimiter
that is not valid multi-byte.

That results in possibly splitting in the middle of
a valid multi-byte character.

Instead we only split on a single byte when it is
not part of a multi-byte character.

* src/cut.c
2026-04-05 13:15:55 +01:00
Pádraig Brady
a021b0b698 cut: support multi-byte field delimiters
* src/cut.c
* tests/cut/cut.pl
2026-04-05 13:15:55 +01:00
Pádraig Brady
97703386e6 cut: support multi-byte input with -c
* src/cut.c
* tests/cut/cut.pl
2026-04-05 13:15:55 +01:00
Pádraig Brady
fb78200249 maint: cut: refactor output calls
* src/cut.c (cut_fields): Refactor calls to fwrite() and putchar()
2026-04-05 13:15:55 +01:00
Pádraig Brady
e3c7dc2b03 tests: cut: ensure no unnecessary buffering
* tests/misc/write-errors.sh: Ensure we write output when possible.
2026-04-05 13:15:55 +01:00
Pádraig Brady
a04a1054f8 doc: cut: reorder --complement alphabetically in help
* src/cut.c (usage): Move placement of --complement description.
* doc/coreutils.texi (cut invocation): Likewise.
2026-04-05 13:15:55 +01:00
Pádraig Brady
1173ebb7c8 doc: cut: clarify description of -b and -c
* src/cut.c (usage): State the arguments are positions,
in case users may think they were values.
2026-04-05 13:15:55 +01:00
Pádraig Brady
1204b29bab build: update to latest gnulib
Pick up mbrto{c32,wc} optimizations on UTF-8 on GLIBC.
Note configure.ac defines the required GNULIB_WCHAR_SINGLE_LOCALE.
This speeds up wc -m by 2.6x, when processing non ASCII chars,
and will similarly speed up per character processing
in the impending cut multi-byte implementation.
* NEWS: Mention the wc -m speed improvement.
2026-04-05 13:11:21 +01:00
Collin Funk
38fc6bde64 basename: avoid duplicate strlen calls on the suffix
    $ ltrace -c ./src/basename-prev -s a $(seq 100000) > /dev/null
    % time     seconds  usecs/call     calls      function
    ------ ----------- ----------- --------- --------------------
     50.00   30.030316          75    400000 strlen
    [...]
    $ ltrace -c ./src/basename -s a $(seq 100000) > /dev/null
    % time     seconds  usecs/call     calls      function
    ------ ----------- ----------- --------- --------------------
     42.88   22.413953          74    300001 strlen
    [...]

* src/basename.c (remove_suffix, perform_basename): Add a length
argument for the suffix and use it instead of strlen.
(main): Calculate the suffix length. Refactor code to avoid calling
perform_basename in multiple places.
2026-04-04 14:55:56 -07:00
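The idea of passing the suffix length in, rather than recomputing it for every operand, can be sketched as below. This is a simplified illustration under assumed semantics (the suffix is removed only when the name ends with it and is not identical to it); it is not the actual remove_suffix from src/basename.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified sketch.  Remove SUFFIX, whose length the caller has
   already computed as SUFFIX_LEN, from the end of NAME in place.
   Leave NAME alone if it does not end with SUFFIX or equals SUFFIX.  */
static void
remove_suffix (char *name, char const *suffix, size_t suffix_len)
{
  assert (name && suffix);
  size_t name_len = strlen (name);
  if (suffix_len < name_len
      && memcmp (name + name_len - suffix_len, suffix, suffix_len) == 0)
    name[name_len - suffix_len] = '\0';
}
```

A caller would compute strlen (suffix) once, before looping over the operands, and pass the result to each call.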
Paul Eggert
b64f9bfe4f date: simplify -u by not calling putenv
* src/date.c (TZSET): Remove; no longer needed.
(main): Simplify -u’s implementation by passing "UTC0" to tzalloc,
rather than by setting TZ in the environment and then calling getenv.
The old way of doing things dates back to before we had tzalloc.
* configure.ac (LOCALTIME_CACHE): Remove; no longer needed.
2026-04-02 18:54:35 -07:00
Paul Eggert
bb51268465 build: update gnulib submodule to latest
2026-04-01 14:48:25 -07:00
Paul Eggert
3fb7dc8e28 maint: avoid sigaction lock overhead
* configure.ac (GNULIB_SIGACTION_SINGLE_THREAD):
Define to avoid unnecessary locking in Gnulib sigaction.  See:
https://lists.gnu.org/r/bug-gnulib/2026-04/msg00008.html
2026-04-01 14:48:25 -07:00
Paul Eggert
8a6cb56817 maint: avoid Gnulib modules mbiter, mbiterf
* bootstrap.conf (avoided_gnulib_modules): Avoid mbiter and
mbiterf, for the same reason we avoid mbuiter and mbuiterf: these
modules are not needed because (due to mcel-prefer) we use mcel in
preference to mbiter/mbiterf/mbuiter/mbuiterf.
2026-04-01 12:32:47 -07:00
Paul Eggert
afe3ce9cd6 build: update gnulib submodule to latest
2026-04-01 12:32:47 -07:00
oech3
3558839bdd tests: dd: ensure memory exhaustion is handled gracefully
* tests/dd/no-allocate.sh: Ensure we exit 1 upon mem allocation failure.
Also check other buffer size edge cases.
https://github.com/uutils/coreutils/issues/11436
https://github.com/uutils/coreutils/issues/11580
https://github.com/coreutils/coreutils/pull/235
2026-04-01 16:54:12 +01:00
Pádraig Brady
178c48154d tests: dd: avoid false failure with no controlling terminal
* tests/dd/misc.sh: test -w /dev/tty is not a strong enough check,
we need to actually open /dev/tty to ensure it's available.
It's not available under setsid for example.
2026-04-01 13:31:57 +01:00
oech3
ee5092971e tests: dd: check that erroneous seeks are not done in output
* tests/dd/misc.sh: Add test case for of=/dev/tty.
The same occurs for /dev/stdout, but that varies
in the test harness so is best avoided.
https://github.com/coreutils/coreutils/pull/234
2026-03-31 12:09:26 +01:00
oech3
368bfc7cb0 tests: coreutils: ensure empty arg is diagnosed
* tests/misc/coreutils.sh: Add a test case.
https://github.com/coreutils/coreutils/pull/232
2026-03-30 13:43:58 +01:00
Collin Funk
4cd0644472 date: avoid calling putenv multiple times unnecessarily
Adding environment variables can become quite expensive in some
admittedly unlikely situations.

    $ for i in $(seq 10000); do export A$i=A$i; done
    $ time ./src/date-prev -u $(yes -- -u | head -n 100000)
    Sun Mar 29 01:59:49 AM UTC 2026

    real	0m3.753s
    user	0m3.684s
    sys	0m0.050s
    $ time ./src/date -u $(yes -- -u | head -n 100000)
    Sun Mar 29 02:00:00 AM UTC 2026

    real	0m0.061s
    user	0m0.022s
    sys	0m0.045s

* src/date.c (main): Only add TZ=UTC0 to the environment once.
2026-03-28 18:57:49 -07:00