Always memchr(line_delim) which is fast and allows:
- skipping whole segments when the next selected byte is beyond them
- skipping unselected prefixes in bulk
- writing contiguous selected spans in bulk
This wins for lines >= 4 characters,
but is slower for lines <= 3 characters, especially when selecting bytes 1-3.
That is unusual though.
This is about 20x faster.
Note we only do the delimiter search once per chunk,
and it's usually quick as delimiters wouldn't be too far
into a chunk if present, so we don't bother
to cache the found delimiter.
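
A minimal sketch of the memchr-driven approach described above, for a
single byte range only. The helper name select_byte_range and its
buffer-to-buffer interface are assumptions made for illustration; they
are not taken from src/cut.c, which works on streams and arbitrary
range lists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy bytes LO..HI (1-based, inclusive) of each
   line in BUF[0..LEN) to OUT, each followed by LINE_DELIM.
   OUT must have room for LEN + 1 bytes.  Return the bytes written.  */
static size_t
select_byte_range (char const *buf, size_t len, char line_delim,
                   size_t lo, size_t hi, char *out)
{
  assert (1 <= lo && lo <= hi);
  size_t n_out = 0;
  char const *p = buf;
  char const *end = buf + len;

  while (p < end)
    {
      /* One memchr per line finds the whole segment at once.  */
      char const *eol = memchr (p, line_delim, end - p);
      size_t line_len = (eol ? eol : end) - p;

      /* If the first selected byte is beyond this line, skip the line
         whole.  Otherwise skip the unselected prefix in bulk and copy
         the contiguous selected span with a single memcpy.  */
      if (lo <= line_len)
        {
          size_t span = (hi < line_len ? hi : line_len) - (lo - 1);
          memcpy (out + n_out, p + (lo - 1), span);
          n_out += span;
        }
      out[n_out++] = line_delim;

      /* Step over the line and its delimiter, if one was found.  */
      p += line_len + (eol != NULL);
    }
  return n_out;
}
```

For very short lines the fixed cost of the memchr call dominates, which
is consistent with the slowdown noted above for lines <= 3 characters.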
* src/cut.c (usage): Mention blank characters are used to separate.
* doc/coreutils.texi (cut invocation): Likewise. Also describe
the 'trimmed' argument and the relation to -F.
Allows better/simpler avoidance of repeated line/delim scans
TODO: speed up our really slow cut_fields_mb_any.
Compare for example:
time src/cut -w -f1 ll.in >/dev/null #14s
time src/cut -d, -f1 ll.in >/dev/null #.1s
Could adjust so that LC_ALL=C does memchr2(space,tab) ?
Not sure about this at all.
Only worthwhile if we can also remove cut_fields_line_delim.
Refactored src/cut.c so the old UTF-8 byte-search path is now the
general byte-search field engine for all safe byte-search cases:
ordinary single-byte delimiters and valid UTF-8 delimiters. The old
cut_fields path did not go away completely; it is now
cut_fields_line_delim and is used only when the field delimiter equals
the record delimiter, because -d $'\n' and -z -d '' have different
semantics from normal line-based field splitting.
As part of that, I also folded the duplicated “start selected field”
logic into a shared helper, and renamed the byte-search helpers to match
their broader use. The current dispatcher in src/cut.c is now:
whitespace parser, then line-delimiter field mode, then byte-search
field mode, then the decoded multibyte parser.
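
The dispatch order above could be sketched roughly as follows. The
enum, the struct and the function name are all hypothetical, invented
here to show the ordering only; the real dispatcher in src/cut.c
selects between functions rather than returning an enum.

```c
#include <stdbool.h>

/* Hypothetical labels for the four field engines named above.  */
enum field_engine
{
  ENGINE_WHITESPACE,   /* whitespace parser */
  ENGINE_LINE_DELIM,   /* field delimiter equals record delimiter */
  ENGINE_BYTE_SEARCH,  /* single-byte or valid UTF-8 delimiter */
  ENGINE_MB_DECODE     /* decoded multibyte parser */
};

/* Hypothetical summary of the options that drive the choice.  */
struct field_config
{
  bool whitespace_delimited;
  bool delim_equals_line_delim;
  bool byte_search_safe;
};

/* Check the cases in the order the commit message lists them,
   falling back to the decoding parser last.  */
static enum field_engine
choose_field_engine (struct field_config const *c)
{
  if (c->whitespace_delimited)
    return ENGINE_WHITESPACE;
  if (c->delim_equals_line_delim)
    return ENGINE_LINE_DELIM;
  if (c->byte_search_safe)
    return ENGINE_BYTE_SEARCH;
  return ENGINE_MB_DECODE;
}
```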
* src/cut.c: Prefer strstr() over memmem() as the former is
optimized per platform on GLIBC, whereas the latter is only
optimized for S390 as of glibc-2.42.
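
As an illustration of the trade-off, here is a minimal sketch of a
delimiter search done with strstr(). The helper first_field_len is
hypothetical. Note the assumption this imposes: strstr() needs a
NUL-terminated haystack and stops at the first NUL byte, so the caller
must terminate the buffer and handle embedded NULs some other way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the length in bytes of the first field
   of the NUL-terminated LINE, where fields are separated by the
   non-empty, NUL-terminated byte string DELIM.  */
static size_t
first_field_len (char const *line, char const *delim)
{
  assert (*delim);
  char const *hit = strstr (line, delim);
  return hit ? (size_t) (hit - line) : strlen (line);
}
```

A plain byte search like this is safe for a valid UTF-8 delimiter in
valid UTF-8 text, because UTF-8 is self-synchronizing: an encoded
character cannot match starting in the middle of another one.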
TODO: look at merging cut_fields and cut_fields_mb_utf8
as they're both byte search
TODO: See why a bit slower than old code
$ time src/cut.old -f1 -dç mb.in >/dev/null
real 0m0.136s
user 0m0.096s
sys 0m0.039s
$ time src/cut.new -f1 -dç mb.in >/dev/null
real 0m0.170s
user 0m0.139s
sys 0m0.030s
* gl/lib/mbbuf.h: To safely search a bounded buffer,
without needing unbounded memory as with getndelim2.
TODO: rename mbbuf_utf8_safe_prefix to mbbuf_fill_utf8
Support ignoring leading and trailing whitespace.
E.g. this matches awk's default field splitting mode.
* src/cut.c
* tests/cut/cut.pl: Add test cases.
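
A minimal sketch of awk-style splitting, where runs of blanks separate
fields and leading and trailing blanks are ignored. The helper
nth_blank_field is hypothetical, and treating only space and tab as
blank is an assumption of this sketch (the C locale case); it is not
the parser used in src/cut.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Space and tab only: an assumption that holds in the C locale.  */
static bool
is_blank (char c)
{
  return c == ' ' || c == '\t';
}

/* Hypothetical helper: locate field N (1-based) of LINE[0..LEN).
   On success store its start and length and return true.
   Return false if the line has fewer than N fields.  */
static bool
nth_blank_field (char const *line, size_t len, size_t n,
                 char const **start, size_t *flen)
{
  assert (1 <= n);
  size_t i = 0;
  for (size_t field = 1; ; field++)
    {
      /* Skip the run of blanks before the field.  For the first
         field this discards leading blanks; a run reaching the end
         of the line is trailing whitespace and yields no field.  */
      while (i < len && is_blank (line[i]))
        i++;
      if (i == len)
        return false;

      size_t begin = i;
      while (i < len && !is_blank (line[i]))
        i++;
      if (field == n)
        {
          *start = line + begin;
          *flen = i - begin;
          return true;
        }
    }
}
```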
To improve compatibility with toybox/busybox scripts.
* doc/coreutils.texi (cut invocation): Add -O description.
* src/cut.c: Support -O as well as --output-delimiter.
* tests/cut/cut.pl: Adjust one case to use -O.
* src/cut.c: 160 fewer lines
Helpers extracted (replacing repeated inline patterns):
- write_line_delim(), write_pending_line_delim(), reset_item_line()
- line boundary code used by cut_bytes{,_no_split}, cut_characters
- write_selected_item()
- output-delimiter + write logic used by all three byte/char functions
- reset_field_line()
- field line reset used by cut_fields_mb_any
Field functions unified via cut_fields_mb_any(stream, whitespace_mode):
- struct mbfield_parser encapsulates the whitespace vs.
fixed-delimiter state (saved char, mode flag)
- mbfield_get_char() - dispatches to saved-char or direct read
- mbfield_terminator()
- returns FIELD_{DATA,DELIMITER,LINE_DELIMITER} based on mode
- read_mb_field_to_buffer()
- replaces the two duplicated first-field buffering loops
- scan_mb_field(mbbuf, parser, pending, write_field)
- replaces the four duplicated field scan loops
(print+skip × two modes) with a single function and a write_field bool
- cut_fields_mb and cut_fields_ws are now trivial wrappers
Both the i18n patch and FreeBSD/macOS support this option.
They do differ in behavior somewhat as the i18n patch
may output more bytes than requested.
$ printf '\xc3\xa9b\n' | i18n-cut -n -b1
é
There is also a bug in the i18n patch with a multi-byte
character at the start of a line:
$ printf '\xc3\xa9b\n' | i18n-cut -n -b1-2
éb
We follow the FreeBSD behavior since it seems more
useful to have -b be a hard limit, rather than a soft limit.
This also reduces the possibility of duplicate character output
with separate cut invocations with non-overlapping byte ranges.
* src/cut.c (cut_bytes_no_split): A new function
similar to cut_characters, to handle multi-byte characters
with byte limit semantics.
* tests/cut/cut.pl: Add test cases.
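
One way to read the "hard limit" semantics is sketched below, for a
single byte range over one line. This is an assumption-laden
illustration, not cut_bytes_no_split itself: the helper names are
hypothetical, the input is assumed to be valid UTF-8, and the rule
"emit a character only if all of its bytes fall inside the range" is
an interpretation of the text above that may differ in detail from
the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 character whose lead byte is C.
   Assumes valid UTF-8; no validation is done in this sketch.  */
static size_t
utf8_len (unsigned char c)
{
  return c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
}

/* Hypothetical helper: copy to OUT each character of LINE[0..LEN)
   whose bytes all lie within byte positions LO..HI (1-based,
   inclusive).  OUT needs room for LEN bytes.  Return bytes written,
   which never exceeds HI - LO + 1.  */
static size_t
select_whole_chars (char const *line, size_t len,
                    size_t lo, size_t hi, char *out)
{
  assert (1 <= lo && lo <= hi);
  size_t n_out = 0;
  for (size_t i = 0; i < len; )
    {
      size_t clen = utf8_len ((unsigned char) line[i]);
      if (len - i < clen)
        clen = len - i;  /* Truncated final character.  */

      /* The character occupies byte positions i+1 .. i+clen.  */
      if (lo <= i + 1 && i + clen <= hi)
        {
          memcpy (out + n_out, line + i, clen);
          n_out += clen;
        }
      i += clen;
    }
  return n_out;
}
```

Under this reading, selecting byte 1 of the two-byte character from
the example above emits nothing, and selecting bytes 1-2 emits only
that character and not the following 'b'.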
Note this is a slight divergence from the i18n patch
as that switched to uni-byte for any single byte delimiter
that is not valid multi-byte.
That results in possibly splitting in the middle of
a valid multi-byte character.
Instead we only split on a single byte when it is
not part of a multi-byte character.
* src/cut.c
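
The difference can be sketched as follows. Both helper names are
hypothetical, and the hand-rolled check below only looks at the
lead/continuation bit patterns of UTF-8; it does not reject overlong
or out-of-range sequences, and a real implementation would rely on
the locale's decoder instead.

```c
#include <stddef.h>

/* Return the length of the UTF-8 sequence starting at P, with AVAIL
   bytes available, or 1 if P does not start a well-formed sequence
   (so a stray byte counts as a single unit).  Simplified check.  */
static size_t
seq_len (unsigned char const *p, size_t avail)
{
  size_t n = (*p < 0x80 ? 1
              : (*p & 0xE0) == 0xC0 ? 2
              : (*p & 0xF0) == 0xE0 ? 3
              : (*p & 0xF8) == 0xF0 ? 4 : 0);
  if (n == 0 || avail < n)
    return 1;
  for (size_t k = 1; k < n; k++)
    if ((p[k] & 0xC0) != 0x80)
      return 1;
  return n;
}

/* Hypothetical helper: find the first occurrence of the byte DELIM
   in BUF[0..LEN) that is not inside a multi-byte character.
   Return a pointer to it, or NULL if there is none.  */
static char const *
find_standalone_byte (char const *buf, size_t len, unsigned char delim)
{
  unsigned char const *p = (unsigned char const *) buf;
  for (size_t i = 0; i < len; )
    {
      size_t n = seq_len (p + i, len - i);
      if (n == 1 && p[i] == delim)
        return buf + i;
      i += n;  /* Step over the whole character.  */
    }
  return NULL;
}
```

With the delimiter byte 0xA9 and the input bytes C3 A9 'x' A9 'y',
this matches the standalone A9 at offset 3, whereas a plain memchr
would split the character C3 A9 at offset 1.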
Pick up mbrto{c32,wc} optimizations on UTF-8 on GLIBC.
Note configure.ac defines the required GNULIB_WCHAR_SINGLE_LOCALE.
This speeds up wc -m by 2.6x, when processing non ASCII chars,
and will similarly speed up per character processing
in the impending cut multi-byte implementation.
* NEWS: Mention the wc -m speed improvement.
* src/date.c (TZSET): Remove; no longer needed.
(main): Simplify -u’s implementation by passing "UTC0" to tzalloc,
rather than by setting TZ in the environment and then calling getenv.
The old way of doing things dates back to before we had tzalloc.
* configure.ac (LOCALTIME_CACHE): Remove; no longer needed.
* bootstrap.conf (avoided_gnulib_modules): Avoid mbiter and
mbiterf, for the same reason we avoid mbuiter and mbuiterf: these
modules are not needed because (due to mcel-prefer) we use mcel in
preference to mbiter/mbiterf/mbuiter/mbuiterf.
* tests/dd/misc.sh: test -w /dev/tty is not a strong enough check;
we need to actually open /dev/tty to ensure it's available.
It's not available under setsid for example.
* tests/dd/misc.sh: Add test case for of=/dev/tty.
The same occurs for /dev/stdout, but that varies
in the test harness so is best avoided.
https://github.com/coreutils/coreutils/pull/234
Adding environment variables can become quite expensive in some
admittedly unlikely situations.
$ for i in $(seq 10000); do export A$i=A$i; done
$ time ./src/date-prev -u $(yes -- -u | head -n 100000)
Sun Mar 29 01:59:49 AM UTC 2026
real 0m3.753s
user 0m3.684s
sys 0m0.050s
$ time ./src/date -u $(yes -- -u | head -n 100000)
Sun Mar 29 02:00:00 AM UTC 2026
real 0m0.061s
user 0m0.022s
sys 0m0.045s
* src/date.c (main): Only add TZ=UTC0 to the environment once.
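
The idea can be sketched as below. This is an illustration under
stated assumptions, not the code in src/date.c: the function
request_utc and the counter n_env_updates are hypothetical, and the
sketch uses POSIX setenv() without claiming that date does the same.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Instrumentation for this sketch only: counts environment writes.  */
static int n_env_updates;

/* Hypothetical handler called once per -u on the command line.
   Only the first call touches the environment; repeated calls are
   cheap no matter how many variables the environment holds.
   Return false if the environment could not be updated.  */
static bool
request_utc (void)
{
  static bool utc_set;
  if (utc_set)
    return true;
  if (setenv ("TZ", "UTC0", 1) != 0)
    return false;
  n_env_updates++;
  utc_set = true;
  return true;
}
```

Each environment update may have to scan the existing variables, so
with 10000 variables and 100000 repeated options the unguarded
version does that scan 100000 times, which matches the slowdown
timed above.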