coreutils

mirror of git://git.sv.gnu.org/coreutils.git synced 2026-04-12 06:57:33 +02:00

Author	SHA1	Message	Date
Pádraig Brady	a16d56d60c	cut: optimize -b by avoiding per byte iteration Always memchr(line_delim) which is fast and allows: - skipping whole segments when the next selected byte is beyond them - skipping unselected prefixes in bulk - writing contiguous selected spans in bulk This wins for lines >= 4 characters, but is slower for lines <= 3 characters, especially if selecting bytes 1-3. That is unusual though.	2026-04-05 13:15:56 +01:00
Pádraig Brady	ea6a7ba547	cut: optimize when no delimiter in input This is about 20x faster. Note we only do the delimiter search once per chunk, and it's usually quick as delimiters wouldn't be too far into a chunk if present, so we don't bother to cache the found delimiter.	2026-04-05 13:15:56 +01:00
Pádraig Brady	7d017f83bc	tests: cut: ensure multi-byte delimiter is rejected in uni-byte locales tests/cut/cut.pl: Check the appropriate diagnostic is presented.	2026-04-05 13:15:56 +01:00
Pádraig Brady	700ffc51a1	cut: optimize -w for uni-byte case * src/cut.c: Limit search to SPACE and TAB	2026-04-05 13:15:56 +01:00
Pádraig Brady	f5b7d38d13	doc: cut: reorder -s in texi Keep in alphabetical order.	2026-04-05 13:15:56 +01:00
Pádraig Brady	c1d7b492c6	doc: cut: document the -w option * src/cut.c (usage): Mention blank characters are used to separate. * doc/coreutils.texi (cut invocation): Likewise. Also describe the 'trimmed' argument and the relation to -F.	2026-04-05 13:15:56 +01:00
Pádraig Brady	cf25ef286a	cut: refactor find_bytesearch_field_terminator to be stateful Allows better/simpler avoidance of repeated line/delim scans TODO: speed up our really slow cut_fields_mb_any. Compare for example: time src/cut -w -f1 ll.in >/dev/null #14s time src/cut -d, -f1 ll.in >/dev/null #.1s Could adjust so that LC_ALL=C does memchr2(space,tab) ?	2026-04-05 13:15:56 +01:00
Pádraig Brady	0adb7c6edd	cut: avoid repeated searches for line_delim in the multi-byte delim case TODO: Refactor all this into find_bytesearch_field_terminator. Also handle in the delim_length==1 case.	2026-04-05 13:15:56 +01:00
Pádraig Brady	d9825aa9b1	cut: refactor all byte search to find_bytesearch_field_terminator TODO: Perhaps also add search only fields mode to avoid rescans of very long lines	2026-04-05 13:15:56 +01:00
Pádraig Brady	6250b59ef9	cut: optimize -f when finished processing fields for a line TODO: simplify and compare perf	2026-04-05 13:15:56 +01:00
Pádraig Brady	352a396a16	cut: optimize -f for the common case of single byte delimiters * TODO: perf comparison	2026-04-05 13:15:56 +01:00
Pádraig Brady	2a6b36ff5b	cut: optimize -d '?' in UTF-8 case ensure all ascii delims are processed with byte search in UTF-8	2026-04-05 13:15:56 +01:00
Pádraig Brady	f6b3055f74	cut: merge cut_fields and cut_fields_bytesearch TODO: See why this is much slower: time LC_ALL=C.UTF-8 src/cut -f1 -dc as.in > /dev/null	2026-04-05 13:15:56 +01:00
Pádraig Brady	b3ef6231bd	cut: refactor -f to byte search and character processing Not sure about this at all. Only worthwhile if can also remove cut_fields_line_delim. Refactored src/cut.c so the old UTF-8 byte-search path is now the general byte-search field engine for all safe byte-search cases: ordinary single-byte delimiters and valid UTF-8 delimiters. The old cut_fields path did not go away completely; it is now cut_fields_line_delim and is used only when the field delimiter equals the record delimiter, because -d $'\n' and -z -d '' have different semantics from normal line-based field splitting. As part of that, I also folded the duplicated “start selected field” logic into a shared helper, and renamed the byte-search helpers to match their broader use. The current dispatcher in src/cut.c is now: whitespace parser, then line- delimiter field mode, then byte-search field mode, then the decoded multibyte parser.	2026-04-05 13:15:56 +01:00
Pádraig Brady	16b1ff40ae	cut: fix 25% perf regression mentioned in previous change * src/cut.c: Prefer strstr() over memmem() as the former is optimized per platform on GLIBC, whereas the latter is only optimized for S390 as of glibc-2.42 TODO: look at merging cut_fields and cut_fields_mb_utf8 as they're both byte search	2026-04-05 13:15:56 +01:00
Pádraig Brady	26028fb2c6	cut: use bounded memory in utf8 mode when possible TODO: See why a bit slower than old code $ time src/cut.old -f1 -dç mb.in >/dev/null real 0m0.136s user 0m0.096s sys 0m0.039s $ time src/cut.new -f1 -dç mb.in >/dev/null real 0m0.170s user 0m0.139s sys 0m0.030s	2026-04-05 13:15:56 +01:00
Pádraig Brady	ba7c1fbadb	cut: add utf8 helper to mbbuf * gl/lib/mbbuf.h: To safely search a bounded buffer, without needing unbounded memory with getndelim2. TODO: rename mbbuf_utf8_safe_prefix to mbbuf_fill_utf8	2026-04-05 13:15:56 +01:00
Pádraig Brady	801686242e	cut: faster utf8 processing TODO: improve to use bounded memory where possible	2026-04-05 13:15:56 +01:00
Pádraig Brady	a14ac29629	cut: support -F as an alias for -f -w -O ' ' To improve compatibility with toybox/busybox scripts.	2026-04-05 13:15:56 +01:00
Pádraig Brady	74f6692aa0	maint: cut: refactor buffered and ordinary field scanning * src/cut.c: Merge scan_mb_field and read_mb_field_to_buffer	2026-04-05 13:15:56 +01:00
Pádraig Brady	eb1f057746	cut: support --whitespace-delimited=trimmed Support ignoring leading and trailing whitespace. E.g. this matches awk's default field splitting mode. * src/cut.c * tests/cut/cut.pl: Add test cases.	2026-04-05 13:15:56 +01:00
Pádraig Brady	77ccacb9a7	cut: support -O as an alias for --output-delimiter To improve compatibility with toybox/busybox scripts. * doc/coreutils.texi (cut invocation): Add -O description. * src/cut.c: Support -O as well as --output-delimiter * tests/cut/cut.pl: Adjust one case to use -O.	2026-04-05 13:15:56 +01:00
Pádraig Brady	0ae17ffd99	doc: cut: adjust for multi-byte support * doc/coreutils.texi (cut invocation): Remove the note about -c being the same as -b.	2026-04-05 13:15:56 +01:00
Pádraig Brady	f644b4ca53	cut: refactor multi-byte updates * src/cut.c: 160 fewer lines Helpers extracted (replacing repeated inline patterns): - write_line_delim(), write_pending_line_delim(), reset_item_line() - line boundary code used by cut_bytes{,no_split}, cut_characters - write_selected_item() - output-delimiter + write logic used by all three byte/char functions - reset_field_line() - field line reset used by cut_fields_mb_any Field functions unified via cut_fields_mb_any(stream, whitespace_mode): - struct mbfield_parser encapsulates the whitespace vs. fixed-delimiter state (saved char, mode flag) - mbfield_get_char() - dispatches to saved-char or direct read - mbfield_terminator() - returns FIELD_{DATA,DELIMETER,LINE_DELIMITER} based on mode - read_mb_field_to_buffer() - replaces the two duplicated first-field buffering loops - scan_mb_field(mbbuf, parser, pending, write_field) - replaces the four duplicated field scan loops (print+skip × two modes) with a single function and a write_field bool - cut_fields_mb and cut_fields_ws are now trivial wrappers	2026-04-05 13:15:56 +01:00
Pádraig Brady	57110d8bae	cut: implement -n to avoid outputting partial characters Both the i18n patch and FreeBSD/macOS support this option. They do differ in behavior somewhat as the i18n patch may output more bytes than requested. $ printf '\xc3\xa9b\n' | i18n-cut -n -b1 é There is also a bug in the i18n patch with multi-byte at the start of a line: $ printf '\xc3\xa9b\n' | i18n-cut -n -b1-2 éb We follow the FreeBSD behavior since it seems more useful to have -b be a hard limit, rather than a soft limit. This also reduces the possibility of duplicate character output with separate cut invocations with non overlapping byte ranges. * src/cut.c (cut_bytes_no_split): A new function similar to cut_characters, to handle multi-byte characters with byte limit semantics. * tests/cut/cut.pl: Add test cases.	2026-04-05 13:15:56 +01:00
Pádraig Brady	caf1e91266	tests: cut: add a test for divergence from i18n patch * tests/cut/cut.pl: We don't fall back to byte mode upon invalid uni-byte delimiter.	2026-04-05 13:15:56 +01:00
Pádraig Brady	bed93d46f8	tests: cut: add case currently failing for coreutils-i18n patch * tests/cut/cut.pl: Test for extraneous character output with: printf 'aéb\n' | cut -s -d 'é' -f1 | od -tx1	2026-04-05 13:15:56 +01:00
Pádraig Brady	1cdc079860	tests: cut: check multi-byte output delimiter * tests/cut/cut.pl: Add a test case.	2026-04-05 13:15:56 +01:00
Pádraig Brady	74047ec55e	cut: adjust error message to be less specific * src/cut.c (main): Cater for both misplaced -w and -d.	2026-04-05 13:15:56 +01:00
Pádraig Brady	19aa72b4ea	cut: implement -w,--whitespace-delimited * src/cut.c (cut_fields_ws): A new function handling both uni-byte and multi-byte cases. * tests/cut/cut.pl: Add test cases.	2026-04-05 13:15:56 +01:00
Pádraig Brady	32f1de5b4f	cut: support single byte -d with GB18030 input * src/cut.c * tests/cut/mb-non-utf8.sh * tests/local.mk	2026-04-05 13:15:55 +01:00
Pádraig Brady	94ddf45a60	cut: support single byte -d that may be part of multi-byte Note this is a slight divergence from the i18n patch as that switched to uni-byte for any single byte delimiter that is not valid multi-byte. That results in possibly splitting in the middle of a valid multi-byte character. Instead we only split on a single byte when it's not part of a multi-byte character. * src/cut.c	2026-04-05 13:15:55 +01:00
Pádraig Brady	a021b0b698	cut: support multi-byte field delimiters * src/cut.c * tests/cut/cut.pl	2026-04-05 13:15:55 +01:00
Pádraig Brady	97703386e6	cut: support multi-byte input with -c * src/cut.c * tests/cut/cut.pl	2026-04-05 13:15:55 +01:00
Pádraig Brady	fb78200249	maint: cut: refactor output calls * src/cut.c (cut_fields): Refactor calls to fwrite() and putchar()	2026-04-05 13:15:55 +01:00
Pádraig Brady	e3c7dc2b03	tests: cut: ensure no unnecessary buffering * tests/misc/write-errors.sh: Ensure we write output when possible.	2026-04-05 13:15:55 +01:00
Pádraig Brady	a04a1054f8	doc: cut: reorder --complement alphabetically in help * src/cut.c (usage): Move placement of --complement description. * doc/coreutils.texi (cut invocation): Likewise.	2026-04-05 13:15:55 +01:00
Pádraig Brady	1173ebb7c8	doc: cut: clarify description of -b and -c * src/cut.c (usage): State the arguments are positions, in case users may think they were values.	2026-04-05 13:15:55 +01:00
Pádraig Brady	1204b29bab	build: update to latest gnulib Pick up mbrto{c32,wc} optimizations on UTF-8 on GLIBC. Note configure.ac defines the required GNULIB_WCHAR_SINGLE_LOCALE. This speeds up wc -m by 2.6x, when processing non ASCII chars, and will similarly speed up per character processing in the impending cut multi-byte implementation. * NEWS: Mention the wc -m speed improvement.	2026-04-05 13:11:21 +01:00
Collin Funk	38fc6bde64	basename: avoid duplicate strlen calls on the suffix $ ltrace -c ./src/basename-prev -s a $(seq 100000) > /dev/null % time seconds usecs/call calls function ------ ----------- ----------- --------- -------------------- 50.00 30.030316 75 400000 strlen [...] $ ltrace -c ./src/basename -s a $(seq 100000) > /dev/null % time seconds usecs/call calls function ------ ----------- ----------- --------- -------------------- 42.88 22.413953 74 300001 strlen [...] * src/basename.c (remove_suffix, perform_basename): Add a length argument for the suffix and use it instead of strlen. (main): Calculate the suffix length. Refactor code to avoid calling perform_basename in multiple places.	2026-04-04 14:55:56 -07:00
Paul Eggert	b64f9bfe4f	date: simplify -u by not calling putenv * src/date.c (TZSET): Remove; no longer needed. (main): Simplify -u’s implementation by passing "UTC0" to tzalloc, rather than by setting TZ in the environment and then calling getenv. The old way of doing things dates back to before we had tzalloc. * configure.ac (LOCALTIME_CACHE): Remove; no longer needed.	2026-04-02 18:54:35 -07:00
Paul Eggert	bb51268465	build: update gnulib submodule to latest	2026-04-01 14:48:25 -07:00
Paul Eggert	3fb7dc8e28	maint: avoid sigaction lock overhead * configure.ac (GNULIB_SIGACTION_SINGLE_THREAD): Define to avoid unnecessary locking in Gnulib sigaction. See: https://lists.gnu.org/r/bug-gnulib/2026-04/msg00008.html	2026-04-01 14:48:25 -07:00
Paul Eggert	8a6cb56817	maint: avoid Gnulib modules mbiter, mbiterf * bootstrap.conf (avoided_gnulib_modules): Avoid mbiter and mbiterf, for the same reason we avoid mbuiter and mbuiterf: these modules are not needed because (due to mcel-prefer) we use mcel in preference to mbiter/mbiterf/mbuiter/mbuiterf.	2026-04-01 12:32:47 -07:00
Paul Eggert	afe3ce9cd6	build: update gnulib submodule to latest	2026-04-01 12:32:47 -07:00
oech3	3558839bdd	tests: dd: ensure memory exhaustion is handled gracefully * tests/dd/no-allocate.sh: Ensure we exit 1 upon mem allocation failure. Also check other buffer size edge cases. https://github.com/uutils/coreutils/issues/11436 https://github.com/uutils/coreutils/issues/11580 https://github.com/coreutils/coreutils/pull/235	2026-04-01 16:54:12 +01:00
Pádraig Brady	178c48154d	tests: dd: avoid false failure with no controlling terminal * tests/dd/misc.sh: test -w /dev/tty is not a strong enough check, we need to actually open /dev/tty to ensure it's available. It's not available under setsid for example.	2026-04-01 13:31:57 +01:00
oech3	ee5092971e	tests: dd: check that erroneous seeks are not done in output * tests/dd/misc.sh: Add test case for of=/dev/tty. The same occurs for /dev/stdout, but that varies in the test harness so is best avoided. https://github.com/coreutils/coreutils/pull/234	2026-03-31 12:09:26 +01:00
oech3	368bfc7cb0	tests: coreutils: ensure empty arg is diagnosed * tests/misc/coreutils.sh: Add a test case. https://github.com/coreutils/coreutils/pull/232	2026-03-30 13:43:58 +01:00
Collin Funk	4cd0644472	date: avoid calling putenv multiple times unnecessarily Adding environment variables can become quite expensive in some admittedly unlikely situations. $ for i in $(seq 10000); do export A$i=A$i; done $ time ./src/date-prev -u $(yes -- -u | head -n 100000) Sun Mar 29 01:59:49 AM UTC 2026 real 0m3.753s user 0m3.684s sys 0m0.050s $ time ./src/date -u $(yes -- -u | head -n 100000) Sun Mar 29 02:00:00 AM UTC 2026 real 0m0.061s user 0m0.022s sys 0m0.045s * src/date.c (main): Only add TZ=UTC0 to the environment once.	2026-03-28 18:57:49 -07:00


31368 Commits