75 Commits

Author SHA1 Message Date
Simon Wanner
09f2d79cb1 LibTextCodec: Bring TextCodec::get_standardized_encoding closer to spec 2024-06-04 10:21:07 +02:00
Simon Wanner
11bb216912 LibTextCodec: Add replacement decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
7f3b457e62 LibTextCodec: Add EUC-KR decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
ded6512ca8 LibTextCodec: Add Shift_JIS decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
06f7c393b2 LibTextCodec: Add ISO-2022-JP decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
45f0ae52be LibTextCodec: Add EUC-JP decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
9943bb1d8e LibTextCodec: Add Big5 decoder 2024-05-31 07:56:26 +02:00
Simon Wanner
2ce61fe6ea LibTextCodec: Add GBK/GB18030 decoder
Includes changes from GB-18030-2022, which are not yet included in the
Encoding Specification, but WebKit, Blink and WPT are already updated.
2024-05-31 07:56:26 +02:00
Simon Wanner
9ed52504ab LibTextCodec: Delegate to process() in default validate() implementation 2024-05-31 07:56:26 +02:00
Simon Wanner
88c2586f25 LibTextCodec: Remove unused decoder classes 2024-05-31 07:56:26 +02:00
Simon Wanner
b79815c5a5 LibTextCodec: Add x-mac-cyrillic decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
07a9435da5 LibTextCodec: Add windows-1258 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
275b89720b LibTextCodec: Add windows-1257 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
c76308c7e6 LibTextCodec: Add windows-1256 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
eb9ed10573 LibTextCodec: Add windows-1253 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
2d35687db0 LibTextCodec: Add windows-874 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
1b6878b6ca LibTextCodec: Add KOI8-U decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
1fd3a6f48c LibTextCodec: Add ISO-8859-16 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
3e882f26db LibTextCodec: Sort checks in decoder_for mostly alphabetically
Keeps checks for common encodings (Latin1 & UTF-*) at the top.
2024-05-27 20:50:50 +02:00
Simon Wanner
56241df604 LibTextCodec: Add ISO-8859-14 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
4188e328ac LibTextCodec: Add ISO-8859-13 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
cc640f4363 LibTextCodec: Add ISO-8859-10 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
d73220837e LibTextCodec: Add ISO-8859-8(-I) decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
24028e353e LibTextCodec: Add ISO-8859-7 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
01c3b8091a LibTextCodec: Add ISO-8859-6 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
763d904ad5 LibTextCodec: Add ISO-8859-5 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
c6b17320db LibTextCodec: Add ISO-8859-4 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
6c84edaaa2 LibTextCodec: Add ISO-8859-3 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
fc783199f1 LibTextCodec: Add IBM866 decoder 2024-05-27 20:50:50 +02:00
Simon Wanner
96b3c35358 LibTextCodec: Implement table based decoders as SingleByteDecoder
Instead of copy-pasting the implementation, let's use a single class.
This "Single Byte Decoder" concept even exists in the Encoding Spec :^)
2024-05-27 20:50:50 +02:00
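The "single class driven by a table" idea in 96b3c35358 can be sketched roughly as follows. This is only an illustration under assumptions: `SingleByteDecoderSketch`, `make_toy_table`, and the use of 0 as an "unmapped" marker are hypothetical and do not mirror the actual LibTextCodec class. The toy table maps only byte 0x80 (to U+20AC, as windows-1252 does) and leaves every other high byte unmapped.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a table-driven single-byte decoder.
// ASCII bytes map to themselves; bytes >= 0x80 are looked up in a
// 128-entry table. In this sketch a table entry of 0 means "unmapped".
class SingleByteDecoderSketch {
public:
    explicit SingleByteDecoderSketch(std::array<uint32_t, 128> const& table)
        : m_table(table)
    {
    }

    uint32_t decode_byte(uint8_t byte) const
    {
        if (byte < 0x80)
            return byte;
        uint32_t code_point = m_table[byte - 0x80];
        // Unmapped bytes become U+FFFD REPLACEMENT CHARACTER.
        return code_point == 0 ? 0xFFFD : code_point;
    }

    std::vector<uint32_t> decode(std::vector<uint8_t> const& input) const
    {
        std::vector<uint32_t> code_points;
        code_points.reserve(input.size());
        for (uint8_t byte : input)
            code_points.push_back(decode_byte(byte));
        return code_points;
    }

private:
    std::array<uint32_t, 128> m_table;
};

// Toy table for illustration only: 0x80 -> U+20AC, everything else unmapped.
inline std::array<uint32_t, 128> make_toy_table()
{
    std::array<uint32_t, 128> table {};
    table[0] = 0x20AC;
    return table;
}
```

With this shape, each additional encoding is one more table passed to the same class instead of one more copy of the decoding loop.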
Michal Grich
7a6d84d036 LibTextCodec: Add Windows-1250 text decoder
This commit adds Windows-1250 decoding based on the unicode.org
mapping table.
2024-04-23 16:26:16 +02:00
Andreas Kling
3c039903fb LibTextCodec+AK: Don't validate UTF-8 strings twice
UTF8Decoder was already converting invalid data into replacement
characters while converting, so we know for sure we have valid UTF-8
by the time conversion is finished.

This patch adds a new StringBuilder::to_string_without_validation()
and uses it to make UTF8Decoder avoid half the work it was doing.
2023-12-30 13:49:50 +01:00
Nico Weber
8f47acee6a LibTextCodec: Add PDFDocEncoding decoder 2023-11-22 09:08:06 -07:00
Idan Horowitz
079c96376c LibTextCodec: Support validating encoded inputs 2023-11-17 16:02:36 +01:00
Luke Wilde
eaa4048870 LibTextCodec: Add "get output encoding" from the Encoding specification 2023-06-19 06:12:26 +02:00
Timothy Flynn
00fa23237a LibTextCodec: Change UTF-8's decoder to replace invalid code points
The UTF-8 decoder will currently crash if it is provided invalid UTF-8
input. Instead, change its behavior to match that of all other decoders
to replace invalid code points with U+FFFD. This is required by the web.
2023-05-12 05:47:36 +02:00
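The behavior described in 00fa23237a (replace invalid input with U+FFFD instead of crashing) can be illustrated with a small standalone decoder. This is a sketch loosely modelled on the state machine in the Encoding Standard, not the LibTextCodec code; the function name `decode_utf8_lossy` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: decode UTF-8 into code points, emitting U+FFFD
// for every invalid or truncated sequence instead of failing.
std::vector<uint32_t> decode_utf8_lossy(std::vector<uint8_t> const& bytes)
{
    std::vector<uint32_t> out;
    uint32_t code_point = 0;
    int needed = 0; // continuation bytes expected for the current sequence
    int seen = 0;   // continuation bytes consumed so far
    uint8_t lower = 0x80;
    uint8_t upper = 0xBF;

    size_t i = 0;
    while (i < bytes.size()) {
        uint8_t byte = bytes[i];
        if (needed == 0) {
            if (byte <= 0x7F) {
                out.push_back(byte);
            } else if (byte >= 0xC2 && byte <= 0xDF) {
                needed = 1;
                code_point = byte & 0x1F;
            } else if (byte >= 0xE0 && byte <= 0xEF) {
                if (byte == 0xE0)
                    lower = 0xA0; // reject overlong three-byte forms
                if (byte == 0xED)
                    upper = 0x9F; // reject encoded surrogates
                needed = 2;
                code_point = byte & 0x0F;
            } else if (byte >= 0xF0 && byte <= 0xF4) {
                if (byte == 0xF0)
                    lower = 0x90; // reject overlong four-byte forms
                if (byte == 0xF4)
                    upper = 0x8F; // reject values above U+10FFFF
                needed = 3;
                code_point = byte & 0x07;
            } else {
                out.push_back(0xFFFD); // invalid lead byte
            }
            ++i;
            continue;
        }
        if (byte < lower || byte > upper) {
            // Bad continuation: emit U+FFFD, reset, and reprocess this
            // byte as the possible start of a new sequence.
            code_point = 0;
            needed = 0;
            seen = 0;
            lower = 0x80;
            upper = 0xBF;
            out.push_back(0xFFFD);
            continue;
        }
        lower = 0x80;
        upper = 0xBF;
        code_point = (code_point << 6) | (byte & 0x3F);
        ++i;
        if (++seen == needed) {
            out.push_back(code_point);
            code_point = 0;
            needed = 0;
            seen = 0;
        }
    }
    if (needed != 0)
        out.push_back(0xFFFD); // input ended in the middle of a sequence
    return out;
}
```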
Andreas Kling
a504ac3e2a Everywhere: Rename equals_ignoring_case => equals_ignoring_ascii_case
Let's make it clear that these functions deal with ASCII case only.
2023-03-10 13:15:44 +01:00
Luke Wilde
e864444fe3 LibTextCodec/Latin1: Iterate over input string with u8 instead of char
Using char causes bytes equal to or over 0x80 to be treated as a
negative value and produce incorrect results when implicitly casting to
u32.

For example, `atob` in LibWeb uses this decoder to convert non-ASCII
values to UTF-8, but non-ASCII values are >= 0x80 and thus produce
incorrect results in such cases:
```js
Uint8Array.from(atob("u660"), c => c.charCodeAt(0));
```
This used to produce [253, 253, 253] instead of [187, 174, 180].

Required by Cloudflare's IUAM challenges.
2023-02-28 08:46:06 +00:00
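The sign-extension pitfall described in e864444fe3 can be reproduced in isolation. The helpers below are hypothetical and exist only to contrast the two widening paths; they are not LibTextCodec functions. `signed char` is used explicitly because the signedness of plain `char` depends on the platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: widen a byte to u32 by way of a signed char.
// Bytes >= 0x80 become negative and are sign-extended.
inline uint32_t widen_via_signed_char(uint8_t byte)
{
    signed char as_char = static_cast<signed char>(byte);
    return static_cast<uint32_t>(as_char); // 0xBB becomes 0xFFFFFFBB
}

// Hypothetical helper: widen a byte to u32 directly from an unsigned type.
inline uint32_t widen_via_u8(uint8_t byte)
{
    return static_cast<uint32_t>(byte); // 0xBB stays 0x000000BB
}
```

Byte 0xBB (187) is the first byte decoded from `"u660"` in the example above; only the unsigned path preserves it as a Latin-1 code point.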
Sam Atkins
2db168acc1 LibTextCodec+Everywhere: Port Decoders to new Strings 2023-02-19 17:15:47 +01:00
Sam Atkins
3c5090e172 LibTextCodec: Return Optional<Decoder&> from bom_sniff_to_decoder() 2023-02-19 17:15:47 +01:00
Sam Atkins
f2a9426885 LibTextCodec+Everywhere: Return Optional<Decoder&> from decoder_for() 2023-02-19 17:15:47 +01:00
Sam Atkins
d6075ef5b5 LibTextCodec+Everywhere: Make TextCodec::decoder_for() take a StringView
We don't need a full String/DeprecatedString inside this function, so we
might as well not force users to create one.
2023-02-15 12:48:26 -05:00
Nico Weber
eac2b2382c LibTextCodec: Add a MacRoman decoder
Allows displaying `<meta charset="x-mac-roman">` html files.
(`:set fenc=macroman`, `:w` in vim to save in that encoding.)
2023-01-24 14:37:20 +00:00
Nico Weber
b14b5a4d06 LibTextCodec: Simplify Latin1Decoder::process() a tiny bit 2023-01-24 14:37:20 +00:00
Nico Weber
3423b54eb9 LibTextCodec: Make utf-16be and utf-16le codecs actually work
There were two problems:

1. They didn't handle surrogates
2. They used signed chars, leading to e.g. 0x00e4 being treated as 0xffe4

Also add a basic test that catches both issues.
There's some code duplication with Utf16CodePointIterator::operator*(),
but let's get things working first.
2023-01-22 21:30:44 +00:00
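Both problems listed in 3423b54eb9 can be seen in a small standalone UTF-16BE decoder. This is a sketch under assumptions, not the code from the commit: `decode_utf16be` is a hypothetical name, and replacing unpaired surrogates and a trailing odd byte with U+FFFD is a choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: decode UTF-16BE bytes into code points.
// Bytes are read as unsigned values so 0x00 0xE4 yields 0x00E4, and
// surrogate pairs are combined into a single code point.
std::vector<uint32_t> decode_utf16be(std::vector<uint8_t> const& bytes)
{
    std::vector<uint32_t> code_points;
    auto read_unit = [&](size_t at) -> uint32_t {
        return (static_cast<uint32_t>(bytes[at]) << 8) | bytes[at + 1];
    };

    size_t i = 0;
    while (i + 1 < bytes.size()) {
        uint32_t unit = read_unit(i);
        i += 2;
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < bytes.size()) {
                uint32_t next = read_unit(i);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i += 2;
                    code_points.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                    continue;
                }
            }
            code_points.push_back(0xFFFD); // unpaired high surrogate
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            code_points.push_back(0xFFFD); // unpaired low surrogate
        } else {
            code_points.push_back(unit);
        }
    }
    if (i < bytes.size())
        code_points.push_back(0xFFFD); // trailing odd byte
    return code_points;
}
```

For little-endian input the same loop applies with the two bytes in `read_unit` swapped.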
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Tim Schumacher
7834e26ddb Everywhere: Explicitly link all binaries against the LibC target
Even though the toolchain implicitly links against -lc, it does not know
where it should get LibC from except for the sysroot. In the case of
Clang this causes it to pick up the LibC stub instead, which might be
slightly outdated and feature missing symbols.

This is currently not an issue that manifests because we pass through
the dependency on LibC and other libraries by accident, which causes
CMake to link against the LibC target (instead of just the library),
and thus points the linker at the build output directory.

Since we are looking to fix that in the upcoming commits, let's make
sure that everything will still be able to find the proper LibC first.
2022-11-01 14:49:09 +00:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Idan Horowitz
086969277e Everywhere: Run clang-format 2022-04-01 21:24:45 +01:00