SpaceBlocks Mac OS


IDLE is the Python IDE built with the tkinter GUI toolkit.

IDLE has the following features:

  • coded in 100% pure Python, using the tkinter GUI toolkit
  • cross-platform: works on Windows, Unix, and Mac OS X
  • multi-window text editor with multiple undo, Python colorizing, smart indent, call tips, and many other features
  • Python shell window (a.k.a. interactive interpreter)
  • debugger (not complete, but you can set breakpoints, view and step)

25.5.1. Menus

IDLE has two main window types, the Shell window and the Editor window. It is possible to have multiple editor windows simultaneously. Output windows, such as used for Edit / Find in Files, are a subtype of edit window. They currently have the same top menu as Editor windows but a different default title and context menu.

IDLE's menus dynamically change based on which window is currently selected. Each menu documented below indicates which window type it is associated with. Click on the dotted line at the top of a menu to 'tear it off': a separate window containing the menu is created (for Unix and Windows only).

25.5.1.1. File menu (Shell and Editor)

New File
Create a new file editing window.
Open...
Open an existing file with an Open dialog.
Recent Files
Open a list of recent files. Click one to open it.
Open Module...
Open an existing module (searches sys.path).
Class Browser
Show functions, classes, and methods in the current Editor file in a tree structure. In the shell, open a module first.
Path Browser
Show sys.path directories, modules, functions, classes and methods in a tree structure.
Save
Save the current window to the associated file, if there is one. Windows that have been changed since being opened or last saved have a * before and after the window title. If there is no associated file, do Save As instead.
Save As...
Save the current window with a Save As dialog. The file saved becomes the new associated file for the window.
Save Copy As...
Save the current window to a different file without changing the associated file.
Print Window
Print the current window to the default printer.
Close
Close the current window (ask to save if unsaved).
Exit
Close all windows and quit IDLE (ask to save unsaved windows).

25.5.1.2. Edit menu (Shell and Editor)

Undo
Undo the last change to the current window. A maximum of 1000 changes may be undone.
Redo
Redo the last undone change to the current window.
Cut
Copy selection into the system-wide clipboard; then delete the selection.
Copy
Copy selection into the system-wide clipboard.
Paste
Insert contents of the system-wide clipboard into the current window.

The clipboard functions are also available in context menus.

Select All
Select the entire contents of the current window.
Find...
Open a search dialog with many options.
Find Again
Repeat the last search, if there is one.
Find Selection
Search for the currently selected string, if there is one.
Find in Files...
Open a file search dialog. Put results in a new output window.
Replace...
Open a search-and-replace dialog.
Go to Line
Move cursor to the line number requested and make that line visible.
Show Completions
Open a scrollable list allowing selection of keywords and attributes. See Completions in the Tips sections below.
Expand Word
Expand a prefix you have typed to match a full word in the same window; repeat to get a different expansion.
Show call tip
After an unclosed parenthesis for a function, open a small window with function parameter hints.
Show surrounding parens
Highlight the surrounding parentheses.

25.5.1.3. Format menu (Editor window only)

Indent Region
Shift selected lines right by the indent width (default 4 spaces).
Dedent Region
Shift selected lines left by the indent width (default 4 spaces).
Comment Out Region
Insert ## in front of selected lines.
Uncomment Region
Remove leading # or ## from selected lines.
Tabify Region
Turn leading stretches of spaces into tabs. (Note: We recommend using 4 space blocks to indent Python code.)
Untabify Region
Turn all tabs into the correct number of spaces.
Toggle Tabs
Open a dialog to switch between indenting with spaces and tabs.
New Indent Width
Open a dialog to change indent width. The accepted default by the Python community is 4 spaces.
Format Paragraph
Reformat the current blank-line-delimited paragraph in comment block or multiline string or selected line in a string. All lines in the paragraph will be formatted to less than N columns, where N defaults to 72.
Strip trailing whitespace
Remove any space characters after the last non-space character of a line.
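
As a rough illustration of what Untabify Region and Strip trailing whitespace do to a block of text, the sketch below uses only built-in string methods. It is not IDLE's implementation: the helper names untabify and strip_trailing and the tab width of 8 are assumptions made for the example, and rstrip() removes all trailing whitespace, not only space characters.

```python
def untabify(text, tabwidth=8):
    """Replace tabs with spaces so columns stay aligned (hypothetical helper)."""
    return "\n".join(line.expandtabs(tabwidth) for line in text.split("\n"))


def strip_trailing(text):
    """Remove whitespace after the last non-whitespace character of each line."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


# A tab-indented sample with stray whitespace at the ends of both lines.
sample = "def f():\t \n\treturn 1  \n"
cleaned = strip_trailing(untabify(sample))
print(repr(cleaned))
```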

25.5.1.4. Run menu (Editor window only)

Python Shell
Open or wake up the Python Shell window.
Check Module
Check the syntax of the module currently open in the Editor window. If the module has not been saved IDLE will either prompt the user to save or autosave, as selected in the General tab of the Idle Settings dialog. If there is a syntax error, the approximate location is indicated in the Editor window.
Run Module
Do Check Module (above). If no error, restart the shell to clean the environment, then execute the module.
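
The idea behind a syntax check can be sketched with the built-in compile() function, which raises SyntaxError with a line and column when the source does not parse. This is only a minimal sketch of the concept, not IDLE's own code; the function name check_syntax and the placeholder filename are assumptions.

```python
def check_syntax(source, filename="<editor>"):
    """Return None if the source compiles, else (line, column, message)."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:
        # lineno/offset give the approximate location of the problem.
        return err.lineno, err.offset, err.msg
    return None


print(check_syntax("x = 1\n"))
print(check_syntax("def f(:\n    pass\n"))
```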

25.5.1.5. Shell menu (Shell window only)

View Last Restart
Scroll the shell window to the last Shell restart.
Restart Shell
Restart the shell to clean the environment.

25.5.1.6. Debug menu (Shell window only)

Go to File/Line
Look on the current line, with the cursor, and the line above for a filename and line number. If found, open the file if not already open, and show the line. Use this to view source lines referenced in an exception traceback and lines found by Find in Files. Also available in the context menu of the Shell window and Output windows.
Debugger (toggle)
When activated, code entered in the Shell or run from an Editor will run under the debugger. In the Editor, breakpoints can be set with the context menu. This feature is still incomplete and somewhat experimental.
Stack Viewer
Show the stack traceback of the last exception in a tree widget, with access to locals and globals.
Auto-open Stack Viewer
Toggle automatically opening the stack viewer on an unhandled exception.
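
To make the Go to File/Line lookup concrete, here is a small sketch that extracts a filename and line number from a traceback-style line. The regular expression and the function name find_file_line are assumptions for illustration; IDLE may recognize more formats than the single one shown here.

```python
import re

# Matches the 'File "<name>", line <n>' form that Python tracebacks use.
TRACEBACK_RE = re.compile(r'File "([^"]+)", line (\d+)')


def find_file_line(text):
    """Return (filename, line_number) if the text names a location, else None."""
    match = TRACEBACK_RE.search(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(find_file_line('  File "/tmp/demo.py", line 12, in <module>'))
```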

25.5.1.7. Options menu (Shell and Editor)

Configure IDLE
Open a configuration dialog. Fonts, indentation, keybindings, and color themes may be altered. Startup Preferences may be set, and additional help sources can be specified. Non-default user settings are saved in a .idlerc directory in the user's home directory. Problems caused by bad user configuration files are solved by editing or deleting one or more of the files in .idlerc.
Configure Extensions
Open a configuration dialog for setting preferences for extensions (discussed below). See note above about the location of user settings.
Code Context (toggle) (Editor Window only)
Open a pane at the top of the edit window which shows the block context of the code which has scrolled above the top of the window.

25.5.1.8. Window menu (Shell and Editor)

Zoom Height
Toggles the window between normal size and maximum height. The initial size defaults to 40 lines by 80 chars unless changed on the General tab of the Configure IDLE dialog.

The rest of this menu lists the names of all open windows; select one to bring it to the foreground (deiconifying it if necessary).

25.5.1.9. Help menu (Shell and Editor)

About IDLE
Display version, copyright, license, credits, and more.
IDLE Help
Display a help file for IDLE detailing the menu options, basic editing and navigation, and other tips.
Python Docs
Access local Python documentation, if installed, or start a web browser and open docs.python.org showing the latest Python documentation.
Turtle Demo
Run the turtledemo module with example Python code and turtle drawings.

Additional help sources may be added here with the Configure IDLE dialog under the General tab.

25.5.1.10. Context Menus

Open a context menu by right-clicking in a window (Control-click on OS X). Context menus have the standard clipboard functions also on the Edit menu.

Cut
Copy selection into the system-wide clipboard; then delete the selection.
Copy
Copy selection into the system-wide clipboard.
Paste
Insert contents of the system-wide clipboard into the current window.

Editor windows also have breakpoint functions. Lines with a breakpoint set are specially marked. Breakpoints only have an effect when running under the debugger. Breakpoints for a file are saved in the user's .idlerc directory.

Set Breakpoint
Set a breakpoint on the current line.
Clear Breakpoint
Clear the breakpoint on that line.

Shell and Output windows have the following.

Go to file/line
Same as in Debug menu.

25.5.2. Editing and navigation

In this section, 'C' refers to the Control key on Windows and Unix and the Command key on Mac OS X.

  • Backspace deletes to the left; Del deletes to the right

  • C-Backspace delete word left; C-Del delete word to the right

  • Arrow keys and PageUp/PageDown to move around

  • C-LeftArrow and C-RightArrow move by words

  • Home/End go to begin/end of line

  • C-Home/C-End go to begin/end of file

  • Some useful Emacs bindings are inherited from Tcl/Tk:

    • C-a beginning of line
    • C-e end of line
    • C-k kill line (but doesn't put it in clipboard)
    • C-l center window around the insertion point
    • C-b go backwards one character without deleting (usually you can also use the cursor key for this)
    • C-f go forward one character without deleting (usually you can also use the cursor key for this)
    • C-p go up one line (usually you can also use the cursor key for this)
    • C-d delete next character

Standard keybindings (like C-c to copy and C-v to paste) may work. Keybindings are selected in the Configure IDLE dialog.

25.5.2.1. Automatic indentation

After a block-opening statement, the next line is indented by 4 spaces (in the Python Shell window by one tab). After certain keywords (break, return etc.) the next line is dedented. In leading indentation, Backspace deletes up to 4 spaces if they are there. Tab inserts spaces (in the Python Shell window one tab); the number depends on Indent width. Currently tabs are restricted to four spaces due to Tcl/Tk limitations.

See also the indent/dedent region commands in the Format menu.

25.5.2.2. Completions

Completions are supplied for functions, classes, and attributes of classes, both built-in and user-defined. Completions are also provided for filenames.

The AutoCompleteWindow (ACW) will open after a predefined delay (default is two seconds) after a '.' or (in a string) an os.sep is typed. If after one of those characters (plus zero or more other characters) a tab is typed the ACW will open immediately if a possible continuation is found.

If there is only one possible completion for the characters entered, a Tab will supply that completion without opening the ACW.

'Show Completions' will force open a completions window; by default, C-space also opens one. In an empty string, this will contain the files in the current directory. On a blank line, it will contain the built-in and user-defined functions and classes in the current namespaces, plus any modules imported. If some characters have been entered, the ACW will attempt to be more specific.

If a string of characters is typed, the ACW selection will jump to the entry most closely matching those characters. Entering a tab will cause the longest non-ambiguous match to be entered in the Editor window or Shell. Two tabs in a row will supply the current ACW selection, as will return or a double click. Cursor keys, Page Up/Down, mouse selection, and the scroll wheel all operate on the ACW.

'Hidden' attributes can be accessed by typing the beginning of a hidden name after a '.', e.g. '_'. This allows access to modules with __all__ set, or to class-private attributes.
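
The filtering described above can be approximated with dir(). The sketch below is an assumption about the general idea only, not IDLE's completion code, and complete_attributes is a hypothetical name: names starting with an underscore are left out unless the typed prefix itself starts with one.

```python
import math


def complete_attributes(obj, prefix=""):
    """List attribute names of obj that start with prefix (hypothetical helper)."""
    names = [name for name in dir(obj) if name.startswith(prefix)]
    if not prefix.startswith("_"):
        # Hide 'hidden' names unless the user asked for them explicitly.
        names = [name for name in names if not name.startswith("_")]
    return sorted(names)


print(complete_attributes(math, "sq"))
print(complete_attributes(math, "__"))
```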

Completions and the 'Expand Word' facility can save a lot of typing!

Completions are currently limited to those in the namespaces. Names in an Editor window which are not available via __main__ and sys.modules will not be found. Run the module once with your imports to correct this situation. Note that IDLE itself places quite a few modules in sys.modules, so much can be found by default, e.g. the re module.

If you don't like the ACW popping up unbidden, make the delay longer or disable the extension. Another option is to set the delay to zero. A further way to prevent ACW popups is to disable the call tips extension.

25.5.2.3. Python Shell window

  • C-c interrupts executing command

  • C-d sends end-of-file; closes window if typed at a >>> prompt

  • Alt-/ (Expand word) is also useful to reduce typing

    Command history

    • Alt-p retrieves previous command matching what you have typed. On OS X use C-p.
    • Alt-n retrieves next. On OS X use C-n.
    • Return while on any previous command retrieves that command

25.5.3. Syntax colors

The coloring is applied in a background 'thread', so you may occasionally see uncolorized text. To change the color scheme, edit the [Colors] section in config.txt.

Python syntax colors:
Keywords
orange
Strings
green
Comments
red
Definitions
blue
Shell colors:
Console output
brown
stdout
blue
stderr
dark green
stdin
black

25.5.4. Startup

Upon startup with the -s option, IDLE will execute the file referenced by the environment variables IDLESTARTUP or PYTHONSTARTUP. IDLE first checks for IDLESTARTUP; if IDLESTARTUP is present the file referenced is run. If IDLESTARTUP is not present, IDLE checks for PYTHONSTARTUP. Files referenced by these environment variables are convenient places to store functions that are used frequently from the IDLE shell, or for executing import statements to import common modules.
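
The lookup order described above can be written out in a few lines. This is a sketch of the rule as stated, not a copy of IDLE's source; startup_file is a hypothetical helper, and treating an empty variable as absent is an assumption.

```python
import os


def startup_file(environ=None):
    """Return the startup file path chosen by the rule above, or None."""
    if environ is None:
        environ = os.environ
    # IDLESTARTUP takes precedence; PYTHONSTARTUP is only a fallback.
    # Assumption: an empty value counts as "not present".
    return environ.get("IDLESTARTUP") or environ.get("PYTHONSTARTUP")


print(startup_file({"PYTHONSTARTUP": "/home/user/.pythonrc.py"}))
```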

In addition, Tk also loads a startup file if it is present. Note that the Tk file is loaded unconditionally. This additional file is .Idle.py and is looked for in the user's home directory. Statements in this file will be executed in the Tk namespace, so this file is not useful for importing functions to be used from IDLE's Python shell.

25.5.4.1. Command line usage

If there are arguments:

  1. If -e is used, arguments are files opened for editing and sys.argv reflects the arguments passed to IDLE itself.
  2. Otherwise, if -c is used, all arguments are placed in sys.argv[1:...], with sys.argv[0] set to '-c'.
  3. Otherwise, if neither -e nor -c is used, the first argument is a script which is executed with the remaining arguments in sys.argv[1:...] and sys.argv[0] set to the script name. If the script name is '-', no script is executed but an interactive Python session is started; the arguments are still available in sys.argv.

25.5.4.2. Running without a subprocess

If IDLE is started with the -n command line switch it will run in a single process and will not create the subprocess which runs the RPC Python execution server. This can be useful if Python cannot create the subprocess or the RPC socket interface on your platform. However, in this mode user code is not isolated from IDLE itself. Also, the environment is not restarted when Run/Run Module (F5) is selected. If your code has been modified, you must reload() the affected modules and re-import any specific items (e.g. from foo import baz) if the changes are to take effect. For these reasons, it is preferable to run IDLE with the default subprocess if at all possible.
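
In Python 3, the reload function is importlib.reload(). The sketch below shows why both steps are needed: reloading updates the module object, but a name bound earlier with a from-import keeps pointing at the old function until it is re-imported. The module foo_demo and its temporary directory are hypothetical and exist only for this demonstration.

```python
import importlib
import pathlib
import sys
import tempfile

# Hypothetical module "foo_demo", written to a temporary directory for the demo.
sys.dont_write_bytecode = True            # avoid cached bytecode in this demo
tmpdir = tempfile.mkdtemp()
sys.path.insert(0, tmpdir)
module_path = pathlib.Path(tmpdir) / "foo_demo.py"

module_path.write_text("def baz():\n    return 'old'\n")
importlib.invalidate_caches()
import foo_demo
from foo_demo import baz

# Simulate editing the file, then reload the module object in place.
module_path.write_text("def baz():\n    return 'new version'\n")
importlib.reload(foo_demo)

stale = baz()                # the earlier from-import still refers to the old function
from foo_demo import baz     # re-import the specific item
fresh = baz()
print(stale, "->", fresh)
```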

25.5.5. Help and preferences

25.5.5.1. Additional help sources

IDLE includes a help menu entry called 'Python Docs' that will open the extensive sources of help, including tutorials, available at docs.python.org. Selected URLs can be added or removed from the help menu at any time using the Configure IDLE dialog. See the IDLE help option in the help menu of IDLE for more information.

25.5.5.2. Setting preferences

The font preferences, highlighting, keys, and general preferences can be changed via Configure IDLE on the Options menu. Keys can be user defined; IDLE ships with four built-in key sets. In addition a user can create a custom key set in the Configure IDLE dialog under the keys tab.

25.5.5.3. Extensions

IDLE contains an extension facility. Preferences for extensions can be changed with Configure Extensions. See the beginning of config-extensions.def in the idlelib directory for further information. The default extensions are currently:

  • FormatParagraph
  • AutoExpand
  • ZoomHeight
  • ScriptBinding
  • CallTips
  • ParenMatch
  • AutoComplete
  • CodeContext
  • RstripExtension

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments (which can be assumed), and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

Compatibility issues

A UTF-8 file that contains only ASCII characters is identical to an ASCII file. Legacy programs can generally handle UTF-8 encoded files, even if they contain non-ASCII characters. For instance, the C printf function can print a UTF-8 string, as it only looks for the ASCII '%' character to define a formatting string, and prints all other bytes unchanged, thus non-ASCII characters will be output unchanged.
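
A quick check of the ASCII compatibility claim: the sketch below encodes an ASCII-only string both ways and then looks at the bytes of a string with accented letters. The sample strings are arbitrary choices for illustration.

```python
text = "Hello, world!"
same_bytes = text.encode("utf-8") == text.encode("ascii")

mixed = "naïve café"
data = mixed.encode("utf-8")
# Every byte that belongs to a non-ASCII character has the high bit set,
# so it can never be mistaken for an ASCII byte such as '%'.
high_bytes = [byte for byte in data if byte >= 0x80]

print(same_bytes, len(data), len(high_bytes))
```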

UTF-16 and UTF-32 are incompatible with ASCII files, and thus require Unicode-aware programs to display, print and manipulate them, even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, the strings cannot be manipulated by normal null-terminated string handling for even simple operations such as copy.
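
The zero bytes are easy to see by encoding two ASCII letters. In this sketch, splitting at the first zero byte stands in for what a C-style null-terminated string routine would see; that stand-in is an illustration, not a real C call.

```python
ascii_text = "AB"
utf16 = ascii_text.encode("utf-16-le")
utf32 = ascii_text.encode("utf-32-le")

# A null-terminated string routine stops at the first zero byte.
c_string_view = utf16.split(b"\x00", 1)[0]

print(utf16, utf32, c_string_view)
```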

Therefore, even on most UTF-16 systems such as Windows and Java, UTF-16 text files are not common; older 8-bit encodings such as ASCII or ISO-8859-1 are still used, forgoing Unicode support; or UTF-8 is used for Unicode. One rare counter-example is the 'strings' file used by Mac OS X (10.3 and later) applications for lookup of internationalized versions of messages which defaults to UTF-16, with 'files encoded using UTF-8 ... not guaranteed to work.'[1]

XML is, by default, encoded as UTF-8, and all XML processors must at least support UTF-8 (including US-ASCII by definition) and UTF-16.[2]

Efficiency

UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character. The first 128 Unicode code points, U+0000 to U+007F, used for the C0 Controls and Basic Latin characters and which correspond one-to-one to their ASCII-code equivalents, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32.

The next 1,920 characters, U+0080 to U+07FF (encompassing the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Tāna and N'Ko), require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, i.e. the remainder of the characters in the Basic Multilingual Plane (BMP, plane 0, U+0000 to U+FFFF), which encompasses the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the supplementary planes (planes 1-16), require 32 bits in UTF-8, UTF-16 and UTF-32.
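
The sizes quoted above can be reproduced by encoding one sample character from each range. The sample characters and the helper name encoded_sizes are choices made for this sketch; the little-endian codec variants are used so that no byte-order mark is counted.

```python
CODECS = {"UTF-8": "utf-8", "UTF-16": "utf-16-le", "UTF-32": "utf-32-le"}


def encoded_sizes(character):
    """Return the number of bytes needed for one character in each encoding."""
    return {name: len(character.encode(codec)) for name, codec in CODECS.items()}


samples = {
    "U+0041": "A",            # Basic Latin
    "U+00E9": "\u00e9",       # U+0080 to U+07FF
    "U+20AC": "\u20ac",       # rest of the BMP
    "U+1F600": "\U0001F600",  # supplementary plane
}
for label, character in samples.items():
    print(label, encoded_sizes(character))
```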

All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, UTF-7 is more space efficient than the combination of other Unicode encodings with quoted-printable or base64 for almost all types of text (see 'Seven-bit environments' below).

Storage utilization

Each format has its own set of advantages and disadvantages with respect to storage efficiency (and thus also of transmission time) and processing efficiency. Storage efficiency depends on where in the Unicode code space a given text's characters predominantly fall. Since Unicode code space blocks are organized by character set (i.e. alphabet/script), storage efficiency of any given text effectively depends on the alphabet/script used for that text. So, for example, UTF-8 needs one less byte per character (8 versus 16 bits) than UTF-16 for the 128 code points between U+0000 and U+007F, but needs one more byte per character (24 versus 16 bits) for the 63,488 code points between U+0800 and U+FFFF. Therefore, if there are more characters in the range U+0000 to U+007F than there are in the range U+0800 to U+FFFF then UTF-8 is more efficient, while if there are fewer, then UTF-16 is more efficient. If the counts are equal then they are exactly the same size. A surprising result is that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, punctuation, newlines, HTML markup, and embedded words and acronyms written with Latin letters.[citation needed]
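
The trade-off can be seen on a tiny, made-up example: a short Japanese phrase on its own is smaller in UTF-16, but wrapping it in a little HTML markup tips the balance toward UTF-8. The sample strings are invented for this sketch and are far shorter than any real document.

```python
plain = "こんにちは世界"                      # 7 characters, all in U+0800 to U+FFFF
html = '<p class="note">' + plain + "</p>"   # adds 20 ASCII characters


def sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print("plain:", sizes(plain))
print("html: ", sizes(html))
```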

Processing time

As far as processing time is concerned, text with variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to find the individual code units, as opposed to working with sequences of code units. Searching is unaffected by whether the characters are variable sized, since a search for a sequence of code units does not care about the divisions (it does require that the encoding be self-synchronizing, which both UTF-8 and UTF-16 are). A common misconception is that there is a need to 'find the nth character' and that this requires a fixed-length encoding; however, in real use the number n is only derived from examining the n−1 characters, thus sequential access is needed anyway.[citation needed]

UTF-16BE and UTF-32BE are big-endian, UTF-16LE and UTF-32LE are little-endian. When character sequences in one endian order are loaded onto a machine with a different endian order, the characters need to be converted before they can be processed efficiently, unless data is processed with a byte granularity (as required for UTF-8). Accordingly, the issue at hand is more pertinent to the protocol and communication than to a computational difficulty.

Processing issues

For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode code point. To allow easy searching and truncation, a sequence must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but UTF-7 and GB 18030 do not.

Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to combining characters. Considering these incompatibilities and other quirks among different encoding schemes, handling Unicode data with the same (or compatible) protocol throughout and across the interfaces (e.g. using an API/library, handling Unicode characters in a client/server model, etc.) can in general simplify the whole pipeline while eliminating a potential source of bugs at the same time.
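
Combining characters are why even UTF-32 has no fixed size per displayed character. The sketch below compares the precomposed letter é with the same letter written as 'e' plus a combining acute accent; the choice of this letter is only an example.

```python
import unicodedata

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# Both display as one character, but their UTF-32 sizes differ.
composed_size = len(composed.encode("utf-32-le"))
decomposed_size = len(decomposed.encode("utf-32-le"))

# Normalization maps one form onto the other.
normalized = unicodedata.normalize("NFC", decomposed)

print(composed_size, decomposed_size, normalized == composed)
```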

UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width. However, using UTF-16 makes characters outside the Basic Multilingual Plane a special case which increases the risk of oversights related to their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.
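
For reference, the surrogate-pair arithmetic that UTF-16 applies to code points outside the Basic Multilingual Plane is short. The function name to_surrogates is an assumption for this sketch; the result is checked against Python's own UTF-16 encoder.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
encoded = chr(0x1F600).encode("utf-16-be")
print(hex(high), hex(low), encoded)
```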

If any stored data is in UTF-8 (such as file contents or names), it is very difficult to write a system that uses UTF-16 or UTF-32 as an API. This is due to the oft-overlooked fact that the byte array used by UTF-8 can physically contain invalid sequences. For instance, it is impossible to fix an invalid UTF-8 filename using a UTF-16 API, as no possible UTF-16 string will translate to that invalid filename. The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names, making UTF-8 preferred in any such mixed environment. An unfortunate but far more common workaround used by UTF-16 systems is to interpret the UTF-8 as some other encoding such as CP-1252 and ignore the mojibake for any non-ASCII data.
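
Python's way around the invalid-filename problem is the surrogateescape error handler, which smuggles undecodable bytes through a str and restores them on encoding. This is Python's technique rather than anything the text above prescribes, and the filename bytes below are hypothetical.

```python
# Hypothetical filename bytes; 0xFF never occurs in valid UTF-8.
raw_name = b"report-\xff.txt"

try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The invalid byte becomes a lone surrogate and survives a round trip.
name = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = name.encode("utf-8", errors="surrogateescape")

print(strict_ok, repr(name), round_trip == raw_name)
```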

For communication and storage

UTF-16 and UTF-32 do not have endianness defined, so a byte order must be selected when receiving them over a byte-oriented network or reading them from a byte-oriented storage. This may be achieved by using a byte-order mark at the start of the text or assuming big-endian (RFC 2781). UTF-8, UTF-16BE, UTF-32BE, UTF-16LE and UTF-32LE are standardised on a single byte order and do not have this problem.
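
The role of the byte-order mark can be seen with Python's codecs: the plain "utf-16" codec writes a BOM and honours one when reading, while the "-be" and "-le" variants fix the order and write none. The sample text is arbitrary.

```python
import codecs

text = "hi"
with_bom = text.encode("utf-16")      # native byte order, BOM prepended
big_endian = text.encode("utf-16-be")  # fixed order, no BOM

# The BOM tells the decoder which order to use.
decoded = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")

# Guessing the wrong order yields different, still valid, characters.
misread = big_endian.decode("utf-16-le")

print(with_bom[:2], decoded, misread)
```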

If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronize after a corrupt or missing byte at the start of the next code point; GB 18030 is unable to recover until the next ASCII non-number. UTF-16 can handle altered bytes, but not an odd number of missing bytes, which will garble all the following text (though it will produce uncommon and/or unassigned characters). If bits can be lost all of them will garble the following text, though UTF-8 can be resynchronized as incorrect byte boundaries will produce invalid UTF-8 in text longer than a few bytes.
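
Resynchronizing UTF-8 after damage relies on the fact that continuation bytes always have the bit pattern 10xxxxxx. A minimal sketch, with resync as a hypothetical helper name, skips such bytes until the next possible start of a code point.

```python
def resync(data, start):
    """Return the index of the first non-continuation byte at or after start."""
    index = start
    while index < len(data) and (data[index] & 0xC0) == 0x80:
        index += 1
    return index


data = "a€b".encode("utf-8")          # 'a', three bytes for the euro sign, 'b'
damaged = data[:1] + data[2:]         # the lead byte of the euro sign is lost
next_start = resync(damaged, 1)

print(damaged, next_start, damaged[next_start:].decode("utf-8"))
```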

In detail

The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.

N.B. The tables below list numbers of bytes per code point, not per user visible 'character' (or 'grapheme cluster'). It can take multiple code points to describe a single grapheme cluster, so even in UTF-32, care must be taken when splitting or concatenating strings.

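Eight-bit environments

Code range (hexadecimal)   UTF-8   UTF-16   UTF-32   UTF-EBCDIC   GB 18030
000000 – 00007F            1       2        4        1            1
000080 – 00009F            2       2        4        1            2 or 4 (see note)
0000A0 – 0003FF            2       2        4        2            2 or 4 (see note)
000400 – 0007FF            2       2        4        3            2 or 4 (see note)
000800 – 003FFF            3       2        4        3            2 or 4 (see note)
004000 – 00FFFF            3       2        4        4            2 or 4 (see note)
010000 – 03FFFF            4       4        4        4            4
040000 – 10FFFF            4       4        4        5            4

Note: for 000080 – 00FFFF, GB 18030 needs 2 bytes for characters inherited from GB 2312/GBK (e.g. most Chinese characters) and 4 bytes for everything else.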

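Seven-bit environments

This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, see the actual specifications.

Bytes per code point, by code range (hexadecimal):

ASCII graphic characters (except U+003D '=')
  UTF-7: 1 for 'direct characters' (depends on the encoder setting for some code points), 2 for U+002B '+', otherwise same as for 000080 – 00FFFF
  UTF-8 quoted-printable: 1
  UTF-8 base64: 1+1/3
  UTF-16 q.-p.: 4
  UTF-16 base64: 2+2/3
  GB 18030 q.-p.: 1
  GB 18030 base64: 1+1/3

00003D (equals sign)
  UTF-7: same as for ASCII graphic characters
  UTF-8 quoted-printable: 3
  UTF-8 base64: 1+1/3
  UTF-16 q.-p.: 6
  UTF-16 base64: 2+2/3
  GB 18030 q.-p.: 3
  GB 18030 base64: 1+1/3

ASCII control characters: 000000 – 00001F and 00007F
  UTF-7: same as for ASCII graphic characters
  UTF-8 quoted-printable: 1 or 3 depending on directness
  UTF-8 base64: 1+1/3
  UTF-16 q.-p.: 6
  UTF-16 base64: 2+2/3
  GB 18030 q.-p.: 1 or 3 depending on directness
  GB 18030 base64: 1+1/3

000080 – 0007FF
  UTF-7: 5 for an isolated case inside a run of single byte characters. For runs, 2+2/3 per character plus padding to make it a whole number of bytes plus two to start and finish the run
  UTF-8 quoted-printable: 6
  UTF-8 base64: 2+2/3
  UTF-16 q.-p.: 2–6 depending on if the byte values need to be escaped
  UTF-16 base64: 2+2/3
  GB 18030 q.-p.: 4–6 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 8 for everything else
  GB 18030 base64: 2+2/3 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 5+1/3 for everything else

000800 – 00FFFF
  UTF-7: same as for 000080 – 0007FF
  UTF-8 quoted-printable: 9
  UTF-8 base64: 4
  UTF-16 q.-p.: same as for 000080 – 0007FF
  UTF-16 base64: 2+2/3
  GB 18030 q.-p.: same as for 000080 – 0007FF
  GB 18030 base64: same as for 000080 – 0007FF

010000 – 10FFFF
  UTF-7: 8 for isolated case, 5+1/3 per character plus padding to integer plus 2 for a run
  UTF-8 quoted-printable: 12
  UTF-8 base64: 5+1/3
  UTF-16 q.-p.: 8–12 depending on if the low bytes of the surrogates need to be escaped
  UTF-16 base64: 5+1/3
  GB 18030 q.-p.: 8
  GB 18030 base64: 5+1/3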

Endianness does not affect sizes (UTF-16BE and UTF-32BE have the same size as UTF-16LE and UTF-32LE, respectively). The use of UTF-32 under quoted-printable is highly impractical, but if implemented, will result in 8–12 bytes per code point (about 10 bytes on average), namely for BMP, each code point will occupy exactly 6 bytes more than the same code in quoted-printable/UTF-16. Base64/UTF-32 gets 5+1/3 bytes for any code point.

An ASCII control character under quoted-printable or UTF-7 may be represented either directly or encoded (escaped). The need to escape a given control character depends on many circumstances, but newlines in text data are usually coded directly.
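
A few of the seven-bit figures can be checked with the standard library. The sketch below uses the letter é (U+00E9, in the 000080 – 0007FF range) as an arbitrary sample; base64 is measured on a run of three characters so that no padding is involved.

```python
import base64
import quopri

character = "é"                                   # U+00E9
utf8 = character.encode("utf-8")                  # two bytes

quoted_printable = quopri.encodestring(utf8)      # both bytes are escaped
utf7 = character.encode("utf-7")                  # isolated non-ASCII character
b64_run = base64.b64encode((character * 3).encode("utf-8"))

print(quoted_printable, utf7, b64_run)
print("base64 bytes per character:", len(b64_run) / 3)
```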

Compression schemes

BOCU-1 and SCSU are two ways to compress Unicode data. Their encoding relies on how frequently the text is used. Most runs of text use the same script; for example, Latin, Cyrillic, Greek and so on. This normal use allows many runs of text to compress down to about 1 byte per code point. These stateful encodings make it more difficult to randomly access text at any position of a string.

These two compression schemes are not as efficient as other compression schemes, like zip or bzip2. Those general-purpose compression schemes can compress longer runs of bytes to just a few bytes. The SCSU and BOCU-1 compression schemes will not compress more than the theoretical 25% of text encoded as UTF-8, UTF-16 or UTF-32. Other general-purpose compression schemes can easily compress to 10% of original text size. The general purpose schemes require more complicated algorithms and longer chunks of text for a good compression ratio.
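
To give a feel for general-purpose compression, the sketch below runs zlib (the DEFLATE algorithm also used by zip) over a deliberately repetitive sample. The sample text is artificial, so the ratio it prints is far better than typical prose would achieve and should not be read as a benchmark.

```python
import zlib

# Artificial, highly repetitive sample text.
text = "The quick brown fox jumps over the lazy dog. " * 200
raw = text.encode("utf-8")
packed = zlib.compress(raw)
ratio = len(packed) / len(raw)

print(len(raw), len(packed), round(ratio, 3))
```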

Unicode Technical Note #14 contains a more detailed comparison of compression schemes.

Historical: UTF-5 and UTF-6

Proposals have been made for a UTF-5 and UTF-6 for the internationalization of domain names (IDN). The UTF-5 proposal used a base 32 encoding, where Punycode is (among other things, and not exactly) a base 36 encoding. The name UTF-5 for a code unit of 5 bits is explained by the equation 2^5 = 32.[3] The UTF-6 proposal added a running length encoding to UTF-5; here 6 simply stands for UTF-5 plus 1.[4] The IETF IDN WG later adopted the more efficient Punycode for this purpose.[5]

Not being seriously pursued

UTF-1 never gained serious acceptance. UTF-8 is much more frequently used.

UTF-9 and UTF-18, despite being functional encodings, were April Fools' Day RFC joke specifications.

References

  1. Apple Developer Connection: Internationalization Programming Topics: Strings Files
  2. 'Character Encoding in Entities'. Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C. 2008.
  3. Seng, James, UTF-5, a transformation format of Unicode and ISO 10646, 28 January 2000
  4. Welter, Mark; Spolarich, Brian W. (16 November 2000). 'UTF-6 - Yet Another ASCII-Compatible Encoding for ID'. Internet Engineering Task Force. Archived from the original on 23 May 2016. Retrieved 9 April 2016.
  5. Historical IETF IDN WG page



