Web pages authored using hypertext markup language (HTML) may contain multilingual text represented with the Unicode universal character set.

The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and web users alike. The accurate representation of text in web pages from different natural languages and writing systems is complicated by the details of character encoding, markup language syntax, font, and varying levels of support by web browsers.

HTML document characters

Web pages are typically HTML or XHTML documents. Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.

An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML document character set: a character repertoire wherein each character is assigned a unique, non-negative integer code point. This set is defined in the HTML 4.0 DTD, which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by Unicode and ISO/IEC 10646: the Universal Character Set (UCS).

Like HTML documents, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an XML document, which, while not having an explicit "document character" layer of abstraction, nevertheless relies upon a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.

Regardless of whether the document is HTML or XHTML, when stored on a file system or transmitted over a network, the document's characters are encoded as a sequence of bit octets (bytes) according to a particular character encoding. This encoding may either be a Unicode Transformation Format, like UTF-8, that can directly encode any Unicode character, or a legacy encoding, like Windows-1252, that cannot. However, even when using encodings that do not support all Unicode characters, the encoded document may make use of numeric character references. For example, &#x263A; (☺) is used to indicate a smiling face character in the Unicode character set.
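As a rough illustration of the difference between the two kinds of encoding, the Python sketch below encodes the smiling face character first with UTF-8 and then with Windows-1252. The helper name encode_for_html is hypothetical; the xmlcharrefreplace error handler is part of Python's standard codec machinery and substitutes a decimal numeric character reference for any character the target encoding cannot represent.

```python
def encode_for_html(text: str, encoding: str) -> bytes:
    """Encode text, falling back to numeric character references
    for characters the target encoding cannot represent."""
    return text.encode(encoding, errors="xmlcharrefreplace")


smiley = "\u263A"  # WHITE SMILING FACE

# UTF-8 can encode the character directly (three octets).
utf8_bytes = encode_for_html(smiley, "utf-8")

# Windows-1252 cannot, so a decimal reference is emitted instead.
legacy_bytes = encode_for_html(smiley, "cp1252")

print(utf8_bytes)    # b'\xe2\x98\xba'
print(legacy_bytes)  # b'&#9786;'
```

Characters that the legacy encoding does support, such as é in Windows-1252, pass through as ordinary single octets.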

Numeric character references

Main article: Numeric character reference

In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.

For example, a Unicode code point like U+53F6, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by &# and followed by ;, like this: &#21494;, which produces this: 叶 (if it doesn't look like a Chinese character, see the special characters note at bottom of article).

The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers, but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example &#21494; instead of &#x53F6;).
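The conversion between a code point and its two reference forms can be sketched in a few lines of Python. The function name numeric_references is hypothetical; ord and html.unescape are standard library calls.

```python
import html


def numeric_references(char: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"


decimal_ref, hex_ref = numeric_references("\u53F6")
print(decimal_ref)  # &#21494;
print(hex_ref)      # &#x53F6;

# Both forms decode back to the same character.
assert html.unescape(decimal_ref) == html.unescape(hex_ref) == "\u53F6"
```

Because the references themselves consist only of ASCII characters, the resulting markup can be stored in any ASCII-compatible legacy encoding.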

Named character entities

Main article: character entity reference

In HTML there is a standard set of 252 named character entities for characters — some common, some obscure — that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.

Character entities can be included in an HTML document via the use of entity references, which take the form &EntityName;, where EntityName is the name of the entity. For example, &mdash;, much like &#8212; or &#x2014;, represents U+2014: the em dash character — like this — even if the character encoding used doesn't contain that character.

For the full list, see: List of XML and HTML character entity references.
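As a small illustration, Python's standard library exposes the HTML 4 entity names as a dictionary, which makes the equivalence of the named, decimal, and hexadecimal forms easy to check. This is only a sketch of how one might look entities up; it says nothing about how any particular browser resolves them.

```python
import html
from html.entities import name2codepoint

# Look up the code point behind a named entity.
code_point = name2codepoint["mdash"]
print(f"U+{code_point:04X}")  # U+2014

# The named, decimal, and hexadecimal forms all denote the same character.
named = html.unescape("&mdash;")
decimal = html.unescape("&#8212;")
hexadecimal = html.unescape("&#x2014;")
assert named == decimal == hexadecimal == "\u2014"

# Markup-sensitive characters are escaped with entity references.
escaped = html.escape('<a href="x">')
print(escaped)  # &lt;a href=&quot;x&quot;&gt;
```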

Character encoding determination

In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. In order to do this, the web browser must know what encoding was used. When a document is transmitted via a MIME message or a transport that uses MIME content types such as an HTTP response, the message may signal the encoding via a Content-Type header, such as Content-Type: text/html; charset=ISO-8859-1. Other external means of declaring encoding are permitted but rarely used. The encoding may also be declared within the document itself, in the form of a META element, like <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. This requires an extension of ASCII to be used, like UTF-8. When there is no encoding declaration, the default varies depending on the localisation of the browser.

For a system set up mainly for Western European languages, it will generally be ISO-8859-1 or its close relation Windows-1252. For a browser from a location where multibyte character encodings are the norm, some form of autodetection is likely to be applied.
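The lookup order described above can be sketched roughly as follows. This is a simplified illustration, not how any particular browser works: the function names, the regular expression, the 1024-octet scan window, and the choice to let the HTTP header take precedence over the META element are all assumptions made for the example. Only email.message.Message.get_content_charset and the re module are real standard library APIs.

```python
import re
from email.message import Message
from typing import Optional

# Hypothetical, simplified pattern for a charset inside a META element.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lower-cased, or None if absent


def declared_encoding(content_type: Optional[str], body: bytes,
                      default: str = "windows-1252") -> str:
    """Return the declared encoding, or a locale-style default."""
    header_charset = charset_from_header(content_type)
    if header_charset:
        return header_charset
    # Scan only the start of the document (assumed window size).
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


print(declared_encoding("text/html; charset=ISO-8859-1", b""))  # iso-8859-1
```

The default argument stands in for the browser's localisation-dependent fallback; a Western European configuration is assumed here.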

Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk, and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. It is also a common misunderstanding that the encoding declaration effects a change in the actual encoding, whereas it is actually just a label that could be inaccurate.
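The point that a declaration is only a label can be demonstrated with a short Python sketch. The scenario is hypothetical: the octets are assumed to have been written as UTF-8 while the declaration claims Windows-1252.

```python
text = "caf\u00e9"  # "café"

# What the editor actually wrote to disk.
stored = text.encode("utf-8")

# What the (inaccurate) declaration claims.
declared = "cp1252"

# Decoding with the declared label does not change the octets;
# it only changes how they are interpreted.
misread = stored.decode(declared)
correct = stored.decode("utf-8")

print(misread)  # cafÃ©
print(correct)  # café
```

The octets on disk are identical in both cases; only the label used to interpret them differs.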

Many HTML documents are served with inaccurate encoding declarations, or no declarations at all. In order to determine the encoding in such cases, many browsers allow the user to manually select one from a list. They may also employ an encoding autodetection algorithm that works in concert with the manual override. The manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. This has been addressed somewhat by XHTML, which, being XML, requires that encoding declarations be accurate and that no workarounds be employed when they're found to be inaccurate. Though XML does permit higher protocols to override encodings or hand off encoding information for documents without encoding declarations, HTTP only does so for resources with content-type "text/*". Therefore, for XML documents and XHTML documents delivered as content-type "application/xhtml+xml", there is no danger that inaccurate HTTP header information will override a correctly declared encoding in an XML (and XHTML) document.

For both serializations of HTML (content-type "text/html" and content-type "application/xhtml+xml"), using UTF-16 (or UTF-32) also provides an effective way to transmit encoding information within an HTML document. Since these UTF encodings require an initial byte-order mark character (U+FEFF), the encoding automatically declares itself to any processing application. Processing applications need only look for an initial 0x0000FEFF or 0xFEFF in the byte stream to identify the document as UTF-32 or UTF-16 encoded respectively. No additional metadata mechanisms are required for these encodings since the byte-order mark includes all of the information necessary for processing applications. In most circumstances the byte-order mark character is handled by editing applications separately from the other characters, so there is little risk of an author removing or otherwise changing the byte-order mark to indicate the wrong encoding (as can happen when the encoding is declared in English/Latin script). If the document lacks a byte-order mark, the fact that the first non-blank printable character in an HTML document is supposed to be < (U+003C) can be used to determine a UTF-8/UTF-16/UTF-32 encoding.
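A minimal sketch of this kind of sniffing in Python is shown below. The function name sniff_encoding and the two lookup tables are hypothetical; the BOM constants come from the standard codecs module. The sketch goes slightly beyond the text in two labelled ways: it also recognises the UTF-8 signature (EF BB BF) and the byte-swapped (little-endian) forms of the marks. It assumes the document does not begin with whitespace.

```python
import codecs
from typing import Optional

# Longer marks first: the UTF-32 little-endian mark begins with the
# same two octets as the UTF-16 little-endian mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),  # not mentioned in the text above
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Octet patterns of a leading "<" (U+003C) in each encoding form.
LESS_THAN = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c", "utf-16-be"),
    (b"\x3c\x00", "utf-16-le"),
    # A single-octet "<" is consistent with UTF-8 but does not rule
    # out other ASCII-compatible encodings.
    (b"\x3c", "utf-8"),
]


def sniff_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding from a byte-order mark, else from a leading '<'."""
    for table in (BOMS, LESS_THAN):
        for prefix, name in table:
            if data.startswith(prefix):
                return name
    return None


sample = codecs.BOM_UTF16_BE + "<html>".encode("utf-16-be")
print(sniff_encoding(sample))  # utf-16-be
```

Checking the four-octet patterns before the two-octet ones matters in both tables, since the shorter patterns are prefixes of some of the longer ones.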

The above information uses material from Wikipedia and is licensed under the GNU Free Documentation License.
This page was last archived by our server on Sun Sep 5 17:11:44 2010.