HTML (Hypertext Markup Language) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.
Specifying the document's character encoding
There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:[1]
Content-Type: text/html; charset=ISO-8859-1
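As a rough illustration of how a receiver might read such a header, the sketch below uses Python's standard-library email.message.Message class to extract the charset parameter. The helper name charset_from_content_type is made up for this example, and real user agents apply more elaborate rules.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None when
    # the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The second call shows the case discussed later in this section: a header that names the media type but makes no statement about the encoding.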
For HTML (but not in XHTML) it is possible to include this information inside the head element near the top of the document:[2]
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML5 also allows the following syntax to mean exactly the same:[2]
<meta charset="utf-8">
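Both meta forms can be recognised by a small scanner. The following is only a sketch: it works on markup that has already been decoded to text, whereas a browser has to inspect raw bytes before it knows the encoding. The names MetaCharsetFinder and find_meta_charset are invented for the example.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first encoding declared by a meta element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: the charset is a parameter inside the content attribute.
            msg = Message()
            msg["Content-Type"] = attributes.get("content") or ""
            self.charset = msg.get_content_charset()


def find_meta_charset(markup):
    """Return the declared encoding in lower case, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(markup)
    return finder.charset


print(find_meta_charset('<head><meta charset="utf-8"></head>'))
print(find_meta_charset(
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
))
```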
XHTML documents have a third option: to express the character encoding via an XML processing instruction, as follows:[3]
<?xml version="1.0" encoding="ISO-8859-1"?>
A known misconception about <meta http-equiv="Content-Type"> is that the meta element is intended to be interpreted directly by a browser, like an ordinary HTML tag. According to the World Wide Web Consortium (W3C), it helps the HTTP server[4] generate some headers when it serves the document. Under the HTTP/1.1 specification, the Content-Type header for an HTML document must label an appropriate encoding;[5] a missing charset= parameter results in acceptance of ISO-8859-1 (so HTTP/1.1 formally does not offer such an option as an unspecified character encoding), and this specification supersedes all HTML (or XHTML) meta element declarations. This can pose a problem if the server generates an incorrect header and one does not have the access or the knowledge to change it.
As each of these methods explains to the receiver how the file being sent should be interpreted, it would be inappropriate for these declarations not to match the actual character encoding used. Because a server usually can't know how a document is encoded, especially if documents are created on different platforms or in different regions, many servers[citation needed] simply do not include a reference to the "charset" in the Content-Type header, thus avoiding making false promises. However, if the document does not specify the encoding either, this may result in the equally bad situation where the user agent displays mojibake because it cannot find out which character encoding was used. Because the HTTP charset= parameter is so widely and persistently neglected on the server side, the W3C became disappointed with HTTP/1.1's strict approach[6] and encourages browser developers to use some fixes in violation of RFC 2616.
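The difference between the strict reading and the more forgiving behaviour can be sketched as two small functions. This is one interpretation of the paragraphs above, not the text of any specification, and the function names are hypothetical. Each argument is either an encoding label or None.

```python
def strict_charset(header_charset, meta_charset):
    """Strict HTTP/1.1 reading: the header wins, and a missing charset
    parameter means ISO-8859-1, so the meta element is never consulted."""
    if header_charset:
        return header_charset.lower()
    return "iso-8859-1"


def lenient_charset(header_charset, meta_charset):
    """Forgiving variant: use the header if it names a charset, otherwise
    the document's own declaration, otherwise report 'unknown' as None."""
    for candidate in (header_charset, meta_charset):
        if candidate:
            return candidate.lower()
    return None


# A UTF-8 page served without a charset parameter in the header:
print(strict_charset(None, "utf-8"))
print(lenient_charset(None, "utf-8"))
```

Under the strict variant the page's own declaration is ignored; under the lenient one it is honoured, and a result of None is the point where a user agent has to guess.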
If a user agent reads a document with no character encoding information, it can fall back to using some other information. For example, it can rely on the user's settings, either browser-wide or specific for a given document, or it can pick a default encoding based on the user's language. For Western European languages, it is typical and fairly safe to assume Windows-1252, which is similar to ISO-8859-1 but has printable characters in place of some control codes. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In CJK environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an incorrect charset label manually as well.
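The effect of guessing can be seen directly with Python's built-in codecs. The byte strings below are made-up samples: the first shows where Windows-1252 and ISO-8859-1 disagree, the second shows what happens when UTF-8 bytes are read under a Western European fallback.

```python
# Bytes 0x93 and 0x94 are curly quotation marks in Windows-1252,
# but control codes in ISO-8859-1.
raw = b"\x93caf\xe9\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(as_cp1252)
print(repr(as_latin1))

# UTF-8 bytes decoded with the wrong single-byte fallback:
utf8_bytes = "caf\u00e9".encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(misread)
```

The ASCII letters survive every one of these decodings, which is why the problem is easy to miss on English-only pages.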
It is increasingly common for multilingual websites and websites in non-Western languages to use UTF-8, which allows use of the same encoding for all languages. UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
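The efficiency point is easy to check by encoding a small fragment in each form. The helper encoded_sizes and the sample markup are invented for this sketch; the little-endian codec names are used only so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text under each Unicode encoding form."""
    return {
        codec: len(text.encode(codec))
        for codec in ("utf-8", "utf-16-le", "utf-32-le")
    }


# Mostly ASCII markup around five Greek letters.
markup = '<p lang="el">\u03bb\u03cc\u03b3\u03bf\u03c2</p>'
sizes = encoded_sizes(markup)

for codec, size in sizes.items():
    print(codec, size)
```

Because the tags and attribute names are ASCII, UTF-8 spends one byte on each of them, while UTF-16 and UTF-32 spend two and four bytes respectively.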
Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
Character references
Main articles: character entity reference and numeric character reference
In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.
HTML character references
Numeric character references can be in decimal format, &#DD;, where DD is a variable number of decimal digits. Similarly there is a hexadecimal format, &#xHHHH;, where HHHH is a variable number of hexadecimal digits. Hexadecimal character references are case-insensitive in HTML. For example, the character 'λ' can be represented as &#955;, &#x03BB; or &#X03bb;. Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
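A short sketch with Python's standard html module shows the decimal and hexadecimal forms and the Windows-1252 compatibility behaviour. The two helper functions are hypothetical names introduced for the example. html.unescape follows HTML5 parsing rules, so it is an instance of the lenient behaviour described above rather than a validator.

```python
import html


def to_decimal_reference(char):
    """Build the decimal numeric character reference for one character."""
    return f"&#{ord(char)};"


def to_hex_reference(char):
    """Build the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(char):04X};"


lam = "\u03bb"  # Greek small letter lambda
print(to_decimal_reference(lam))
print(to_hex_reference(lam))

# All three spellings decode to the same character in HTML.
for reference in ("&#955;", "&#x03BB;", "&#X03bb;"):
    print(reference, "->", html.unescape(reference))

# A reference into the forbidden 80-9F range is mapped through
# Windows-1252 by this parser, giving the trade mark sign U+2122.
print(html.unescape("&#153;"))
```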
Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string. For example, 'λ' can also be encoded as &lambda; in an HTML document. The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup. This notably does not include XML's &apos; (') entity. For a list of all named HTML character entity references, see List of XML and HTML character entity references (approximately 250 entries).
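Python's standard library ships the HTML 4 entity table as html.entities.name2codepoint, which makes the case sensitivity and the markup-delimiting escapes easy to try out. This is a sketch for illustration; the sample strings are made up.

```python
import html
from html.entities import name2codepoint

# Names are case-sensitive: lambda and Lambda are different characters.
print(name2codepoint["lambda"], name2codepoint["Lambda"])
print(html.unescape("&lambda;"), html.unescape("&Lambda;"))

# The HTML 4 table has no entry for XML's apos entity.
print("apos" in name2codepoint)

# Escaping the characters that delimit markup.
escaped = html.escape('<a href="x">Q&A</a>')
print(escaped)
```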
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for the markup-delimiting characters mentioned above, and for a few special characters (or not at all if a native Unicode encoding like UTF-8 is used).
XML character references
Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:[7]
- &amp; → & (ampersand, U+0026)
- &lt; → < (less-than sign, U+003C)
- &gt; → > (greater-than sign, U+003E)
- &quot; → " (quotation mark, U+0022)
- &apos; → ' (apostrophe, U+0027)
All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.
However, use of &apos; in XHTML should generally be avoided for compatibility reasons. &#39; or &#x0027; may be used instead.
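The stricter XML rules can be observed with a stock XML parser. The sketch below uses Python's xml.etree.ElementTree, which is a non-validating parser and reads no DTD here, so only the five predefined entities are known to it. The helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(document):
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<p>&amp; &lt; &gt; &quot; &apos;</p>"))  # predefined entities
print(is_well_formed("<p>&eacute;</p>"))   # HTML entity, not defined in plain XML
print(is_well_formed("<p>&#xE9;</p>"))     # lowercase x
print(is_well_formed("<p>&#XE9;</p>"))     # uppercase X

# saxutils.escape handles the ampersand and the angle brackets by default.
print(escape("AT&T <rocks>"))
```

An XHTML document gets its larger entity set from its DTD, which is why the same &eacute; reference is acceptable there.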
&amp; has the special problem that it starts with the character to be escaped. A simple Internet search finds thousands of sequences &amp;amp;amp;amp; ... in HTML pages for which the algorithm to replace an ampersand by the corresponding character entity reference was applied too often.
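The runaway sequence is easy to reproduce: every extra escaping pass turns the ampersand of the previous reference into a new reference. A minimal sketch using Python's html module, with a made-up sample string:

```python
import html

original = "Fish & Chips"

once = html.escape(original)
twice = html.escape(once)
thrice = html.escape(twice)

for text in (once, twice, thrice):
    print(text)

# Each unescaping pass undoes exactly one escaping pass.
print(html.unescape(twice) == once)
```

Only the first result is correct markup for the original text. The usual remedy is to escape at a single, well-defined point, typically when the text is written into the page.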
References
- Fielding, R.; Gettys, J.; Mogul, J.; Frystyk, H.; Masinter, L.; Leach, P.; Berners-Lee, T. (June 1999), "Content-Type", Hypertext Transfer Protocol – HTTP/1.1, IETF, http://tools.ietf.org/html/rfc2616#section-14.17, retrieved 8 March 2010
- Hickson, I. (5 March 2010), "Specifying the document's character encoding", HTML5, WHATWG, http://www.whatwg.org/html/#charset, retrieved 8 March 2010
- Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Processing Instructions", XML, W3C, http://www.w3.org/TR/REC-xml/#sec-pi, retrieved 8 March 2010
- The global structure of an HTML document: The META element
- RFC 2616 3.7.1 Canonicalization and Text Defaults
- HTML 4, HTML Document Representation: Specifying the character encoding
- Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Character and Entity References", XML, W3C, http://www.w3.org/TR/REC-xml/#sec-references, retrieved 8 March 2010
External links
- Character entity references in HTML4
- The Definitive Guide to Web Character Encoding
- Character Encoder
- (X)HTML Entities or Special Characters simple Reference
