Hackipedia's file type and naming convention
by Jonathan Campbell

Hackipedia deals with documentation in several different formats,
primarily PDF, HTML, and plain ASCII text. To help clarify the file
type involved, simple naming rules are applied to the file name.

2010/10/31 - A new policy has been put into effect by Hackipedia
editors to remove the original DOC, RTF, and non-UTF-8 files from
the site to reduce disk space usage. We never delete files, however;
the originals are kept in our archives. If a converted copy is
erroneous or inferior to the original, please let us know and we
will restore the original file from our archives and consider
re-doing the conversion. To make it clear that a file is converted,
we will continue to use the file extension system described below,
even if we do not have the original file on-site.

======================== HTML documents ============================

If the document is HTML, the file extension is .htm or .html.
If the HTML also contains images and JavaScript, those supporting
files shall be stored in a subdirectory named by taking the HTML
document's name and adding _files.

   HTML document:    funny.htm
   HTML contents:    funny_files/*

Most modern web browsers, when asked to save an entire web page,
will generate this combined structure for you.

Hackipedia will generally store the document as-is on site. However,
if the document contains unnecessary or annoying JavaScript, the JS
files will be altered to remove it or deleted entirely. Finally, if
the author determines that the extra files are unnecessary, the
HTML document and folder contents will be altered to remove them.

========================= plain ASCII text =========================

"Plain text" is actually something of a misnomer: most textual
documents I've collected are really extended ASCII in one code page
or another. When the code page is known, the text document is
renamed to indicate that code page.

   CP437 (MS-DOS extended ASCII)            *.cp437.txt
   Real plain ASCII (no extended chars)     *.ascii.txt
   Windows ISO-8859-1                       *.iso8859-1.txt
   UTF-8                                    *.utf-8.txt
   Shift-JIS                                *.shift-jis.txt

To avoid on-site issues, all non-UTF-8 documents are converted to
UTF-8 and posted on-site in that format. As long as your web browser
understands UTF-8 and Unicode, the document will display properly.

========================= VT100 console art ========================

Some documents added to the collection were named as if they were
text, but are in fact VT100 console art. If you open one in a text
editor you will see a semi-incoherent jumble of escape codes. Such
files are intended to be played out to a terminal that interprets
the escape codes. It can be any terminal, be it your PuTTY
SSH/Telnet connection, the Linux console, an xterm window, etc.:
anything that interprets VT100 terminal control codes.

Such files are given the file extension *.vt100, with the character
set encoding as part of the file extension. For web viewing
purposes, the VT100 codes are also translated to HTML, named
*.vt100.htm. Such files are also commonly referred to as ASCII/ANSI
art.

   Plain ASCII                              *.ascii.vt100
   CP437 (Intended for MS-DOS ANSI.SYS)     *.cp437.vt100
   UTF-8                                    *.utf-8.vt100

Some ASCII art was designed to be played back over a slow modem
connection (like 9600 baud) to produce animation on a terminal.
Such files need additional treatment for conversion to the web, so
they are given the file extension *.vt100mp (mp = motion picture).

   Plain ASCII                              *.ascii.vt100mp
   CP437 (Intended for MS-DOS ANSI.SYS)     *.cp437.vt100mp
   UTF-8                                    *.utf-8.vt100mp

Some of them are NOT animations, but they make use of screen
clearing and redrawing commands and are apparently intended to
present one page after another. These are designated *.vt100mup
(mup = multi user page).

   [you know the drill by now...]           *.*.vt100mup

=============================== PDF ================================

Where possible, the documentation is retained in PDF format.
PDF documents, when properly made, are very high quality renditions
of the documentation and are readable everywhere.

The only problem with PDF is that reading them requires a graphical
desktop. So Hackipedia editors will also use the pdftotext program
to generate a plain ASCII text version, which in most cases is
readable in console environments where a PDF reader is not
available.

======================= Word document (DOC) ========================

If a document was written with Microsoft Word, the document is
retained in that format in case any conversions are incorrect.

There is, however, a complication with the *.doc extension: *.doc
can actually mean one of several file types.

In some cases, a *.doc file is just plain ASCII text. Those are
renamed to .txt.

In other cases, a *.doc file is really just an RTF (Rich Text
Format) document. Those are renamed to .rtf.

Anything that actually looks like an OLE compound document with
Word-like contents inside is retained as .doc.

==================== Compressed HTML Help Manuals ==================

Some documentation is stored in Microsoft's HTML Help format
(*.chm). These files are retained in the collection. Each is also
"converted" by extracting its contents to a folder named with the
original name + ".html". Some of the extra binary structures
extracted are deleted. Extracting these contents allows the
document to be read as a standard web page.

   Some manual.chm -> Some manual.chm.html/[various documents within]

===================== Document conversion trail ====================

When a document is converted, the original is never thrown away.
Conversions are never quite perfect. When a conversion is made, the
converted file is given the original file name + the file extension
of the new file type.
For example, a Word document converted to PDF, and then plain text:

   msword.doc
   msword.doc.pdf
   msword.doc.pdf.utf-8.txt

This allows Hackipedia maintainers to know what file type was what
and how it was converted to that format. This also avoids confusion
in situations where a file was received in multiple formats, such
as a Word document that was also packaged with a plain text
version:

   msword2.doc -> msword2.doc
   msword2.txt -> msword2.iso8859-1.txt.utf-8.txt

When placed into Hackipedia, the .doc file was kept as-is, while the
text file was detected to be ISO-8859-1, marked as such, and then
converted to UTF-8.
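
The trail naming rule can be sketched in a few lines of Python. This
is only an illustration and not a tool Hackipedia is known to use:
the function names converted_name and conversion_trail and the
ENCODING_TAGS set are hypothetical, and the parser assumes that the
base name contains no dots and that an encoding tag is always
followed by the extension it qualifies.

```python
# Encoding tags taken from the naming tables in this document.
ENCODING_TAGS = {"ascii", "cp437", "iso8859-1", "utf-8", "shift-jis"}


def converted_name(original, new_ext):
    """Name a converted file: full original name + new extension."""
    return original + "." + new_ext.lstrip(".")


def conversion_trail(name):
    """List each intermediate file name implied by a trail name.

    Assumes the base name (before the first dot) has no dots in it.
    """
    tokens = name.split(".")
    current = tokens[0]
    trail = []
    i = 1
    while i < len(tokens):
        step = tokens[i]
        # An encoding tag and the extension after it form one step,
        # e.g. "utf-8" + "txt" -> "utf-8.txt".
        if step in ENCODING_TAGS and i + 1 < len(tokens):
            step += "." + tokens[i + 1]
            i += 1
        current += "." + step
        trail.append(current)
        i += 1
    return trail


if __name__ == "__main__":
    pdf = converted_name("msword.doc", "pdf")
    txt = converted_name(pdf, "utf-8.txt")
    print(txt)
    for step in conversion_trail(txt):
        print("  " + step)
```

Because every conversion appends to the full previous name rather
than replacing its extension, the trail can be read back from the
final file name alone, which is what conversion_trail relies on.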