html2text Questions and Answers

How can I use html2text to export HTML table data to a comma-separated text document (CSV)?

I'm sorry, html2text can't do this: it renders tables as tables and not as CSV (i.e. it uses spaces, not tabs, to lay out tables).
Apart from messing around with sed, awk, and html2text in a pipe, the best thing you could probably do is to load the HTML document into OpenOffice.org (StarOffice) Writer, select the table, copy it to the clipboard, open an OpenOffice.org spreadsheet (Calc), and paste the data from the clipboard. Then export the spreadsheet to CSV or whatever format you need.
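
If a scripted route is preferred over the office suite, the table cells can also be pulled out directly with a small program. The following is only a rough sketch using the Python standard library; it is not part of html2text, and the names TableToRows and html_table_to_csv are made up for this example. It assumes every row and cell has an explicit closing tag, and it ignores nested tables, rowspan, and colspan.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the cell text of every <tr> in the document as a list of rows."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside a cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Calling html_table_to_csv() on the contents of an HTML file returns the CSV text, with fields quoted where they contain commas.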


Could you add some URL handling features? I want to be able to specify that URLs are rendered as plain text, i.e. that

<A href="http://www.google.com/">Google</A>

reads

http://www.google.com/

instead of

Google

Think of html2text as a filter. It aims to show you what you would see if you loaded the file in a graphical browser, without being a browser.
The attributes of HTML elements are not interpreted, with the exception of "IMG ALT", which is used to give a good substitute for images that cannot be represented in plain-text media. While I understand that it might be desirable (e.g. for Wiki source code pre-processing) to have the "A HREF" attribute contents displayed verbatim, this would completely break with the idea of a filter. html2text is expected to ignore markup unless it carries structural information (so-called logical markup; think of headings, lists and so on), and "A" does not.
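
Because html2text is a filter, this kind of rewriting can be done in a separate step before the HTML reaches it. The sketch below is one possible pre-processing step in Python, not a feature of html2text; the name links_to_urls is invented for the example. It is deliberately naive: it assumes quoted href values and anchors that are not nested, and anchors without an href are left alone.

```python
import re

# Matches <a ... href="..."> ... </a>, case-insensitively and across line
# breaks. Only quoted href values are recognised (an assumption of this sketch).
ANCHOR = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>.*?</a\s*>',
    re.IGNORECASE | re.DOTALL,
)


def links_to_urls(html):
    """Replace each hyperlink element with the value of its href attribute."""
    # The href value is already HTML-escaped, so it can be inserted as-is.
    return ANCHOR.sub(lambda match: match.group(2), html)
```

The rewritten HTML can then be passed to html2text as usual, which will print the URL where the link text used to be.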


Can I compile html2text under Cygwin?

Sure. First, make sure you have:

  1. Cygwin bash -- required
  2. Cygwin gzip/gunzip -- required
  3. Cygwin tar -- required
  4. Cygwin GCC GNU C/C++ compiler -- required
  5. Cygwin man -- optional, if you want to see the man pages

The steps to compile html2text are as follows:

  1. Start a Cygwin environment shell by executing c:\cygwin\cygwin.bat from the Start Menu (all the remaining commands must be issued in this shell)
  2. Make a work directory: mkdir /cygdrive/c/work/html2text
  3. Move the downloaded file "html2text-1.3.1.tar.gz" to "/cygdrive/c/work/html2text". You can do this using the mv command from Cygwin or with Windows Explorer.
  4. In the Cygwin shell, change to our new working directory: cd /cygdrive/c/work/html2text
  5. Now unzip the downloaded package: gzip -d html2text-1.3.1.tar.gz
  6. Once unzipped, it must be untarred: tar -xvf html2text-1.3.1.tar
  7. Change working directory to ./html2text-1.3.1: cd html2text-1.3.1
  8. Run the configure script: ./configure. This will check your environment for a C compiler and other tools, and write out a new "Makefile".
  9. Provided there were no fatal errors in the above step, you should find a new "Makefile"; ls -l should reveal it.
  10. Now simply run this Makefile: make
  11. make will print a lot of messages from the C compiler. When it is done, you should see a new "html2text.exe" file (use ls -l to check its date).
  12. Using ls -l you should find three files "html2text.exe", "html2text.1.gz", and "html2textrc.5.gz".
  13. Move "html2text.exe" to "/usr/local/bin" (note you may need to create this directory tree first): mv html2text.exe /usr/local/bin
  14. Only if you have "man" installed, move "html2text.1.gz" to "/usr/local/man/man1" (note you may need to create this directory tree first): mv html2text.1.gz /usr/local/man/man1
  15. Only if you have "man" installed, move "html2textrc.5.gz" to "/usr/local/man/man5" (note you may need to create this directory tree first): mv html2textrc.5.gz /usr/local/man/man5
  16. To do a sanity check, execute: html2text -help. If you have Cygwin man installed, also check the man pages with man html2text and man html2textrc.
  17. Now you can use html2text within the Cygwin environment, e.g.: html2text -o mytextfile.txt myhtmlfile.html

I am trying to compile the program under Cygwin (or under Mac OS) and get an error. ./configure runs fine and most of the compilation runs correctly.

Try changing the Makefile line that reads

CXX = CC

into

CXX = g++

According to the manual, "html2text will not follow redirections (HTTP 301/307). Proxy servers are not supported." This makes html2text's URL handling of very limited use. Why not just remove all URL fetching code, and rely on cat/lynx to feed HTML files to html2text?

As already stated in the documentation, the HTTP implementation in html2text is rather basic: all it does, more or less, is issue a "GET" request. It's more a gimmick than a core function. But that is not sufficient reason to remove it completely and disappoint all the other users who might find it useful. A corresponding compile-time option would probably be the best solution, but it would require a complete rewrite of the configure script.
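
For documents that sit behind a redirect, the approach suggested in the question (fetch the page with another tool and feed it to html2text) works today. As an illustration only, here is a small Python sketch of that idea; the function names are invented for this example. It relies on two assumptions: that urllib's default opener follows HTTP redirects, which it does in current Python versions, and that html2text reads the document from standard input when no input file or URL is given.

```python
import subprocess
import urllib.request


def fetch_html(url, timeout=30):
    """Return the raw bytes at `url`; urllib follows HTTP redirects by default."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def render_with_html2text(html_bytes, program="html2text"):
    """Pipe HTML into html2text on standard input and return its output bytes.

    `program` is assumed to be an html2text binary found on the PATH.
    """
    result = subprocess.run(
        [program], input=html_bytes, stdout=subprocess.PIPE, check=True
    )
    return result.stdout
```

Passing the result of fetch_html() to render_with_html2text() gives the rendered plain text as bytes.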


I am trying to convert new elements above and beyond the basic html2textrc I was given for DokuWiki. Do you have a guide that shows all of the HTML elements that can be converted and how they work?

Please refer to the html2textrc(5) manual page. It's exhaustive. Note that you don't have to set all of the options, nor do you have to write an html2textrc file at all, as reasonable defaults will apply in either case.


How can I use html2text with HTML documents in UTF-8 encoding?

Yury Semenov very kindly wrote a patch that adds support for UTF-8 encoding to html2text. It can be found in the program's downloads directory. The patched version of html2text will assume UTF-8 encoding for both input and output if the -utf8 command line option is set.
This patch still has some limitations, though: there are no checks for errors in UTF-8 codes, nor for multi-column characters, and the array of entities is incomplete. However, ancient Greek will be rendered quite nicely.
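
Since the patch does not check for malformed UTF-8, it may be worth validating a document before handing it to html2text. The snippet below is a minimal sketch of such a check in Python; it is independent of html2text and the patch, and the function name first_utf8_error is made up for this example.

```python
def first_utf8_error(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None
```

The function takes the raw bytes of the document. A return value of None means the input decoded cleanly as UTF-8.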