Known problems with html2text

Bugs of minor severity

1. When parsing deeply nested tables, html2text's runtime grows out of control.

This problem occurs with very complex tables that contain more than about 25 nested table elements, because the runtime increases exponentially with each additional level of nesting.


2. Whitespace at the start or end of a font element's content is not displayed by html2text. Only spaces enclosed by non-space characters inside the element survive.

$ echo 'Lorem<font color="#004080"> </font>Ipsum' | html2text -nobs
LoremIpsum
$ echo 'Lorem<font color="#004080"> Ipsum</font>' | html2text -nobs
LoremIpsum
$ echo '<font color="#004080">Lorem </font>Ipsum' | html2text -nobs
LoremIpsum
$ echo '<font color="#004080">Lorem Ipsum</font>' | html2text -nobs
Lorem Ipsum

3. html2text segfaults on amd64

The problem is in the usage of get_attribute, which is a variable-argument function. The function checks for a NULL (char *) argument to terminate processing. Callers used 0 to mark the end of the list, which fails on architectures where int is not the same size as (char *), because a bare 0 is passed as an int. Callers should use NULL when they mean NULL.
C++ blurs the difference between 0 and NULL much more than C does, but in a variable-argument function call there is still a difference.

Larry Doolittle very kindly wrote a patch that fixes this. It can be found in the program's downloads directory.


4. The last character of the Bulgarian alphabet, which has character code 255, is not displayed.

Try reading into an unsigned char instead of a char by changing line 365 of the file urlistream.C so that it reads:

int
urlistream::get()
{
  unsigned char ch;
  int ret = ::read(fd_, &ch, 1);
  return (ret > 0 ? ch : -1);
}

5. html2text is too strict about the dl tag.

html2text does not output any data if the input contains a <dl> tag without its matching </dl> closing tag.

$ echo '<dl>Lorem ipsum' | html2text -nobs
$ echo '<dl>Lorem ipsum</dl>' | html2text -nobs
Lorem ipsum

6. Warnings about 'auto_ptr' not being defined.

Try changing line 56 of the file html.h so that it reads:

#ifdef AUTO_PTR_BROKEN /* { */
#  define auto_ptr broken_auto_ptr
#  include <memory>
#  undef auto_ptr
#  include "libstd/include/auto_ptr.h"
#else /* } { */
#  include <memory>
using std::auto_ptr;
#endif /* } */

7. File retrieval from remote servers with virtual hosts does not work properly.

With virtual hosts, the request has to name the host it is addressed to. html2text would need to send GET http://example.com/index.html, not just GET /index.html.

Feature requests

Currently, when the -width option is specified, everything in the output is wrapped to fit in the specified width. This is fine for free-form text, but doesn't work well for material formatted with <pre>...</pre>.
It may be a good idea to add a new feature, implemented as a command line option or rc file setting, that tells html2text to leave preformatted text alone.