Word to Markdown using Pandoc


Markdown has become the de facto standard for writing software documentation. This post discusses converting Word documents to Markdown using Pandoc.

If you haven’t already, install Pandoc. Word documents need to be in the docx format; Pandoc does not support legacy binary doc files.

Pandoc supports several flavors of Markdown (md), such as the popular GitHub Flavored Markdown (GFM). To produce a standalone GFM document from a docx file, run:

pandoc -t gfm --extract-media . -o file.md file.docx

The --extract-media option tells Pandoc to extract media to a ./media folder. Every image embedded in the Markdown output links to a file in that folder.
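If you have a folder full of docx files, the same command can be scripted. The following is a minimal Python sketch, not part of Pandoc itself: the function names `pandoc_args` and `convert_all` are hypothetical, and using a separate media folder per document (instead of `.`) is an assumption made here so that images from different documents are less likely to overwrite each other.

```python
import subprocess
from pathlib import Path


def pandoc_args(docx_name, media_dir="."):
    """Build the argument list for the docx-to-GFM command shown above."""
    md_name = str(Path(docx_name).with_suffix(".md"))
    return [
        "pandoc",
        "-t", "gfm",
        "--extract-media", media_dir,
        "-o", md_name,
        docx_name,
    ]


def convert_all(folder):
    """Convert every .docx file in `folder`; requires pandoc on the PATH."""
    for docx_path in sorted(Path(folder).glob("*.docx")):
        # Run inside the folder so the media links written by Pandoc stay
        # relative to the generated Markdown file. The per-document media
        # directory is an assumption of this sketch, not a Pandoc default.
        args = pandoc_args(docx_path.name, media_dir=docx_path.stem)
        subprocess.run(args, cwd=folder, check=True)
```

Calling `pandoc_args("file.docx")` reproduces the exact command line above as a list of arguments, which is the form `subprocess.run` expects.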

Generating the Markdown document is only the first step. If you’re happy with the output, you can stop here. The rest of this post discusses additional changes that make the document easier to maintain, and easier to read in HTML renderers such as GitHub’s markup.

Markdown Editor

You’ll need a text editor to edit an md file. I use Visual Studio Code (Code), which has built-in support for editing and previewing Markdown files. I also use a few additional plugins that make editing Markdown files more productive.

Tables

Pandoc renders tables whose cells contain a single line of text (wrapped or not) using the pipe table syntax. Column text alignment is not rendered, so you’ll have to add it back manually.
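Adding alignment back means editing the delimiter row of each pipe table: in GFM, `:---` aligns a column left, `:---:` centers it, and `---:` aligns it right. As a rough sketch, assuming the table is held in a string with the delimiter row on its second line, a small Python helper can rewrite that row. The name `set_alignment` and the alignment labels are hypothetical choices for this example.

```python
import re

# Matches one delimiter cell of a pipe table, e.g. "---", ":---" or "---:".
DELIMITER_CELL = re.compile(r"^\s*:?-+:?\s*$")

# GFM alignment markers for the delimiter row.
MARKERS = {"left": ":---", "center": ":---:", "right": "---:", "default": "---"}


def set_alignment(table, alignments):
    """Return `table` with its delimiter row rewritten to the given alignments."""
    lines = table.splitlines()
    if len(lines) < 2:
        raise ValueError("a pipe table needs a header row and a delimiter row")
    cells = lines[1].strip().strip("|").split("|")
    if not all(DELIMITER_CELL.match(cell) for cell in cells):
        raise ValueError("second line is not a pipe-table delimiter row")
    if len(cells) != len(alignments):
        raise ValueError("one alignment per column is required")
    lines[1] = "| " + " | ".join(MARKERS[a] for a in alignments) + " |"
    return "\n".join(lines)
```

For example, `set_alignment(table, ["left", "right"])` on a two-column table leaves the header and body rows untouched and changes only the delimiter row.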

Tables whose cells have complex data such as lists and multiple lines are rendered in the HTML table syntax. It is not unusual for tables with complex layouts such as merged cells to be missing columns. Review all tables carefully. I suggest simplifying complex tables in the original Word document before conversion.

Small pipe and HTML tables are relatively easy to edit by hand. Editing large tables can quickly become cumbersome. Markdown editors such as Typora support visual editing of pipe tables. Typora does not support HTML tables.

Table of Contents

Pandoc dumps the table of contents (TOC) of the original docx as plain text, one line per topic. I suggest deleting that TOC and generating a hyperlinked one with the Markdown TOC plugin for Code.

The plugin can also add, update, or remove section numbering. If the Word document has cross-references that use section numbers, this keeps the document consistent, at least for the moment. In the long term, I suggest avoiding section numbers and replacing textual cross-references with intra-document hyperlinks. The TOC generated by Markdown TOC shows intra-document hyperlinking in action.

Another option is to let Pandoc number sections (-N option) and render table of contents automatically (--toc option), when rendering to HTML or PDF.
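To make the idea of a hyperlinked TOC concrete, here is a minimal Python sketch of what such a generator does. It is not the plugin's implementation: `slugify` only approximates how GitHub derives heading anchors (lowercase, punctuation removed, spaces turned into hyphens), it does not handle duplicate headings, and both function names are hypothetical.

```python
import re

# ATX heading: one to six '#' characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def slugify(title):
    """Approximate a GitHub heading anchor for `title`."""
    slug = title.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)  # drop punctuation
    return slug.replace(" ", "-")


def build_toc(markdown):
    """Return a nested bullet list linking to every heading in `markdown`."""
    toc = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code blocks
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2)
            indent = "  " * (level - 1)
            toc.append(f"{indent}- [{title}](#{slugify(title)})")
    return "\n".join(toc)
```

Each entry is an ordinary intra-document link of the form `[Render PDF](#render-pdf)`, which is the same syntax you can use to replace textual cross-references.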

Images

Images are exported in their native format and size. They are inserted in the document using the ![caption](path) GFM syntax, or the img tag within HTML tables. Image size cannot be customized in GFM syntax, so you may need to resize the image files themselves to get a consistent size.
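An alternative to resizing the files, not covered above, is to swap the GFM image syntax for an HTML img tag with a width attribute, since GFM allows inline HTML. The following Python sketch shows the substitution under some assumptions: the function name `to_img_tags` is hypothetical, the pattern only handles simple images without titles or spaces in the path, and you should check that your renderer honors the width attribute.

```python
import re
from html import escape

# Matches ![caption](path) where the path has no spaces or title part.
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def to_img_tags(markdown, width):
    """Replace GFM image syntax with <img> tags of a fixed pixel width."""

    def replace(match):
        caption, path = match.group(1), match.group(2)
        # Escape both values so quotes or '&' cannot break the attribute.
        return f'<img src="{escape(path)}" alt="{escape(caption)}" width="{width}">'

    return IMAGE.sub(replace, markdown)
```

The image files stay untouched, so you can revert to the GFM syntax later if the document moves to a renderer that strips HTML.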

Diagrams

Pandoc is unable to render diagrams created using figures and shapes available in Word. You’ll need to recreate those by screen grabbing the output rendered by Word. You can also use mermaid.js syntax to recreate diagrams such as flowcharts and message sequence charts, embed them in the Markdown document, and render using mermaid-filter.

At the time of writing, GitHub doesn’t render mermaid diagrams, but Code can render them with the help of the Mermaid Preview plugin, and so can GitLab from version 10.3 onward.

Render PDF

To render a PDF using Pandoc, run:

pandoc file.md -f gfm -F mermaid-filter -o file.pdf --toc -N

Remove the -F mermaid-filter option if your document does not have any mermaid diagrams.

I noticed several problems with tables in the rendered PDF. Pipe tables with long lines are not wrapped and stretch beyond the page. HTML tables are not rendered at all. To fix these problems, you may need to edit the text in the tables, use a custom LaTeX template, or use a different Markdown format with support for grid or multiline tables.

If you want to render HTML instead, change the extension of the output file from pdf to html:

pandoc file.md -f gfm -o file.html

Large Documents

Pandoc can handle large documents that run to hundreds of pages. You may want to break a large document into separate Markdown files for maintainability. Users may have to wait a long time to preview a large document online, for example on GitHub or GitLab, and previewing may fail entirely on big and complex documents.

Pandoc can render multiple Markdown files into a single output:

pandoc section-1.md section-2.md -f gfm -o file.pdf --toc -N
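Breaking a converted document into files like section-1.md and section-2.md can be automated. The Python sketch below assumes each section starts at a level-1 heading (a line beginning with `# `) and that any text before the first heading belongs to the first file. Both function names are hypothetical, and writing the chunks to disk is left out.

```python
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_by_top_heading(markdown):
    """Split a Markdown document into chunks that each start at a '# ' heading."""
    chunks, current, in_fence = [], [], False
    for line in markdown.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code blocks are not headings
        if line.startswith("# ") and not in_fence and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def section_filenames(chunks, prefix="section"):
    """Name the chunks section-1.md, section-2.md, and so on."""
    return [f"{prefix}-{i}.md" for i in range(1, len(chunks) + 1)]
```

Joining the chunks back together reproduces the original text, so the split loses nothing, and the generated file names can be passed to Pandoc in order as in the command above.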

Regular Expressions

Regular expressions make bulk search and replace operations much faster.

Some useful regular expressions:

#+\s*$     search empty headings
\s+$       search lines with trailing spaces
\b\s\s+\b  search repeated space between words
\|.*\|     search through all rows of pipe tables
section\s+(?!(\d+\.*\d*?){1,})
           search for cross-references starting with section but missing section number
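
The expressions above work in an editor's search box, and they can also be run over a whole file from a script. Here is a small Python sketch that applies each one line by line and reports which lines match. The patterns are copied verbatim from the list; the labels, the `scan` function, and the sample text are hypothetical and exist only for this example.

```python
import re

# Patterns copied verbatim from the list above; the keys are labels
# invented for this sketch.
PATTERNS = {
    "empty-heading": re.compile(r"#+\s*$"),
    "trailing-space": re.compile(r"\s+$"),
    "repeated-space": re.compile(r"\b\s\s+\b"),
    "pipe-table-row": re.compile(r"\|.*\|"),
    "section-without-number": re.compile(r"section\s+(?!(\d+\.*\d*?){1,})"),
}


def scan(text):
    """Return (line_number, label) pairs for every pattern that matches a line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits


SAMPLE = "\n".join([
    "# Title",
    "##",
    "Some text with trailing space ",
    "two  spaces between words",
    "| a | b |",
    "see section 4.2 for details",
    "see section below for details",
])

for number, label in scan(SAMPLE):
    print(f"line {number}: {label}")
```

Note that "see section 4.2" is not reported, because the negative lookahead in the last pattern rejects a section followed by a number.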