Word to Markdown using Pandoc


Markdown has become the de facto standard for writing software documentation. This post discusses converting Word documents to Markdown using Pandoc.


If you haven’t already, install Pandoc. Word documents need to be in the docx format. Legacy binary doc files are not supported by Pandoc.

Pandoc supports several flavors of Markdown (md), such as the popular GitHub Flavored Markdown (GFM). To produce a GFM document from docx, run:

pandoc -t gfm --extract-media . -o file.md file.docx

The --extract-media option tells Pandoc to extract media to a ./media folder. Every embedded image in the Markdown output links to a file in that folder.
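If you have several Word documents to convert, the same command can be generated per file. Below is a minimal Python sketch that only builds and prints the commands. The function name, the sample file names, and the choice of a separate media folder per document (instead of the current folder) are my assumptions, not something Pandoc requires.

```python
import shlex
from pathlib import Path

def pandoc_command(docx_path):
    """Build the docx-to-GFM command for one file (built only, not executed)."""
    docx = Path(docx_path)
    output = docx.with_suffix(".md")
    # Assumption: use a media root named after the document rather than "."
    # so that images from different documents are kept apart.
    media_root = docx.with_suffix("")
    return ["pandoc", "-t", "gfm", "--extract-media", str(media_root),
            "-o", str(output), str(docx)]

for name in ["guide.docx", "notes.docx"]:  # hypothetical file names
    print(shlex.join(pandoc_command(name)))
```

To actually run a command, pass the list to subprocess.run(cmd, check=True) on a machine where Pandoc is installed.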

Generating the Markdown document is only the first step. If you’re happy with the output, you can stop here. The rest of this post discusses additional changes that make the document easier to maintain, and easier to read in HTML renderers such as GitHub’s markup.

Markdown Editor

You’ll need a text editor to edit an md file. I use Visual Studio Code (Code), which has built-in support for editing and previewing Markdown files. I also use a few additional plugins to make editing Markdown files more productive.

Tables

Pandoc will render tables whose cells have a single (wrapped or not) line of text using the pipe table syntax. Column text alignment is not rendered, so you’ll have to add that back manually.
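For reference, alignment in a pipe table is set by colons in the delimiter row directly under the header. Here is a small Python sketch that prints such a row; the helper name and the sample header are made up for illustration.

```python
# Alignment markers used in the delimiter row of a pipe table.
MARKERS = {"left": ":---", "center": ":---:", "right": "---:"}

def delimiter_row(alignments):
    """Build a pipe-table delimiter row to paste under the header row."""
    return "| " + " | ".join(MARKERS[a] for a in alignments) + " |"

header = "| Option | Default | Count |"  # hypothetical table header
print(header)
print(delimiter_row(["left", "center", "right"]))
```

Replacing the delimiter row that Pandoc generated with one like this restores the alignment in GFM renderers.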

Tables whose cells have complex data such as lists and multiple lines are rendered in the HTML table syntax. It is not unusual for tables with complex layouts such as merged cells to be missing columns. Review all tables carefully. I suggest simplifying complex tables in the original docx before conversion.

Both formats are relatively easy to edit by hand, but Markdown editors such as Typora provide support for visually editing pipe tables. Typora does not support HTML tables.

Table of Contents

Pandoc dumps the table of contents (TOC) of the original docx one line per topic. I suggest deleting that TOC and generating a hyperlinked one with the Markdown TOC plugin for Code.

The plugin can also add, update, or remove section numbering. If you have cross-references in the Word document that use section numbers, this will, at least for the moment, give you a consistent document. In the long term, I suggest avoiding section numbers, and substituting textual cross-references with intra-document hyperlinks. See TOC generated by Markdown TOC to see intra-document hyperlinking in action.

Another option is to let Pandoc number sections (-N option) and render table of contents automatically (--toc option).
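To make the idea of a hyperlinked TOC concrete, here is a rough Python sketch that builds one from the headings of a Markdown document. It is not how the Markdown TOC plugin or Pandoc works internally. The slug rule is my approximation of GitHub-style anchors, and it does not skip # lines inside code blocks.

```python
import re

def slug(title):
    """Approximate a GitHub-style heading anchor (an assumption, not the exact rule)."""
    cleaned = re.sub(r"[^\w\s-]", "", title.strip().lower())
    return re.sub(r"\s", "-", cleaned)

def toc(markdown):
    """Return a nested bullet list that links to every heading in the text."""
    entries = []
    for line in markdown.split("\n"):
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if match:
            depth = len(match.group(1)) - 1  # "#" is depth 0, "##" is depth 1, ...
            title = match.group(2)
            entries.append(f"{'  ' * depth}- [{title}](#{slug(title)})")
    return "\n".join(entries)

sample = "# Guide\n\ntext\n\n## Table of Contents\n\n## Render PDF"
print(toc(sample))
```

Each entry is an intra-document hyperlink, so the TOC keeps working when sections are renumbered or reordered.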

Images

Images are exported in their native format and size. They are inserted in the document using the ![caption](path) GFM syntax, or the img tag within HTML tables. Image size cannot be customized in GFM syntax, so you may need to resize the image files themselves to get a consistent size.

Diagrams

Pandoc is unable to render any diagrams created using figures and shapes available in Word. You’ll need to recreate those by screen grabbing the output rendered by Word. You can also use mermaid.js syntax to create diagrams such as flowcharts and message sequence charts, embed them in the Markdown document, and render using mermaid-filter.


GitHub doesn’t yet render mermaid diagrams, but Code is able to render them with the help of the Mermaid Preview plugin, and so is GitLab version 10.3.

Render PDF

To render a PDF using Pandoc, run:

pandoc file.md -f gfm -F mermaid-filter -o file.pdf

Remove the -F mermaid-filter option if your document does not have any mermaid diagrams.

I’ve noticed several problems with rendered tables. Pipe tables with long lines are not wrapped and stretch beyond the page. HTML tables are not rendered. You may need to tweak the text in the table and the LaTeX template used to render the PDF.

If you want to render HTML instead, change the extension of the output file from pdf to html:

pandoc file.md -f gfm -o file.html

Large Documents

Pandoc can handle large documents that have hundreds of pages. You may want to break a large document into separate Markdown files for maintainability. Users may have to wait a long time to preview a large document online, such as at GitHub or GitLab, and previewing may fail entirely on big, complex documents.

Pandoc can render multiple Markdown files into one output:

pandoc section-1.md section-2.md -f gfm -o file.pdf --toc -N
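One way to break a large file apart is to split it at its top-level headings. The Python sketch below is only an illustration: the function name is mine, the section-N.md naming simply mirrors the command above, and a line starting with "# " inside a code block would be mistaken for a heading.

```python
import re

def split_sections(markdown):
    """Split text at level-1 headings; return (file name, content) pairs."""
    # Zero-width split: cut just before every line that starts with "# ".
    parts = re.split(r"(?m)^(?=# )", markdown)
    non_empty = [part for part in parts if part.strip()]
    return [(f"section-{index}.md", part)
            for index, part in enumerate(non_empty, start=1)]

document = "# Intro\nsome text\n## Detail\nmore\n# Usage\nlast part\n"
for name, content in split_sections(document):
    print(name, repr(content))
```

Text before the first heading, if any, ends up in the first file. Writing each pair to disk then gives files you can pass back to Pandoc in order.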

Regular Expressions

Regular expressions significantly speed up bulk search-and-replace operations.

Some useful regular expressions:

#+\s*$                          find empty headings
\s+$                            find lines with trailing whitespace
\b\s\s+\b                       find repeated spaces between words
\|.*\|                          find all rows of pipe tables
section\s+(?!(\d+\.*\d*?){1,})  find cross-references that start with "section" but lack a section number
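To try these expressions outside an editor, here is a small Python sketch that runs them line by line over a sample text. The patterns are copied from the list above. The labels, the function name, and the sample text are mine, and your editor's regex engine may differ from Python's in small ways.

```python
import re

# Patterns copied verbatim from the list above; the labels are illustrative.
PATTERNS = {
    "empty heading": r"#+\s*$",
    "trailing whitespace": r"\s+$",
    "repeated space": r"\b\s\s+\b",
    "pipe table row": r"\|.*\|",
    "section without number": r"section\s+(?!(\d+\.*\d*?){1,})",
}

def scan(text):
    """Return (line number, label) pairs for every line that matches a pattern."""
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        for label, pattern in PATTERNS.items():
            if re.search(pattern, line):
                hits.append((number, label))
    return hits

SAMPLE = "\n".join([
    "# Title",
    "##",
    "some text with trailing spaces   ",
    "two  spaces between words",
    "| a | b |",
    "see section 3.1 for details",
    "see section below for details",
])

for number, label in scan(SAMPLE):
    print(number, label)
```

Note that the last pattern is case sensitive as written, so it will not flag "Section" with a capital S unless you turn on case-insensitive matching.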

Specify a different ssh key for each host


The ~/.ssh/config file can be edited to specify a different key/identity for each host. This is useful when you have different ssh keys set up on different git servers.

Host mycompany
    HostName mycompany.com
    User fooey
Host github.com
    IdentityFile ~/.ssh/github.key

If you specify a HostName that is different from Host, the .git/config file should use the name specified in Host. That should also be the host name used in git commands such as clone and remote.

See Simplify Your Life With an SSH Config File for more.

Gifting e-books


Not all e-book retailers allow gifting. At work, we’re using gifting so that our centralized purchase department can pay for books and gift back to whoever requested the purchase.

Here’s a list of retailers that allow gifting:

  • amazon.com
  • smashwords.com
  • iTunes Store

Mac OS X tips


This is an ever expanding list of tips for Mac OS X. Leave a comment below if you have tips of your own to share.

Go to a folder in Finder

Hit Command-Shift-G, then type the folder path, or right-click and paste it.

Go to a folder in Terminal from Finder

Start Terminal. Type “cd ” without quotes, then drag the folder from Finder (or its status bar) into Terminal. You can also open Terminal directly from Finder: head over to System Preferences, Keyboard, Shortcuts, Services, and enable New Terminal at Folder and/or New Terminal Tab at Folder. The selected option will then appear under the Services context menu when you right-click an item in Finder.

Go to a folder in Finder from Terminal

Type the command

open .

Connect to WiFi and continue using your wired internet connection

Head over to System Preferences, Network, Set Service Order, and raise Thunderbolt Ethernet (or any other interface) above WiFi.

Live webcam feed in a web meeting

To show a live webcam feed on your screen, try Photo Booth (comes preinstalled). Most off-the-shelf USB webcams work just fine with Mac OS X and Photo Booth.

Combine multiple PDF documents

You can use Preview to combine multiple PDF documents, rearranging and leaving out pages you don’t need…

Go to folder in Spotlight Search

Once you’ve searched for the document and it is highlighted, pressing Enter will open the document, while pressing Command+Enter will take you to the folder where the document is located.

Keyboard Shortcuts

Fn+F11 reveals the desktop. Useful to quickly drag some files on to the Desktop or vice-versa.

PPI calculation


Have the screen resolution (width and height) in pixels and the diagonal length in inches, and want to calculate the PPI? Pretty simple really. Let’s use a concrete example. The iPad 4 has a screen resolution of 2048 by 1536 pixels, and a diagonal length of 9.7 inches. Thus we calculate its PPI as

20130222-212656.jpg

The calculation above was performed using the wonderful MyScript Calculator, and the result captured by taking a screenshot.

Here’s the same formula using WordPress’ LaTeX renderer, in a more generalized form: \frac{\sqrt{w^2 + h^2}}{L}, where w is the screen width in pixels, h is the screen height in pixels, and L is the diagonal length of the screen in inches.
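The formula is also easy to check in code. Here is a minimal Python sketch using the iPad 4 numbers from above; the function and parameter names are mine.

```python
import math

def ppi(width_px, height_px, diagonal_in):
    """Pixels per inch: diagonal length in pixels divided by diagonal length in inches."""
    return math.hypot(width_px, height_px) / diagonal_in

# iPad 4: 2048 by 1536 pixels, 9.7 inch diagonal.
print(round(ppi(2048, 1536, 9.7), 1))  # 263.9
```

The diagonal works out to exactly 2560 pixels here, so the result is 2560 / 9.7, or roughly 264 PPI.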

Merge pdf files using ghostscript


To merge or join PDF files with Ghostscript, run the following from the command line:

"c:\Program Files\gs\gs9.06\bin\gswin64.exe" -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=join.pdf -dBATCH ch01.pdf ch02.pdf ch03.pdf ch04.pdf ch05.pdf ch06.pdf ch07.pdf ch08.pdf ch09.pdf ch10.pdf ch11.pdf ch12.pdf ch13.pdf AppA.pdf AppB.pdf

Change the command appropriately for your operating system and files.

A pity gsview does not provide a GUI for doing that.