Migrating to AsciiDoc from MS Word

Pandoc

Pandoc is a Swiss Army knife for converting one markup format to another. It does an admirable job of converting simple docx files to AsciiDoc.

Generally, we don’t like to recommend pandoc because it doesn’t create AsciiDoc the way we prefer. However, in this case, it’s a good choice.

To convert an MS Word docx file to AsciiDoc, run the following command:

$ pandoc --from=docx --to=asciidoc --wrap=none --atx-headers \
  --normalize --extract-media=extracted-media input.docx > output.adoc

Then, edit the output file to tidy it up.
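When many documents need converting, the command above can be wrapped in a loop. The following is a minimal sketch, assuming a POSIX shell. The name convert_all_docx is a hypothetical helper, and the options are copied verbatim from the command above, so later pandoc releases that have dropped or renamed some of them will need the option list adjusted.

```shell
# Convert every .docx file in a directory to AsciiDoc.
# convert_all_docx is a hypothetical helper name, not part of pandoc.
convert_all_docx() {
  dir=${1:-.}
  for docx in "$dir"/*.docx; do
    # Skip the unexpanded pattern when the directory has no .docx files.
    [ -e "$docx" ] || continue
    base=${docx%.docx}
    # Give each document its own media directory so images do not collide.
    pandoc --from=docx --to=asciidoc --wrap=none --atx-headers \
      --normalize --extract-media="$base-media" "$docx" > "$base.adoc" \
      || echo "conversion failed: $docx" >&2
  done
}

# Usage: convert_all_docx path/to/word-files
```

Each input file gets an .adoc file and a media directory next to it, which keeps the extracted images of different documents apart.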

The conversions you can expect are shown in the following table (tested with MS Word 2010 and Pandoc 1.17 on Windows):

Table 1. Pandoc MS Word to AsciiDoc conversions

| MS Word feature | Conversion |
|---|---|
| Headings (using MS Word Heading styles 1-5) | Yes |
| Tables | Yes. Merged cells are unmerged. Column widths are ignored. |
| Bulleted or numbered lists | Yes |
| Footnotes | Yes |
| Figure and table captions | Normal paragraph |
| Any other MS Word styled paragraphs | Normal paragraph |
| Embedded images | Yes |
| Character formatting (bold, underline and italic) | Yes |
| Document automation (fields, auto-generated figure and table numbers) | No - ignored |
| Internal references (e.g. "See Figure 3") | Plain text |
| Drawing canvas | Ignored |
| Text boxes | Ignored |
| Linked (not embedded) images | Ignored |
| Vector graphics (MS Word "insert shape") | Ignored |

Optimizing for Pandoc

The basic usage documented above is fine for one-off imports. If you have a lot of documents to convert, it is worthwhile cleaning the input document first and automating the post-conversion tidy-up.

  1. Clean up the MS Word document:

    • Remove non-essential material (title pages, headers and footers, table of contents, etc.); it is usually easiest to copy just the body text into a new blank document

    • Switch off tracking and accept all changes

    • Ensure you have used Heading styles for headings

    • Ensure table titles are immediately above their table

    • Ensure figure titles are immediately above their figure

    • Ensure images are inserted as embedded files, not as links

    • Remove canvases and put any images they contained into the main flow as paragraph images (this limitation may be removed in the next release of pandoc)

    • Replace all internal references and auto-generated sequence numbers with their literal values (Ctrl+A, Ctrl+Shift+F9)

    • Remove text boxes and put their text into the main flow

    • Replace special characters: smart quotes with simple quotes, non-breaking hyphens with normal hyphens.

    • Remove all character formatting (Ctrl+A, Ctrl+B, Ctrl+B, Ctrl+I, Ctrl+I, Ctrl+U, Ctrl+U)

    • Optional: insert ids and cross references using AsciiDoc notation (you might find it easier to do this now than in the AsciiDoc document later)

    • Save as "Strict Open XML Document (*.docx)"

  2. Convert using pandoc as shown above.

  3. Check that the output document looks OK, and that all images have been extracted.

    If pandoc does not extract the images for some reason, you can extract them yourself with the unzip tool. A docx file is just a zip archive with a docx file extension. Embedded images are located in the word/media directory.

    $ unzip input.docx -d input-docx
    $ ls input-docx/word/media/

  4. Fix up the output, preferably with an editor that can do regular expressions:

    • Delete automatic ids (those beginning with an underscore)

    • Replace long table delimiters with short ones.

    • Insert line breaks to get to one sentence per line.

    • Re-insert images and turn caption paragraphs back into Asciidoctor captions.

    • Replace the hard cross references with AsciiDoc references.

    • Fix tables: merged cells will have been unmerged, and column widths need to be restored.

  5. Try to convert it, and fix any errors that come up.

The following are POSIX shell one-liners to automate some of these steps (adjust the regexps to match your particular document):

  • Delete automatically inserted ids (the match is non-greedy so that only the id itself is removed)

    $ perl -W -pe  's!\[\[_.*?\]\]!!g' -i output.adoc

  • Shorten table delimiters

    $ perl -W -pe  's!\|==*!|====!g' -i output.adoc

  • One sentence per line. Take care not to match list markers. Abbreviations will confuse it, and there is no reliable way around that.

    $ perl -W -pe 's!(\w\w+)\.\s+(\w)!$1.\n$2!g' -i output.adoc

  • Replace figure captions with id and title

    $ perl -W -pe 's!^Figure (\d+)\s?(.*)![[fig-$1]]\n.$2\n!g' -i output.adoc

  • Replace references to figures with AsciiDoc xrefs (run this after the caption step, otherwise the captions themselves are rewritten)

    $ perl -W -pe 's!Figure (\d+)!<<fig-$1>>!g' -i output.adoc

Google Docs

Google Docs can already upload and edit MS Word docx files. Using the AsciiDoc Processor add-on by Guillaume Grossetie, you can copy and paste part or all of the document from Google Docs as AsciiDoc text. It appears to handle substantially fewer features than pandoc, but further development is expected. The source for the add-on is at https://github.com/Mogztter/asciidoc-googledocs-addon/.

Plain Text

This method is only useful for very small files or if the other methods are not available.

It keeps the text and converts fields such as auto-numbered lists and cross references to their literal values.

It loses tables (converted to plain paragraphs), images, symbols, form fields, and textboxes.

In MS Word, use Save As > Plain Text, then when the File Conversion dialog appears, set:

  • Other encoding: UTF-8

  • Do not insert line breaks

  • Allow character substitution

Save the file then apply AsciiDoc markup manually.

Experiment with the encoding. Try UTF-8 first, but if you get problems you can always revert to US-ASCII.
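One way to spot encoding problems early is to check that the saved file really is valid UTF-8 before adding markup. The following is a minimal sketch, assuming a POSIX shell and the iconv utility. The name is_valid_utf8 is a hypothetical helper and notes.txt is a placeholder file name.

```shell
# Succeed only if the file can be decoded as UTF-8 from start to end.
# is_valid_utf8 is a hypothetical helper name.
is_valid_utf8() {
  # iconv exits with a non-zero status when it meets an invalid byte sequence.
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Usage (notes.txt is a placeholder):
#   is_valid_utf8 notes.txt || echo "notes.txt is not valid UTF-8" >&2
```

If the check fails, saving the file again from MS Word with a different encoding setting is usually quicker than repairing the text by hand.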