HTML Cleaner Guide: Strip Word, Google Docs, and MSO Bloat

Updated May 27, 2026 | 5 min read | Popupnote Editorial Team

HTML pasted from Word, Google Docs, or rich-text editors arrives loaded with inline styles, MSO comments, span wrappers, and proprietary attributes that inflate your page and break your styling. An HTML cleaner strips the noise, keeping the semantic markup and discarding the formatting cruft, so the content you paste into a CMS or email template is lean.

This guide covers what cleaning removes, when to use it, and the practical points that prevent broken content.

What HTML Cleaning Does

Strips inline style attributes
Removes empty tags and redundant wrappers
Removes Microsoft Office-specific tags (<o:p> and other elements with an o: or w: namespace prefix)
Removes class attributes from foreign sources
Converts non-semantic tags to semantic equivalents (for example, b to strong and i to em)
Normalises whitespace and line breaks
Removes script tags and event-handler attributes (security)
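
As a rough illustration of several of the steps listed above, here is a minimal sketch built on Python's standard html.parser module. The names BasicCleaner and clean_basic are hypothetical, and the sketch covers only a subset of the list: it drops style, class, and on* attributes, namespaced Office tags, comments, and script/style blocks. It is not how any particular cleaner tool is implemented, and it is not a substitute for a vetted sanitiser.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}          # elements that never take an end tag
DROP_WITH_CONTENT = {"script", "style"}  # removed together with their content


class BasicCleaner(HTMLParser):
    """Re-serialises HTML while dropping attributes and tags seen as cruft."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        # A colon in the tag name marks a namespaced Office tag such as o:p.
        if self.skip_depth or ":" in tag:
            return
        kept = ""
        for name, value in attrs:
            if name in ("style", "class") or name.startswith("on"):
                continue
            kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or ":" in tag or tag in VOID_TAGS:
            return
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    # Comments, including MSO conditional comments, are dropped because
    # handle_comment is left at its default no-op implementation.


def clean_basic(html):
    parser = BasicCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Parsing and re-serialising, rather than pattern-matching on the raw string, means attribute quoting and nesting are handled by the parser instead of by guesswork.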

Common Sources of Dirty HTML

Microsoft Word — Adds mso-* styles and conditional comments
Google Docs — Wraps content in spans with custom styles
Outlook / Email clients — Inline tables and font tags
Old WYSIWYG editors — TinyMCE, CKEditor legacy output
Copy-paste from web pages — Brings styling from source site
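
Each source tends to leave recognisable fingerprints, which can help you pick a cleaning level. The sketch below guesses the origin of a pasted fragment from a few common markers. The marker table and the guess_source name are assumptions made for illustration, the list is far from exhaustive, and the regular expressions are used only for detection here, not for sanitising.

```python
import re

# Hypothetical fingerprint table: (label, pattern). Checked in order.
SOURCE_MARKERS = [
    # Word and Outlook both emit mso-* styles, Mso* classes, and Office namespaces.
    ("Microsoft Office",
     re.compile(r'mso-|class="?Mso|urn:schemas-microsoft-com', re.I)),
    # Google Docs usually wraps pasted content in an element whose id
    # starts with docs-internal-guid.
    ("Google Docs", re.compile(r"docs-internal-guid", re.I)),
    # Font tags point to older editors or email clients.
    ("Legacy editor or email client", re.compile(r"<font\b", re.I)),
]


def guess_source(html):
    """Return a best-guess label for where a pasted fragment came from."""
    for label, pattern in SOURCE_MARKERS:
        if pattern.search(html):
            return label
    return "unknown"
```

A result of "unknown" only means none of the listed markers matched; the fragment may still carry foreign styling.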

Common Use Cases

Pasting articles from Word into a CMS (WordPress, Drupal)
Cleaning email templates from designers
Preparing content from Google Docs for publishing
Sanitising user-submitted HTML
Reducing page weight by removing inline styles
Fixing layout issues caused by foreign CSS

What to Keep vs Strip

Keep

Semantic tags: h1-h6, p, ul/ol/li, strong, em, blockquote
Links with href
Images with src and alt
Tables when structurally needed

Strip

Inline styles (let your stylesheet handle it)
Font tags (deprecated)
Empty paragraphs and divs
Script tags and event handlers
Conditional comments
MSO/Office-specific markup
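
One way to express the Keep and Strip lists is as an allowlist: name the tags and attributes you want, and let everything else fall away. The following is a minimal sketch under that assumption, using Python's standard html.parser. The ALLOWED table, AllowlistCleaner, and clean_allowlist are hypothetical names. Tags outside the allowlist are unwrapped (their text is kept), script and style blocks are removed with their content, and paragraphs containing only whitespace are dropped. URL values are passed through unchecked, so a production cleaner would need scheme validation on top.

```python
from html import escape
from html.parser import HTMLParser

# Allowlist based on the Keep list: tag -> attributes that survive.
ALLOWED = {
    "h1": (), "h2": (), "h3": (), "h4": (), "h5": (), "h6": (),
    "p": (), "ul": (), "ol": (), "li": (), "strong": (), "em": (),
    "blockquote": (), "a": ("href",), "img": ("src", "alt"),
    "table": (), "thead": (), "tbody": (), "tr": (), "th": (), "td": (),
}
VOID = {"img"}
DROP_WITH_CONTENT = {"script", "style"}


class AllowlistCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # (kind, text) pairs; kind is open, close, void, or text
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skipping += 1
        elif not self.skipping and tag in ALLOWED:
            kept = "".join(
                ' {}="{}"'.format(name, escape(value or "", quote=True))
                for name, value in attrs if name in ALLOWED[tag]
            )
            kind = "void" if tag in VOID else "open"
            self.out.append((kind, "<{}{}>".format(tag, kept)))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skipping = max(0, self.skipping - 1)
        elif not self.skipping and tag in ALLOWED and tag not in VOID:
            # Look back past whitespace-only text to see if the element is empty.
            i = len(self.out)
            while i and self.out[i - 1][0] == "text" and not self.out[i - 1][1].strip():
                i -= 1
            if tag == "p" and i and self.out[i - 1] == ("open", "<p>"):
                del self.out[i - 1:]   # drop the empty paragraph entirely
            else:
                self.out.append(("close", "</{}>".format(tag)))

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(("text", escape(data, quote=False)))


def clean_allowlist(html):
    parser = AllowlistCleaner()
    parser.feed(html)
    parser.close()
    return "".join(text for _, text in parser.out)
```

An allowlist fails safe: markup the table does not mention is removed by default, whereas a blocklist lets through anything its author did not anticipate.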

Levels of Cleaning

Light — Remove styles and classes only
Medium — Plus empty tags and proprietary markup
Strict — Reduce to plain semantic HTML only
Plain text — Strip all tags
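
The plain-text level is the simplest to picture. A minimal sketch, assuming Python's standard html.parser and hypothetical names (TextExtractor, to_plain_text): every tag is discarded, block-level end tags and br become line breaks, and runs of whitespace are collapsed.

```python
from html.parser import HTMLParser

# End tags that should produce a line break in the text output (an assumption;
# extend the set to suit your content).
BLOCK_TAGS = {"p", "div", "li", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skipping = max(0, self.skipping - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace (including non-breaking spaces) and drop blank lines.
    lines = (" ".join(line.split()) for line in text.split("\n"))
    return "\n".join(line for line in lines if line)
```

Keeping a line break per block element avoids the common failure where headings and paragraphs run together into one line.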

Common Pitfalls

Over-cleaning. Stripping needed semantic markup along with cruft
Lost line breaks. Word's paragraph styles converted incorrectly
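Broken lists. Word emits list items as styled paragraphs (MsoListParagraph with mso-list rules) rather than ul/ol/li, so they need careful conversion
Images lost. Word embeds images as base64 data or points to local file:// paths that do not resolve once published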
Tables collapse. Width attributes stripped without CSS replacement
Encoding issues. Smart quotes and em dashes get mangled when text is decoded with the wrong character set (for example, UTF-8 read as Windows-1252)
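
The encoding pitfall is often reversible. When UTF-8 text has been decoded as Windows-1252, a curly apostrophe shows up as a three-character sequence starting with "â€". The sketch below, with the hypothetical name repair_mojibake, tries to undo that one specific mix-up and leaves the text alone if the round trip fails. It is a heuristic, not a general encoding detector, and it cannot recover characters that the wrong decoding step already destroyed.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    Returns the input unchanged when the round trip is not possible,
    which is the usual outcome for text that was decoded correctly.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Build a mangled sample by making the mistake on purpose.
original = "It\u2019s done \u2014 finally"          # curly apostrophe, em dash
mangled = original.encode("utf-8").decode("cp1252")
assert repair_mojibake(mangled) == original
```

Run a repair like this before cleaning, and check the result by eye: correctly decoded text normally fails the round trip and passes through untouched, but rare false positives are possible.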

Security Notes

Always sanitise user-submitted HTML before rendering
Strip script tags, event handlers (onclick, onerror, etc.)
Validate href attributes (no javascript: URLs)
Validate src attributes for images
Use a vetted library such as DOMPurify (JavaScript) or Bleach (Python, now deprecated by its maintainers) for production sanitisation
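
To make the href and src checks concrete, here is a minimal sketch of a scheme allowlist. The function name is_safe_url and the default set of schemes are assumptions chosen for illustration. It expects an attribute value that an HTML parser has already decoded, and it illustrates the idea only; a vetted sanitisation library remains the right tool in production.

```python
# Assumed default: schemes considered safe for links. Adjust per attribute,
# for example allowing only http and https for image src values.
ALLOWED_SCHEMES = frozenset({"http", "https", "mailto"})


def is_safe_url(value, allowed=ALLOWED_SCHEMES):
    """Return True if value is a relative URL or uses an allowed scheme."""
    # Browsers ignore some whitespace and control characters inside a scheme,
    # so "java\tscript:" still runs. Remove them before checking.
    compact = "".join(ch for ch in value if ord(ch) > 0x20)
    head, colon, _ = compact.partition(":")
    # No colon, or a colon that appears after a path, query, or fragment
    # delimiter, means there is no scheme: the URL is relative.
    if not colon or any(ch in head for ch in "/?#"):
        return True
    return head.lower() in allowed
```

For example, is_safe_url("/about") and is_safe_url("https://example.com") pass, while is_safe_url("JaVaScript:alert(1)") and is_safe_url("data:text/html,x") are rejected. Checking against an allowlist of schemes is safer than searching for the string "javascript:", which misses case and whitespace variants.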

Before Pasting Into a CMS

Copy content from source
Paste into cleaner tool
Choose cleaning level
Review output for missing elements
Paste cleaned HTML into CMS source view
Preview rendered result
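
The review step in this workflow can be partly automated by counting structural elements before and after cleaning. The sketch below is one possible approach with hypothetical names (TagCounter, count_tags, missing_elements) and an assumed list of tags worth tracking. It reports only how many elements of each kind disappeared, so intentionally removed empty paragraphs will show up too.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of elements whose loss usually matters to an editor.
TRACKED = ("h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "a", "img", "table")


class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            self.counts[tag] += 1


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def missing_elements(original, cleaned):
    """Map each tracked tag to how many instances were lost during cleaning."""
    before, after = count_tags(original), count_tags(cleaned)
    return {tag: before[tag] - after[tag]
            for tag in TRACKED if before[tag] > after[tag]}
```

An empty result means no tracked element was lost; anything else is a prompt to compare that part of the source and output by hand before pasting into the CMS.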

Better Alternatives Where Possible

Write directly in the CMS rather than copy-paste
Use Markdown when the CMS supports it
Configure WYSIWYG editors with paste-as-plain-text
For email, use email-specific templating (MJML)

Quick Tips

Word documents need the most aggressive cleaning
Always preview cleaned HTML before publishing
Save your preferred cleaning level for reuse
For security, use a vetted sanitisation library, not regex
Writing in Markdown sidesteps the cleaning problem entirely

Use the HTML Cleaner on Popupnote

The HTML Cleaner on Popupnote strips inline styles, MSO tags, and other cruft from pasted HTML. It is aimed at content editors, web developers, and anyone publishing copied content. The tool runs in your browser and requires no account.