HTML pasted from Word, Google Docs, or rich-text editors arrives loaded with inline styles, MSO comments, span wrappers, and proprietary attributes that inflate page weight and override your site's styling. An HTML cleaner strips that noise, keeping the semantic markup and discarding the formatting cruft, so the content you paste into a CMS or email template stays lean.

This guide covers what cleaning removes, when to use it, and the practical points that prevent broken content.

What HTML Cleaning Does

  • Strips inline style attributes
  • Removes empty tags and redundant wrappers
  • Cleans Microsoft Office-specific tags (<o:p> and other namespaced elements, such as those prefixed w:)
  • Removes class attributes from foreign sources
  • Converts presentational tags to semantic equivalents (for example, b to strong and i to em)
  • Normalises whitespace and line breaks
  • Removes script tags and event-handler attributes (security)
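
As a rough illustration of the attribute-level steps above, the following Python sketch re-emits markup while dropping style and class attributes, on* event handlers, and comments (which includes Word's conditional comments). It uses only the standard library's html.parser. The names AttributeStripper and strip_attributes are hypothetical, and the set of dropped attributes is an assumption, not a description of how any particular cleaner works.

```python
from html import escape
from html.parser import HTMLParser

# Attributes removed from every element (an illustrative choice).
DROPPED_ATTRS = {"style", "class"}


class AttributeStripper(HTMLParser):
    """Re-emits HTML, dropping styling attributes, event handlers and comments."""

    def __init__(self):
        # Keep character references as written so text passes through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, closer):
        parts = [tag]
        for name, value in attrs:
            if name in DROPPED_ATTRS or name.startswith("on"):
                continue  # styling attribute or event handler such as onclick
            if value is None:
                parts.append(name)  # boolean attribute, e.g. "disabled"
            else:
                # The parser unescapes attribute values, so escape them again.
                parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closer

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, ">"))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    # handle_comment is deliberately not overridden, so comments are discarded.


def strip_attributes(html):
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    pasted = '<p class="MsoNormal" style="margin:0">Hi</p><!--[if gte mso 9]><xml></xml><![endif]-->'
    print(strip_attributes(pasted))  # <p>Hi</p>
```

This sketch leaves the tags themselves alone, so script elements and empty wrappers would still need separate handling.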

Common Sources of Dirty HTML

  • Microsoft Word — Adds mso-* styles and conditional comments
  • Google Docs — Wraps content in spans with custom styles
  • Outlook / Email clients — Inline tables and font tags
  • Old WYSIWYG editors — TinyMCE, CKEditor legacy output
  • Copy-paste from web pages — Brings styling from source site

Common Use Cases

  • Pasting articles from Word into a CMS (WordPress, Drupal)
  • Cleaning email templates from designers
  • Preparing content from Google Docs for publishing
  • Sanitising user-submitted HTML
  • Reducing page weight by removing inline styles
  • Fixing layout issues caused by foreign CSS

What to Keep vs Strip

Keep

  • Semantic tags: h1-h6, p, ul/ol/li, strong, em, blockquote
  • Links with href
  • Images with src and alt
  • Tables when structurally needed

Strip

  • Inline styles (let your stylesheet handle it)
  • Font tags (deprecated)
  • Empty paragraphs and divs
  • Script tags and event handlers
  • Conditional comments
  • MSO/Office-specific markup
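
One way to express the keep/strip split is an allowlist: anything not explicitly kept is unwrapped (tag removed, text preserved), and script or style elements are dropped together with their content. The Python sketch below assumes the keep list given above; AllowlistCleaner, clean, and the exact tag and attribute sets are hypothetical. It does not validate href or src values, so it is not a substitute for a vetted sanitiser.

```python
from html import escape
from html.parser import HTMLParser

# Tag -> attributes to keep. Illustrative, based on the "Keep" list above.
ALLOWED = {
    "h1": set(), "h2": set(), "h3": set(), "h4": set(), "h5": set(), "h6": set(),
    "p": set(), "ul": set(), "ol": set(), "li": set(),
    "strong": set(), "em": set(), "blockquote": set(),
    "a": {"href"}, "img": {"src", "alt"},
    "table": set(), "thead": set(), "tbody": set(),
    "tr": set(), "th": set(), "td": set(),
}
DROP_WITH_CONTENT = {"script", "style"}  # remove the element and its text
VOID = {"img"}  # elements that never get a closing tag


class AllowlistCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # unwrap: drop the tag but keep any text inside it
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so escape it again on the way out.
            self.out.append(escape(data, quote=False))


def clean(html):
    parser = AllowlistCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(clean('<div><font face="Arial"><p style="color:red">Hello</p></font></div>'))
    # <p>Hello</p>
```

An allowlist fails closed: an unfamiliar tag or attribute is removed by default, whereas a blocklist lets through anything it does not already know about.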

Levels of Cleaning

  • Light — Remove styles and classes only
  • Medium — Plus empty tags and proprietary markup
  • Strict — Reduce to plain semantic HTML only
  • Plain text — Strip all tags
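
The plain-text level can be sketched in a few lines of Python. The version below strips every tag but inserts a line break where a block element ends, so paragraphs and list items do not run together. TextExtractor, to_plain_text, and the choice of block tags are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Closing any of these starts a new line in the output (illustrative set).
BLOCK_TAGS = {"p", "div", "li", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED = {"script", "style"}  # their text is never shown to readers


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skipping += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skipping = max(0, self.skipping - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(to_plain_text("<h1>Title</h1><p>First line<br>second   line</p>"))
    # Title
    # First line
    # second line
```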

Common Pitfalls

  • Over-cleaning. Stripping needed semantic markup along with cruft
  • Lost line breaks. Word's paragraph styles converted incorrectly
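  • Broken lists. Word often exports lists as styled paragraphs (MsoListParagraph classes, mso-list styles) rather than ul/ol markup; need careful conversion
  • Images lost. Word embeds images as base64 data or local file:// paths, which a CMS may strip or which will not resolve once published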
  • Tables collapse. Width attributes stripped without CSS replacement
  • Encoding issues. Smart quotes and em dashes mangled when the character encoding is mismatched
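
The lost-images pitfall is easy to miss in a visual review, so a small pre-publish check can help. This Python sketch lists img elements whose src is missing, points at a local file, or embeds a data: URI. ImageAudit, audit_images, and the problem labels are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class ImageAudit(HTMLParser):
    """Collects (problem, src) pairs for images unlikely to survive publishing."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = (dict(attrs).get("src") or "").strip()
        lowered = src.lower()
        if not src:
            self.problems.append(("missing src", src))
        elif lowered.startswith("file:"):
            self.problems.append(("local file path", src))
        elif lowered.startswith("data:"):
            self.problems.append(("embedded data URI", src))


def audit_images(html):
    parser = ImageAudit()
    parser.feed(html)
    parser.close()
    return parser.problems


if __name__ == "__main__":
    pasted = '<p><img src="file:///C:/Users/me/image001.png"></p>'
    for problem, src in audit_images(pasted):
        print(f"{problem}: {src}")
```

Images flagged this way usually need to be uploaded to the CMS and relinked by hand.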

Security Notes

  • Always sanitise user-submitted HTML before rendering
  • Strip script tags, event handlers (onclick, onerror, etc.)
  • Validate href attributes (no javascript: URLs)
  • Validate src attributes for images
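  • Use a vetted library such as DOMPurify (JavaScript) or Bleach (Python) for production sanitisation; Bleach's maintainers have marked it deprecated, so check its current status before adopting it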
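
To make the href check concrete, here is a minimal Python sketch of scheme allowlisting. It assumes the attribute value has already been entity-decoded by an HTML parser. is_safe_href and the allowed scheme set are hypothetical. It illustrates the idea only and does not replace a vetted sanitisation library.

```python
import re

# Schemes permitted in links (an illustrative policy).
ALLOWED_SCHEMES = {"http", "https", "mailto"}
_SCHEME = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.\-]*):")


def is_safe_href(value):
    """Return True for relative URLs and for URLs with an allowed scheme."""
    # Browsers ignore tabs and newlines inside URLs, which attackers use to
    # disguise "javascript:". Removing all whitespace and control characters
    # before the check is deliberately conservative.
    compact = "".join(ch for ch in value if ord(ch) > 0x20)
    match = _SCHEME.match(compact)
    if match is None:
        return True  # no scheme, so the URL is relative to the current page
    return match.group(1).lower() in ALLOWED_SCHEMES


if __name__ == "__main__":
    for href in ["https://example.com", "/about", "javascript:alert(1)"]:
        print(href, is_safe_href(href))
    # https://example.com True
    # /about True
    # javascript:alert(1) False
```

Checking against an allowlist of schemes, rather than searching for "javascript:", also rejects data: and vbscript: URLs without naming them.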

Before Pasting Into a CMS

  1. Copy content from source
  2. Paste into cleaner tool
  3. Choose cleaning level
  4. Review output for missing elements
  5. Paste cleaned HTML into CMS source view
  6. Preview rendered result

Better Alternatives Where Possible

  • Write directly in the CMS rather than copy-paste
  • Use Markdown when the CMS supports it
  • Configure WYSIWYG editors to paste as plain text by default
  • For email, use email-specific templating (MJML)

Quick Tips

  • Word documents need the most aggressive cleaning
  • Always preview cleaned HTML before publishing
  • Save your preferred cleaning level for reuse
  • For security, use a vetted sanitisation library, not regex
  • Writing in Markdown sidesteps the cleaning problem entirely

Use the HTML Cleaner on Popupnote

The HTML Cleaner on Popupnote strips inline styles, MSO tags, and other cruft from pasted HTML. It is aimed at content editors, web developers, and anyone publishing copied content. The tool runs in your browser and requires no account.