HTML pasted from Word, Google Docs, or rich-text editors arrives loaded with inline styles, MSO comments, span wrappers, and proprietary attributes that bloat your page and break your styling. An HTML cleaner strips the noise, keeping the semantic markup and discarding the formatting cruft, so the content you paste into a CMS or email template is lean.
This guide covers what cleaning removes, when to use it, and the practical points that prevent broken content.
What HTML Cleaning Does
- Strips inline style attributes
- Removes empty tags and redundant wrappers
- Cleans Microsoft Office-specific tags (<o:p>, <w:)
- Removes class attributes from foreign sources
- Converts non-semantic tags to semantic equivalents
- Normalises whitespace and line breaks
- Removes script tags and event handler attributes (security)
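As a rough illustration of several of the steps above, here is a minimal sketch built on Python's standard html.parser module. The names NoiseStripper and strip_noise are hypothetical, chosen for this example, and the code is not how any particular cleaner tool works. It drops style, class, and on* attributes, comments, script elements, and namespaced Office tags, and re-emits everything else. It is not a security sanitiser.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # elements that never take a closing tag


class NoiseStripper(HTMLParser):
    """Re-emit HTML minus style/class/on* attributes, comments,
    <script> elements, and Office-namespaced tags such as <o:p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.script_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_depth += 1
            return
        if self.script_depth or ":" in tag:
            return  # inside a script, or an Office tag like <o:p>
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name not in ("style", "class") and not name.startswith("on")
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag == "script":
            self.script_depth = max(0, self.script_depth - 1)
            return
        if self.script_depth or ":" in tag or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.script_depth:
            self.out.append(escape(data, quote=False))

    # Comments (including MSO conditional comments) are dropped because
    # the default handle_comment implementation does nothing.


def strip_noise(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    dirty = '<p class="MsoNormal" style="margin:0">Hi <o:p></o:p><b>there</b></p>'
    print(strip_noise(dirty))
```

This sketch does not remove empty tags or convert non-semantic tags; those steps need a tree-based parser rather than a streaming one.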
Common Sources of Dirty HTML
- Microsoft Word — Adds mso-* styles and conditional comments
- Google Docs — Wraps content in spans with custom styles
- Outlook / Email clients — Inline tables and font tags
- Old WYSIWYG editors — TinyMCE, CKEditor legacy output
- Copy-paste from web pages — Brings styling from source site
Common Use Cases
- Pasting articles from Word into a CMS (WordPress, Drupal)
- Cleaning email templates from designers
- Preparing content from Google Docs for publishing
- Sanitising user-submitted HTML
- Reducing page weight by removing inline styles
- Fixing layout issues caused by foreign CSS
What to Keep vs Strip
Keep
- Semantic tags: h1-h6, p, ul/ol/li, strong, em, blockquote
- Links with href
- Images with src and alt
- Tables when structurally needed
Strip
- Inline styles (let your stylesheet handle it)
- Font tags (deprecated)
- Empty paragraphs and divs
- Script tags and event handlers
- Conditional comments
- MSO/Office-specific markup
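One way to express the Keep/Strip split is an allowlist: anything not explicitly listed is dropped. The sketch below assumes Python's standard html.parser. The ALLOWED table and the names AllowlistCleaner and clean_to_allowlist are hypothetical and should be adjusted per project. Disallowed tags are unwrapped (their text is kept), while script and style elements are dropped together with their content.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist mirroring the Keep list: tag -> permitted attributes.
ALLOWED = {tag: set() for tag in (
    "h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li",
    "strong", "em", "blockquote",
    "table", "thead", "tbody", "tr", "th", "td",
)}
ALLOWED["a"] = {"href"}
ALLOWED["img"] = {"src", "alt"}

DROP_WITH_CONTENT = {"script", "style"}
VOID_TAGS = {"img"}  # allowed elements that never take a closing tag


class AllowlistCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            kept = "".join(
                f' {name}="{escape(value, quote=True)}"'
                for name, value in attrs
                if name in ALLOWED[tag] and value is not None
            )
            self.out.append(f"<{tag}{kept}>")
        # Any other tag (span, font, div, o:p, ...) is unwrapped silently.

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def clean_to_allowlist(html: str) -> str:
    parser = AllowlistCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(clean_to_allowlist('<span style="color:red"><font>Hello</font></span>'))
```

The sketch keeps href and src values unchanged, so URL validation still has to happen separately, and it does not remove empty paragraphs.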
Levels of Cleaning
- Light — Remove styles and classes only
- Medium — Plus empty tags and proprietary markup
- Strict — Reduce to plain semantic HTML only
- Plain text — Strip all tags
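The plain-text level is the simplest to sketch. The example below assumes Python's standard html.parser. TextExtractor, to_plain_text, and the BLOCK_TAGS set are hypothetical names for this illustration. It keeps only text, turns block boundaries into line breaks, and collapses runs of whitespace.

```python
from html.parser import HTMLParser

# Tags whose end should produce a line break in the text output (assumed set).
BLOCK_TAGS = {"p", "div", "li", "blockquote", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(to_plain_text("<h1>Title</h1><p>Hello <b>world</b> &amp; more</p>"))
```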
Common Pitfalls
- Over-cleaning. Stripping needed semantic markup along with cruft
- Lost line breaks. Word's paragraph styles converted incorrectly
- Broken lists. Word typically exports lists as styled paragraphs (mso-list) rather than ul/ol; need careful conversion
- Images lost. Word embeds as base64 or local file:// paths
- Tables collapse. Width attributes stripped without CSS replacement
- Encoding issues. Smart quotes, em-dashes mangled
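The encoding pitfall is easy to reproduce. The sketch below, in Python, shows the common failure where UTF-8 bytes are decoded as Windows-1252, and one possible workaround: converting non-ASCII characters to numeric character references before the HTML passes through a pipeline whose encoding you do not control. The function names are hypothetical, and whether entities are the right fix depends on the target system.

```python
def simulate_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode as Windows-1252.

    Some characters would raise UnicodeDecodeError instead, because
    Python's cp1252 codec leaves a few byte values undefined.
    """
    return text.encode("utf-8").decode("cp1252")


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric reference (&#NNNN;)."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "It\u2019s done \u2014 finally"  # curly apostrophe and a long dash
    print(simulate_mojibake(sample))    # the mangled form seen on broken pages
    print(to_numeric_entities(sample))  # ASCII-only, survives re-encoding
```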
Security Notes
- Always sanitise user-submitted HTML before rendering
- Strip script tags, event handlers (onclick, onerror, etc.)
- Validate href attributes (no javascript: URLs)
- Validate src attributes for images
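- Use a vetted library for production sanitisation, such as DOMPurify (JavaScript) or Bleach (Python; its maintainers have announced its deprecation, so check its status first)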
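The href check above can be sketched with Python's standard urllib.parse. The SAFE_SCHEMES set and the is_safe_href name are assumptions for this example, and the check is meant to run on the already-decoded attribute value. It illustrates the idea and does not replace a sanitisation library.

```python
from urllib.parse import urlsplit

# Assumed allowlist of schemes; relative URLs (no scheme) are also accepted.
SAFE_SCHEMES = {"http", "https", "mailto"}


def is_safe_href(href: str) -> bool:
    # Browsers ignore tabs, newlines, and surrounding whitespace in URLs,
    # so remove spaces and control characters before inspecting the scheme.
    cleaned = "".join(ch for ch in href if ord(ch) > 0x20)
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        return False  # malformed URL: reject rather than guess
    return scheme == "" or scheme in SAFE_SCHEMES


if __name__ == "__main__":
    for candidate in ("https://example.com/a", "/relative/path",
                      "javascript:alert(1)", " JaVa\tScript:alert(1)"):
        print(repr(candidate), is_safe_href(candidate))
```

An allowlist of schemes is used rather than a blocklist because javascript: is not the only risky scheme; data: and vbscript: URLs are rejected by the same rule. Protocol-relative URLs such as //host/path have no scheme and pass this check, so restrict them separately if external links are a concern.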
Before Pasting Into a CMS
- Copy content from source
- Paste into cleaner tool
- Choose cleaning level
- Review output for missing elements
- Paste cleaned HTML into CMS source view
- Preview rendered result
Better Alternatives Where Possible
- Write directly in the CMS rather than copy-paste
- Use Markdown when the CMS supports it
- Configure WYSIWYG editors with paste-as-plain-text
- For email, use email-specific templating (MJML)
Quick Tips
- Word documents need the most aggressive cleaning
- Always preview cleaned HTML before publishing
- Save your preferred cleaning level for reuse
- For security, use a vetted sanitisation library, not regex
- Writing in Markdown sidesteps the cleaning problem entirely
Use the HTML Cleaner on Popupnote
The HTML Cleaner on Popupnote strips inline styles, MSO tags, and other cruft from pasted HTML. It is aimed at content editors, web developers, and anyone publishing copied content, and it runs in your browser with no account required.