# Prompt: Migrate an OpenWetWare Wiki to Offline Markdown
Use this prompt to instruct an AI coding assistant (e.g. GitHub Copilot, Claude) to migrate an OpenWetWare lab wiki to local Markdown files with full offline capability.
## The Prompt
Migrate the lab wiki at `https://openwetware.org/wiki/NAMESPACE` to local Markdown files. The end result must be fully offline-readable — if openwetware.org and its CDN shut down, all content including images must still work.

Use the migration tool `migrate.py` in this repo. Edit the CONFIG section at the top to set the primary namespace and one-hop namespaces, then run the pipeline:

```bash
python migrate.py crawl      # Discover & convert all pages
python migrate.py fix        # Fix remaining external references
python migrate.py fix        # Run again — each pass catches new edge cases
python migrate.py audit      # Verify zero references remain
python migrate.py inventory  # Generate TODO.md and BROKEN_LINKS.md
```

Repeat `fix` → `audit` until audit returns zero. See the DOs and DON'Ts below.
## Step-by-Step Guide

### 1. Setup
```bash
mkdir labwiki && cd labwiki
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 markdownify
```
Copy `migrate.py` into the project root.
### 2. Configure

Edit the CONFIG section in `migrate.py`:
```python
WIKI_BASE = "https://openwetware.org"
PRIMARY_NAMESPACE = "YourLab"           # e.g. "Beauchamp"
HOME_PAGE_URL = f"{WIKI_BASE}/wiki/{PRIMARY_NAMESPACE}"
ONE_HOP_NAMESPACES = ["RAVE", "CAMRI"]  # linked wikis to include (not crawled recursively)
```
### 3. Crawl

```bash
python migrate.py crawl
```
This will:
- Start from the home page and follow all internal links
- Download ALL pages under the primary namespace
- Download one-hop pages (pages from other namespaces linked from primary pages)
- Convert HTML → Markdown using `markdownify`
- Download all CDN-hosted media (images, PDFs) to `attachments/`
- Rewrite links to relative local paths
### 4. Fix (run multiple times)

```bash
python migrate.py fix
python migrate.py fix   # yes, run it again
```
Each pass catches edge cases the previous pass missed. Keep running until audit is clean.
### 5. Audit

```bash
python migrate.py audit
```

Must return ZERO remaining references. If not, run `fix` again.
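If you want to automate the loop, a minimal sketch is below. It assumes `audit` exits with a nonzero status while references remain — verify that against `migrate.py` before relying on it.

```python
import subprocess

# Assumption: `audit` returns a nonzero exit code while references remain.
while subprocess.run(["python", "migrate.py", "audit"]).returncode != 0:
    subprocess.run(["python", "migrate.py", "fix"], check=True)
```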
### 6. Inventory

```bash
python migrate.py inventory
```

Generates `TODO.md` (all pages with checkboxes) and `BROKEN_LINKS.md` (unresolvable links).
## Architecture (What Gets Created)
```
labwiki/
├── migrate.py            # Migration tool
├── TODO.md               # Generated inventory
├── BROKEN_LINKS.md       # Unresolvable links
├── pages/
│   ├── YourLab/          # Primary namespace
│   │   ├── index.md      # Home page
│   │   ├── Publications.md
│   │   └── ...
│   ├── RAVE/             # One-hop namespaces
│   │   ├── index.md
│   │   └── ...
│   ├── CAMRI/
│   └── SomePage.md       # Root-level pages (no namespace)
└── attachments/
    ├── Publications/     # Media organized by page
    │   ├── paper.pdf
    │   └── figure1.jpg
    ├── RAVE_Install/
    └── ...
```
## DOs and DON'Ts

### DOs
- **DO** use Python script files for all batch operations. Never inline Python in zsh/bash — quoting issues with URLs, special characters, and nested quotes will silently corrupt commands.
- **DO** run `fix` multiple times. A single regex pass will NOT catch everything. Each pass reveals new edge cases (title attributes, nested links, angle-bracket URLs). Run `fix` → `audit` in a loop until audit returns zero.
- **DO** audit after every fix pass with `grep -rn 'openwetware.org' pages/` and also `grep -rn 'oww-files' pages/`. The audit command does this automatically.
- **DO** handle title attributes in markdown links. Links like `[text](url "Beauchamp:PageName")` are common. A naive regex `[^)]+` greedily captures the title as part of the URL. Split URL from title with `r'(\S+?)(\s+"[^"]*")?$'` (see the first sketch after this list).
- **DO** handle the CDN thumbnail trick. Thumbnails live on `oww-files-thumb.sfo3.cdn.digitaloceanspaces.com`. To get the full-size image: replace `oww-files-thumb` with `oww-files-public` in the URL AND drop the last path segment (the `/250px-filename.jpg` part).
- **DO** use the MediaWiki API to resolve `File:` page links. URL: `{WIKI_BASE}/mediawiki/api.php?action=query&titles=File:{name}&prop=imageinfo&iiprop=url&format=json`. The response contains the actual CDN download URL.
- **DO** strip "Retrieved from" footers. These are MediaWiki boilerplate at the bottom of every page: `Retrieved from "https://openwetware.org/mediawiki/..."`. Strip them during conversion.
- **DO** strip `[edit]` section links. These are MediaWiki editing chrome that appear next to every heading.
- **DO** URL-decode filenames when saving locally (`%2C` → `,`, `%28` → `(`, etc.) and replace spaces with underscores (see the second sketch after this list).
- **DO** define scope clearly. Primary namespace = crawl everything. Other namespaces = one-hop only (download if linked, but do NOT follow their links recursively). This prevents scope creep.
- **DO** use markers for broken links that can be re-resolved later. Pattern: `[Link text](# "WikiName - not archived")` for missing pages, `~~Page Name~~ *(page not archived)*` for 404s.
- **DO** check for nested image-link patterns. MediaWiki often generates `[![](thumb.jpg)](/wiki/File:FullSize.jpg)` — an image wrapped in a clickable link. Both the `src` and the `href` need to be resolved.
- **DO** extract content from `#mw-content-text` only. The rest of the page (sidebar, navigation, footer) is wiki chrome, not content.
- **DO** handle multiple URL variations (see the third sketch after this list). The same wiki link can appear as:
  - `/wiki/Beauchamp:Foo` (relative)
  - `https://openwetware.org/wiki/Beauchamp:Foo` (absolute)
  - `http://www.openwetware.org/wiki/Beauchamp:Foo` (http, www)
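First, a minimal sketch of the URL/title split using the regex recommended above (the function name is hypothetical, not taken from `migrate.py`):

```python
import re

def split_url_title(link_target: str):
    """Split a markdown link target like 'url "title"' into (url, title)."""
    m = re.match(r'(\S+?)(\s+"[^"]*")?$', link_target.strip())
    if not m:
        return link_target, None
    url, title = m.group(1), m.group(2)
    return url, title.strip() if title else None

# split_url_title('/wiki/Beauchamp:Foo "Beauchamp:PageName"')
# -> ('/wiki/Beauchamp:Foo', '"Beauchamp:PageName"')
```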
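Second, a sketch of the filename normalization (again a hypothetical helper name):

```python
from urllib.parse import unquote

def local_filename(remote_name: str) -> str:
    """Decode percent-escapes (%2C -> ',', %28 -> '(') and replace spaces."""
    return unquote(remote_name).replace(" ", "_")

# local_filename("Figure%201%2C_final.jpg") -> "Figure_1,_final.jpg"
```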
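Third, one way to collapse the three URL variants to a single page title. The regex here is an assumption for illustration, not the pattern `migrate.py` actually uses:

```python
import re

# Matches relative, absolute, and http/www variants of a wiki link.
WIKI_LINK_RE = re.compile(r'^(?:https?://(?:www\.)?openwetware\.org)?/wiki/(.+)$')

def wiki_title(url: str):
    """Return 'Beauchamp:Foo' for any of the three variants above, else None."""
    m = WIKI_LINK_RE.match(url)
    return m.group(1) if m else None
```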
### DON'Ts

- **DON'T** use inline Python in shell commands. Zsh and bash have different quoting rules. URLs with special characters (`%`, `(`, `)`, `&`) will break. Always write a `.py` file and run it.
- **DON'T** assume one fix pass is enough. We needed 5+ rounds of fixing for the Beauchamp wiki (291 initial references → 0). Each pass catches ~60-80% of remaining issues.
- **DON'T** use a simple `[^)]+` regex for markdown link URLs. It fails on title attributes: in `[text](url "title")`, the `)` in `"title")` isn't the closing paren of the link. It also fails on nested image-links like `[![](src)](href)`.
- **DON'T** recursively crawl non-primary namespaces. Other wikis (RAVE, CAMRI, YAEL) have their own structure. Only download pages that are directly linked from primary namespace pages.
- **DON'T** forget about `File:` wrapper links. The pattern `[img](https://openwetware.org/wiki/File:Name.jpg)` is NOT a direct image URL — it's a wiki page that wraps the full-size image. It must be resolved via the API.
- **DON'T** leave CDN thumbnail URLs. `oww-files-thumb.sfo3.cdn.digitaloceanspaces.com` serves low-res thumbnails. Always convert to full-size from `oww-files-public`.
- **DON'T** skip the "Retrieved from" footer check. New pages downloaded in later rounds will have these footers. They reference `openwetware.org` and will fail the audit.
- **DON'T** batch-process quoting-sensitive strings through the shell. Even `echo` or `sed` can corrupt URLs with `&`, `=`, `(`, `)`. Python's string handling is safer.
- **DON'T** forget angle-bracket URLs. Some wiki pages contain `<http://openwetware.org/wiki/CAMRI:Flywheel>` — bare URLs in angle brackets. These need a separate regex pass, not just the `[text](url)` pattern (see the first sketch below).
- **DON'T** assume wiki page name resolution is case-sensitive. If a lookup fails, try the exact name, then the underscored form, then a case-insensitive match, then an index page fallback (see the second sketch below).
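A minimal sketch of that extra angle-bracket pass (the regex is an assumption, not taken from `migrate.py`):

```python
import re

# Bare angle-bracket URLs: <http://openwetware.org/wiki/CAMRI:Flywheel>
ANGLE_URL_RE = re.compile(r'<(https?://(?:www\.)?openwetware\.org/[^>\s]+)>')

def find_angle_urls(markdown: str):
    """Return every bare <...openwetware.org...> URL in a page."""
    return ANGLE_URL_RE.findall(markdown)
```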
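And a sketch of the fallback resolution order. The `local_pages` mapping and the index-page convention are assumptions about how `migrate.py` tracks downloaded pages:

```python
def resolve_page(name: str, local_pages: dict):
    """Hypothetical lookup implementing the fallback order above."""
    if name in local_pages:                      # 1. exact match
        return local_pages[name]
    underscored = name.replace(" ", "_")
    if underscored in local_pages:               # 2. spaces -> underscores
        return local_pages[underscored]
    lowered = {k.lower(): v for k, v in local_pages.items()}
    if underscored.lower() in lowered:           # 3. case-insensitive
        return lowered[underscored.lower()]
    return local_pages.get(f"{name}/index")      # 4. index page fallback
```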
## OpenWetWare-Specific Reference
### URL Patterns

| Pattern | Example | Meaning |
|---|---|---|
| `/wiki/Namespace:Page` | `/wiki/Beauchamp:Publications` | Wiki page |
| `/wiki/File:Name.ext` | `/wiki/File:RAVE_Logo.jpg` | File description page (NOT the file itself) |
| `/wiki/User:Name` | `/wiki/User:John_Doe` | User profile (skip) |
| `?action=edit&redlink=1` | `?title=Beauchamp:Foo&action=edit&redlink=1` | Non-existent page |
### CDN URL Patterns

| Host | Purpose | Example |
|---|---|---|
| `oww-files-public.sfo3.cdn.digitaloceanspaces.com` | Full-size files | `https://oww-files-public.sfo3.cdn.../a/ab/Paper.pdf` |
| `oww-files-thumb.sfo3.cdn.digitaloceanspaces.com` | Thumbnails | `https://oww-files-thumb.sfo3.cdn.../a/ab/Image.jpg/250px-Image.jpg` |
### Thumbnail → Full-Size Conversion

```
Thumbnail: https://oww-files-thumb.sfo3.cdn.../a/ab/Image.jpg/250px-Image.jpg
                   ^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^^^
                   replace with                              drop this segment
                   oww-files-public
Full-size: https://oww-files-public.sfo3.cdn.../a/ab/Image.jpg
```
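In Python the same conversion is a few lines; a sketch with a hypothetical helper name:

```python
def thumb_to_fullsize(url: str) -> str:
    """Convert an oww-files-thumb URL to its full-size oww-files-public form."""
    if "oww-files-thumb" not in url:
        return url  # already a full-size URL
    url = url.replace("oww-files-thumb", "oww-files-public")
    return url.rsplit("/", 1)[0]  # drop the trailing /250px-Image.jpg segment
```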
### MediaWiki API for File Resolution

```
GET {WIKI_BASE}/mediawiki/api.php?action=query&titles=File:{name}&prop=imageinfo&iiprop=url&format=json
```

Response:

```json
{
  "query": {
    "pages": {
      "12345": {
        "imageinfo": [{
          "url": "https://oww-files-public.sfo3.cdn.../a/ab/Image.jpg"
        }]
      }
    }
  }
}
```
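A sketch of the lookup with `requests` (endpoint and response shape as documented above; the function name is hypothetical):

```python
import requests

WIKI_BASE = "https://openwetware.org"

def resolve_file_url(name: str):
    """Return the CDN URL behind /wiki/File:{name}, or None if not found."""
    resp = requests.get(
        f"{WIKI_BASE}/mediawiki/api.php",
        params={
            "action": "query",
            "titles": f"File:{name}",
            "prop": "imageinfo",
            "iiprop": "url",
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Page IDs are unknown in advance, so scan all returned pages.
    for page in resp.json().get("query", {}).get("pages", {}).values():
        info = page.get("imageinfo")
        if info:
            return info[0]["url"]
    return None
```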
### Content Extraction

Target: `<div id="mw-content-text">...</div>`

Remove before conversion:

- `<span class="mw-editsection">` — "[edit]" links
- `<div class="catlinks">` — category links
- `<div id="toc">` — table of contents
- `Retrieved from "..."` — page footer
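With BeautifulSoup (installed in step 1), the extraction might look like this sketch:

```python
from bs4 import BeautifulSoup

def extract_content(html: str) -> str:
    """Keep only #mw-content-text and strip the chrome listed above."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", id="mw-content-text")
    if content is None:
        return ""
    for node in content.select("span.mw-editsection, div.catlinks, div#toc"):
        node.decompose()
    # The 'Retrieved from "..."' footer is plain text; strip it from the
    # converted Markdown afterwards if it survives extraction.
    return str(content)
```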
### Broken Link Markers

In converted Markdown, broken links use these conventions:

```markdown
[Link text](# "WikiName - not archived")        <!-- Page doesn't exist locally -->
[Link text](# "File:name.jpg - not available")  <!-- File couldn't be downloaded -->
~~Page Name~~ *(page not archived)*             <!-- Page returned 404 -->
```
## Troubleshooting

| Problem | Solution |
|---|---|
| `audit` keeps finding references | Run `fix` again — each pass catches ~60-80% |
| Title attributes break regex | Use `r'(\S+?)(\s+"[^"]*")?$'` to split URL from title |
| Inline Python in shell fails | Write a `.py` file instead — never use `python -c "..."` with URLs |
| Images missing after migration | Check for `oww-files-thumb` URLs — convert to `oww-files-public` |
| `/wiki/File:` links still external | Use the MediaWiki API to resolve, not direct URL construction |
| `(archived) (archived)` duplication | Fix script ran twice — the `fix` command handles this pattern |
| Nested `[![](src)](href)` broken | Both inner `src` and outer `href` need separate resolution |