On Linux how to get first HTML `pre` innerHTML?
Jun 3, 1:43pm

To get the innerHTML of the first <pre> element in an HTML document on Linux, you can use command-line tools like curl (to fetch the HTML) and xmllint or pup (to parse and extract the content).

Method 1: Using curl + pup (Recommended)

If you have pup installed (a command-line HTML parser), you can do:

bash
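curl -s <URL> | pup 'pre:first-of-type'
  • curl -s <URL> fetches the HTML silently.
  • pup 'pre:first-of-type' selects each <pre> that is the first <pre> among its siblings and prints it as HTML. If <pre> elements sit under different parents, more than one can match.
  • Don't use head -n 1 to pick the first element: it keeps the first line of output, not the first <pre>.

pup prints the matched element as HTML, outer <pre> tags included. If you want only its text content, add the text{} display function:

bash
curl -s <URL> | pup 'pre:first-of-type text{}'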

Method 2: Using curl + xmllint (XPath)

If you prefer xmllint (part of libxml2), you can use XPath:

bash
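curl -s <URL> | xmllint --html --xpath '(//pre)[1]' - 2>/dev/null
  • --html parses the input as HTML.
  • --xpath '(//pre)[1]' selects the first <pre> element in document order. Without the parentheses, '//pre[1]' matches every <pre> that is the first <pre> child of its parent.
  • 2>/dev/null hides the parser's warnings about markup it does not recognize.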

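To get only the text inside the <pre> (tags removed, entities decoded), use string():

bash
curl -s <URL> | xmllint --html --xpath 'string((//pre)[1])' - 2>/dev/null

Note that this is the text content, not the innerHTML: any markup nested inside the <pre> is dropped. To keep it, print the element with '(//pre)[1]' and strip the outer <pre> tags yourself.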
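
To make the XPath details concrete, here is a minimal sketch that runs xmllint on a local sample instead of a URL. The sample markup and the function names (count_matches, first_pre_text, first_pre_inner) are made up for illustration, and the sed step only approximates innerHTML, since xmllint re-serializes the element rather than echoing the original bytes.

```shell
#!/usr/bin/env bash
# Hypothetical sample: two <pre> elements that live under different parents.
sample='<html><body><div><pre class="a">one <b>bold</b> &lt;tag&gt;</pre></div><div><pre>two</pre></div></body></html>'

# Count how many nodes an XPath expression matches in the sample.
count_matches() {
  printf '%s' "$sample" | xmllint --html --xpath "count($1)" - 2>/dev/null
}

# Text content of the first <pre> on stdin: nested markup dropped, entities decoded.
first_pre_text() {
  xmllint --html --xpath 'string((//pre)[1])' - 2>/dev/null
}

# Approximate innerHTML: serialize the first <pre>, then strip its outer tags.
first_pre_inner() {
  xmllint --html --xpath '(//pre)[1]' - 2>/dev/null |
    sed -e '1s/^<pre[^>]*>//' -e '$s/<\/pre>$//'
}

# Demo, run only when xmllint is installed.
if command -v xmllint >/dev/null 2>&1; then
  count_matches '//pre[1]'     # every <pre> that is first within its parent
  count_matches '(//pre)[1]'   # only the first <pre> in document order
  printf '%s' "$sample" | first_pre_text
  printf '%s' "$sample" | first_pre_inner
fi
```

The two counts differ on this sample, which is why the parentheses matter. To use the functions on a real page, pipe curl -s <URL> into first_pre_text or first_pre_inner.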

Method 3: Using grep (Quick & Dirty)

If the HTML is simple and well-formatted, you can use grep:

bash
curl -s <URL> | grep -oP '<pre>\K[^<]*' | head -n 1
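  • -o prints only the matched text; -P enables Perl-compatible regex (GNU grep).
  • \K drops the <pre> tag from the reported match.
  • [^<]* captures everything up to the next < on the same line.

⚠️ Warning: grep works line by line, so this returns only the first line of a multi-line <pre>. It also fails if the tag has attributes (e.g. <pre class="code">) or if the content contains nested tags such as <b>.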
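
If you still want a regex-only approach, the limits above can be loosened a little. The sketch below assumes GNU grep built with -P support and GNU head (for -z); the function name first_pre_regex and the sample markup are made up for illustration. It is still a regex run over HTML, so it can be fooled by comments, scripts, or nested <pre> elements.

```shell
#!/usr/bin/env bash
# Print the inner markup of the first <pre>...</pre> read from stdin.
#   -z    treat the whole input as one record, so a match may span lines
#   -a    force text mode
#   (?s)  let . match newlines; .*? stops at the nearest </pre>
#   \b    keeps <pre ...> from also matching tags like <prefix>
first_pre_regex() {
  grep -aozP '(?s)<pre\b[^>]*>\K.*?(?=</pre>)' | head -z -n 1 | tr -d '\0'
}

# Hypothetical sample: first <pre> has an attribute, two lines, and a nested tag.
sample='<div><pre class="code">line 1
line <b>2</b></pre></div><pre>other</pre>'

printf '%s' "$sample" | first_pre_regex
```

With -z, grep ends each match with a NUL byte, so head -z -n 1 keeps the first match and tr -d '\0' removes the terminator. On a real page, use curl -s <URL> | first_pre_regex.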


Which to Use?

  • Best for reliability: pup (Method 1)
  • Best for XPath: xmllint (Method 2)
  • Quick & dirty: grep (Method 3)