URL Scraper Script


I wrote this command-line filter to scrape HTML, XML, and plaintext documents for the URLs that they contain. It reads from standard input and writes the results to standard output, one URL per line.

The heart of the script is the following regular expression, in Python:

url_pattern = re.compile(r'''["']http://[^'"]*?['"]''')

which looks for URLs enclosed in quotes ("" or '') that start with http://. The negated character class [^'"] keeps the match from running past the closing quote, and the non-greedy *? stops at the first quote it finds.
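Around that pattern, the whole filter needs only a few more lines: read stdin, find the matches, and print them. A minimal sketch might look like the following (the capture group is an assumption on my part, so that only the URL, not its surrounding quotes, is printed):

```python
import re
import sys

# URLs between matching quotes, starting with http://.
# The capture group excludes the quotes from the result.
url_pattern = re.compile(r'''["'](http://[^'"]+)["']''')

def extract_urls(text):
    """Return all quoted http:// URLs found in text, in order."""
    return url_pattern.findall(text)

if __name__ == "__main__":
    # Filter behavior: stdin in, one URL per line out.
    for url in extract_urls(sys.stdin.read()):
        print(url)
```

Reading all of standard input at once keeps the code simple and lets the pattern match regardless of line breaks between attributes.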

The script is most useful if you mix and match it with sort, uniq, and grep. Unix geeks know the drill. For example:

[beaker] ~> wget -O - <url> | ./<script> | sort | uniq

The script

Download the script.