I wrote the url_scrape.py command-line filter to scrape HTML, XML, and plaintext documents for the URLs they contain. It reads from standard input and writes the URLs it finds to standard output, one per line.
The heart of the script is the following regular expression, in Python:
url_pattern = re.compile('''["']http://[^+]*?['"]''')
which looks for URLs between single or double quotes that start with http://.
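The full script isn't reproduced here, but a minimal sketch of such a filter might look like this (the scrape helper is my own naming for illustration; only url_pattern comes from the script itself):

```python
#!/usr/bin/env python
import re
import sys

# Match quoted URLs: an opening quote, "http://", then a lazy run of
# characters up to the first closing quote.
url_pattern = re.compile('''["']http://[^+]*?['"]''')

def scrape(text):
    """Return the URLs found in text, with the surrounding quotes stripped."""
    return [match[1:-1] for match in url_pattern.findall(text)]

if __name__ == '__main__':
    for url in scrape(sys.stdin.read()):
        print(url)
```

Because the pattern has no capture groups, findall() returns the whole quoted match, so the slice [1:-1] trims the quotes off each result.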
The script is most useful if you mix and match it with grep. Unix geeks know the drill. For example:
[beaker] ~> wget -O - http://madphilosopher.ca/ | ./url_scrape.py | sort | uniq
http://coralcdn.com/
http://del.icio.us/madphilosopher
http://del.icio.us/madphilosopher/comments
http://denyhosts.sourceforge.net/
http://en.wikipedia.org/wiki/Slashdot_effect
http://geourl.org/near/?p=http://madphilosopher.ca/
http://gmpg.org/xfn/
http://gmpg.org/xfn/11
http://home.online.no/~kafox/blogfiles/
http://iloveradio.org
http://madphilosopher.ca
http://madphilosopher.ca/2003/01/
http://madphilosopher.ca/2003/02/
http://madphilosopher.ca/2003/03/
http://madphilosopher.ca/2003/04/
...
http://madphilosopher.ca/images/delicious_links.gif
http://madphilosopher.ca/images/geourl.gif
http://madphilosopher.ca/images/get_firefox.gif
http://madphilosopher.ca/images/rss2.gif
...