Categories

URL Scraper Script

Description

I wrote the url_scrape.py command-line filter to scrape HTML, XML, and plaintext documents for the URLs that they contain. It reads from standard input and gives the results on standard output, one URL per line.

The heart of the script is the following regular expression, in Python:

url_pattern = re.compile('''["']http://[^+]*?['"]''')

which basically looks for URLs between quotes ("" or '') that start with http://.

The script is most useful if you mix and match it with sort, uniq, and grep. Unix geeks know the drill. For example:

[beaker] ~> wget -O - https://madphilosopher.ca/ | ./url_scrape.py | sort | uniq
http://coralcdn.com/
http://del.icio.us/madphilosopher
http://del.icio.us/madphilosopher/comments
http://denyhosts.sourceforge.net/
http://en.wikipedia.org/wiki/Slashdot_effect
http://geourl.org/near/?p=https://madphilosopher.ca/
http://gmpg.org/xfn/
http://gmpg.org/xfn/11
http://home.online.no/~kafox/blogfiles/
http://iloveradio.org
https://madphilosopher.ca
https://madphilosopher.ca/2003/01/
https://madphilosopher.ca/2003/02/
https://madphilosopher.ca/2003/03/
https://madphilosopher.ca/2003/04/
...
https://madphilosopher.ca/images/delicious_links.gif
https://madphilosopher.ca/images/geourl.gif
https://madphilosopher.ca/images/get_firefox.gif
https://madphilosopher.ca/images/rss2.gif
...

The script

Download the url_scrape.py script.