Get a list of URLs/domains from a text file
// November 5th, 2008 // linux
I needed a little script that extracts all URLs from a text file. Here is the result (replace FILENAME with the path to your file).
sed 's/http/\^http/g' FILENAME | tr -s "^" "\n" | grep http| sed 's/[\ |\\\|\"].*//g' | sed "s/['].*//g" | sort | uniq
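To see what each stage does, here is the same pipeline split over several lines and run against a small sample file. The sample file, its contents and the function name extract_urls are made up for illustration; the commands themselves are the ones from the one-liner above.

```shell
# Hypothetical sample input; the file and its contents are made up for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Visit https://www.example.com/page and "http://blog.example.org/post?id=1" today.
<a href='https://www.example.com/page'>again</a>
EOF

# The one-liner from the post, wrapped in a function that takes the file name.
extract_urls() {
  sed 's/http/\^http/g' "$1" |   # put a ^ marker in front of every "http"
    tr -s "^" "\n" |             # turn each marker into a line break
    grep http |                  # keep only the pieces that contain a URL
    sed 's/[\ |\\\|\"].*//g' |   # cut at the first space, |, backslash or "
    sed "s/['].*//g" |           # cut at the first single quote
    sort | uniq                  # de-duplicate
}

extract_urls "$sample"
```

For this sample the function prints the two distinct URLs, one per line. Because the first step marks every occurrence of "http", https links are picked up as well, and so is any plain word that contains "http".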
My final goal was a list of domain names that I could later use in a PHP script, so here is the hardcore version. It prints a copy-and-paste array definition of all domains found in the file.
echo -n '$domains = array ( "' ; sed 's/http/\^http/g' FILENAME | tr -s "^" "\n" | grep http| sed 's/[\ |\\\|\"].*//g' | sed "s/['].*//g" | sort | uniq | awk 'BEGIN{FS="/"}{print $3}' | cut -d . -f 2- | grep -E '^[a-z]+\.[a-z]+$' | sort | uniq | tr "\n" "," | sed "s/,/\", \"/ig" | sed 's/, \"$//ig'; echo -n ' );'
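The second half of that one-liner is easier to follow when split in two: reducing URLs to domains, then formatting the list as PHP. The sketch below keeps the original domain stages unchanged but swaps the tr/sed formatting chain for a single awk call, which is my substitution and not what the one-liner does. The function names urls_to_domains and php_array and the sample URLs are assumptions for illustration.

```shell
# Reduce URLs (one per line on stdin) to domains, using the stages from the post.
urls_to_domains() {
  awk 'BEGIN{FS="/"}{print $3}' |   # host name: third field when splitting on /
    cut -d . -f 2- |                # drop the first label, e.g. "www"
    grep -E '^[a-z]+\.[a-z]+$' |    # keep only lowercase name.tld entries
    sort | uniq
}

# Format domains (one per line on stdin) as a PHP array definition.
# This uses awk instead of the original tr/sed chain.
php_array() {
  awk 'BEGIN { printf "$domains = array ( " }
       { printf "%s\"%s\"", (NR > 1 ? ", " : ""), $0 }
       END { print " );" }'
}

# Hypothetical sample URLs, made up for illustration.
printf '%s\n' \
  'http://blog.example.org/post?id=1' \
  'https://www.example.com/page' \
  'http://example.net/' |
  urls_to_domains | php_array
# prints: $domains = array ( "example.com", "example.org" );
```

Note what the domain stages imply: example.net is missing from the output because cutting off the first label leaves only "net", which the grep filter rejects. Hosts without a leading label such as www, and names containing digits or hyphens, are dropped in the same way.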





I spent over an hour googling around before I found your elegant script for getting this job done. Thanks for posting it!
Thank you for your one liner!
Does the exact job for me.