Get a list of URLs/domains from a text file

// November 5th, 2008 // linux

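I needed a little script that extracts all URLs from a text file. Here is the result. It puts every occurrence of http at the start of its own line, cuts each line off at the first space, pipe, backslash, or quote, and prints the sorted, de-duplicated list. Replace FILENAME with the path to your file.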

sed 's/http/\^http/g' FILENAME | tr -s "^" "\n" | grep http| sed 's/[\ |\\\|\"].*//g' | sed "s/['].*//g" | sort | uniq
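For comparison, here is a minimal sketch of the same idea built on grep -o instead of the sed/tr splitting trick. It is an alternative, not the one-liner above: the function name extract_urls is made up for this example, and it assumes a grep that supports -o (GNU and BSD grep both do). Unlike the one-liner, it only matches text starting with http:// or https://, and it also stops at angle brackets.

```shell
# Hypothetical helper (not part of the original one-liner):
# print every distinct http/https URL found in the file "$1", one per line.
extract_urls() {
    # -o prints only the matched text, -E enables extended regular expressions.
    # A URL is taken to end at whitespace, a quote, an angle bracket,
    # a pipe, or a backslash.
    # LC_ALL=C keeps the sort order independent of the current locale.
    grep -oE "https?://[^[:space:]\"'<>|\\\\]+" "$1" | LC_ALL=C sort -u
}
```

Call it as extract_urls FILENAME. A URL that is directly followed by sentence punctuation (a trailing dot or comma) will still carry that character, just as with the one-liner.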

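My final goal was to extract a list of domain names that I could later use in a PHP script, so here is the hardcore version, which prints a copy-and-paste array definition of all domains found in the file. Note that it keeps only hosts of the form www.example.com: it drops the first label of each hostname and then accepts only a lowercase, letters-only name followed by a TLD, so bare hosts such as example.com and names containing digits or hyphens are skipped.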

echo -n '$domains = array ( "' ; sed 's/http/\^http/g' FILENAME | tr -s "^" "\n" | grep http| sed 's/[\ |\\\|\"].*//g' | sed "s/['].*//g" | sort | uniq | awk 'BEGIN{FS="/"}{print $3}' | cut -d . -f 2- | grep -E '^[a-z]+\.[a-z]+$' | sort | uniq | tr "\n" "," | sed "s/,/\", \"/ig" | sed 's/, \"$//ig'; echo -n ' );'
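If the one-liner above is hard to read, its second half can be sketched as a small function. This is only an illustration under a few assumptions: the name urls_to_php_array is made up, it reads URLs (one per line) from standard input instead of a file, and it lets awk build the array so that it does not depend on echo -n or on the I flag of sed, which is a GNU extension. The domain filtering is deliberately the same as above.

```shell
# Hypothetical helper: read URLs (one per line) on stdin and print a
# PHP array definition of the domains they contain.
urls_to_php_array() {
    # Splitting a URL on "/" leaves the host name in field 3.
    awk -F/ '{ print $3 }' |
        # Drop the first label: www.example.com -> example.com
        cut -d . -f 2- |
        # Keep only lowercase, letters-only "name.tld" results.
        grep -E '^[a-z]+\.[a-z]+$' |
        LC_ALL=C sort -u |
        # Emit the array; a comma goes before every entry except the first.
        awk 'BEGIN { printf "$domains = array (" }
             { printf "%s \"%s\"", (NR > 1 ? "," : ""), $0 }
             END { print " );" }'
}
```

Pipe any list of URLs into it, for example the output of the first one-liner. When no domain survives the filter, it prints an empty but still valid array, $domains = array ( );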