Java JRE, JDK, Tomcat and Nutch search-engine
Install Tomcat
Verify Java JRE and JDK are there:
java -version
javac -help
Install Tomcat (the admin and docs packages are convenient but optional):
sudo apt-get install tomcat6 tomcat6-admin tomcat6-docs
Disable/enable auto startup, start/stop/restart service:
sudo update-rc.d tomcat6 disable
sudo update-rc.d tomcat6 enable
sudo /etc/init.d/tomcat6 start
sudo /etc/init.d/tomcat6 stop
sudo /etc/init.d/tomcat6 restart
Set JAVA_HOME:
update-alternatives --query java
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
gedit ~/.bashrc
Stuff we added to make Tomcat go:
export CLASSPATH=/usr/local/tomcat6/common/lib/jsp-api.jar:/usr/local/tomcat/common/lib/servlet-api.jar
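The two entries above sit under different base directories (/usr/local/tomcat6 and /usr/local/tomcat), so it may be worth checking that each entry actually exists. A minimal sketch; the function name missing_classpath_entries is made up for this example and is not part of Tomcat or Java:

```shell
# Print every CLASSPATH entry that does not exist on disk.
# missing_classpath_entries is a hypothetical helper name.
missing_classpath_entries() {
    # Split the argument on ':' and test each entry in turn.
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
        [ -n "$entry" ] || continue                      # skip empty entries
        [ -e "$entry" ] || printf 'missing: %s\n' "$entry"
    done
}

# Usage: missing_classpath_entries "$CLASSPATH"
```

No output means every entry was found.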
Nutch
Sources:
- http://wiki.apache.org/nutch/NutchTutorial
- http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
- http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
- http://thewiki4opentech.org/index.php/Nutch
- http://thewiki4opentech.org/index.php/Bash_Scripts_for_Nutch_to_automatize_Nutch
Deploy nutch-1.2.war in Tomcat
NB: we tried to index local files, but this did not work, possibly because of security settings that I would like to keep as they are.
Base-directory of extract in this example:
cd /home/wilbert/Downloads/nutch-1.2
Give http.agent.name a value like "YOURNAME Spider":
gedit ./conf/nutch-default.xml
gedit ./conf/crawl-urlfilter.txt
# skip ftp:, & mailto: urls - removed "file" to allow files
-^(ftp|mailto):
# accept hosts in MY.DOMAIN.NAME - wjv5, local files
+^http://([a-z0-9]*\.)*wjv5/
+^file:///media
./conf/regex-urlfilter.txt (not sure if this is used):
# skip file: ftp: and mailto: urls - removed "file" to allow files
-^(ftp|mailto):
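To see what these rules do to a given URL, the first-match logic can be imitated in the shell. This is only a rough sketch under assumptions: it assumes Nutch reads the rules top to bottom, lets the first matching rule decide, and rejects a URL that matches nothing. It uses grep -E rather than Java regular expressions, which should behave the same for the simple patterns here. The function name url_allowed is made up.

```shell
# Rough imitation of first-match URL filtering (hypothetical helper).
# $1 = URL, $2 = filter file whose rules start with + (accept) or - (reject).
# Returns 0 if the URL is accepted, 1 if it is rejected or matches no rule.
url_allowed() {
    url=$1
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;              # skip blank lines and comments
        esac
        sign=${rule%"${rule#?}"}              # first character: + or -
        pattern=${rule#?}                     # rest of the line: the regex
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            [ "$sign" = "+" ]                 # accept on +, reject on -
            return
        fi
    done < "$2"
    return 1                                  # no rule matched: rejected
}

# Usage: url_allowed "file:///media/some/file.pdf" ./conf/crawl-urlfilter.txt && echo accepted
```

In this sketch a URL with a port, such as http://wjv5:8080/docs/index.html, is not accepted by the rules above, because the host pattern requires a slash directly after wjv5.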
By default the "file plugin" is disabled. Add "protocol-file" to plugin.includes alongside all other plugins, and set searcher.dir to the location of the crawl within the deployment:
gedit ./conf/nutch-site.xml
<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|tika|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>searcher.dir</name>
<value>webapps/nutch/crawl</value>
</property>
</configuration>
Set JAVA_HOME, needed for indexing:
update-alternatives --config java
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
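Instead of typing the path by hand, JAVA_HOME can be derived from the location of the java binary. A minimal sketch, assuming the binary lives in a bin directory directly under the JDK directory or under its jre subdirectory; java_home_from_bin is a made-up helper name:

```shell
# Derive a JAVA_HOME candidate from the full path of a java binary by
# stripping the trailing /bin/java and, if present, a trailing /jre.
# java_home_from_bin is a hypothetical helper name.
java_home_from_bin() {
    dir=${1%/bin/java}
    printf '%s\n' "${dir%/jre}"
}

# Usage, assuming java is on the PATH and readlink -f is available (GNU coreutils):
# export JAVA_HOME="$(java_home_from_bin "$(readlink -f "$(command -v java)")")"
```

The readlink -f step resolves the /etc/alternatives symlinks, so the result follows whatever update-alternatives currently selects.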
Make seed-dir "urls" with seed-files - these are starting points for the crawl:
cd /home/wilbert/Downloads/nutch-1.2
mkdir -p ./urls
rm -f ./urls/*
Local files:
find /media/DataSda6/opslag -type f -name '*' | grep -i '\.\(htm\|html\|doc\|pdf\|txt\)$' | sed 's/.*/file:\/\/&/' > ./urls/local
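The same pipeline can be wrapped in a small function so it is easy to try on another directory first. A sketch only: seed_urls is a made-up name, the directory must be given as an absolute path so that the result starts with file:///, and special characters such as spaces in file names are not URL-encoded.

```shell
# Print one file:// URL per matching file under the given directory.
# seed_urls is a hypothetical helper; the extension list mirrors the command above.
seed_urls() {
    find "$1" -type f \
        | grep -iE '\.(htm|html|doc|pdf|txt)$' \
        | sed 's|^|file://|' \
        | sort
}

# Usage: seed_urls /media/DataSda6/opslag > ./urls/local
```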
Tomcat docs as example:
echo "http://wjv5:8080/docs/index.html" > ./urls/tomcat
Remove old crawl, make new one:
rm -r ./crawl
bin/nutch crawl urls -dir crawl -depth 10 -topN 100000
Search for "apache":
bin/nutch org.apache.nutch.searcher.NutchBean apache
Copy/replace to tomcat base for deployed nutch:
sudo rm -r /var/lib/tomcat6/webapps/nutch/crawl
sudo cp -r ./crawl /var/lib/tomcat6/webapps/nutch/
The deployment also needs the modified conf files:
from=/home/wilbert/Downloads/nutch-1.2/conf/
to=/var/lib/tomcat6/webapps/nutch/WEB-INF/classes
sudo cp ${from}crawl-urlfilter.txt ${to}
sudo cp ${from}regex-urlfilter.txt ${to}
sudo cp ${from}nutch-default.xml ${to}
sudo cp ${from}nutch-site.xml ${to}
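The four copies above can also be written as a loop that stops at the first failure. A minimal sketch; copy_conf is a made-up helper name, and cp needs sudo in front of it (or a root shell) when the target directory belongs to root:

```shell
# Copy the four modified Nutch config files from one directory to another.
# copy_conf is a hypothetical helper; $1 = source dir, $2 = target dir.
copy_conf() {
    from=$1
    to=$2
    for f in crawl-urlfilter.txt regex-urlfilter.txt nutch-default.xml nutch-site.xml; do
        cp "$from/$f" "$to/" || return 1     # stop if any file fails to copy
    done
}

# Usage: copy_conf /home/wilbert/Downloads/nutch-1.2/conf /var/lib/tomcat6/webapps/nutch/WEB-INF/classes
```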
Restart Tomcat:
sudo /etc/init.d/tomcat6 restart