Java JRE, JDK, Tomcat and Nutch search-engine



Install Tomcat

Verify Java JRE and JDK are there:

java -version
javac -help
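
As a rough sketch, the same check can be scripted so that a missing tool is reported instead of producing a "command not found" error. The helper name have_cmd is made up for this example.

```shell
# Report whether a command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Check for the JRE launcher (java) and the JDK compiler (javac).
for tool in java javac; do
    if have_cmd "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: missing"
    fi
done
```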

Install Tomcat (the admin and docs packages are convenient but optional):

sudo apt-get install tomcat6 tomcat6-admin tomcat6-docs

Disable/enable auto startup, start/stop/restart service:

sudo update-rc.d tomcat6 disable
sudo update-rc.d tomcat6 enable
sudo /etc/init.d/tomcat6 start
sudo /etc/init.d/tomcat6 stop
sudo /etc/init.d/tomcat6 restart

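Set JAVA_HOME. Look up the Java path, export it for the current shell, then add the same export line to ~/.bashrc to make it permanent: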

update-alternatives --query java
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
gedit ~/.bashrc
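
The value for JAVA_HOME can also be derived from the path of the java binary instead of being typed by hand. The sketch below assumes the usual layout in which the binary sits under bin/ (optionally inside a jre/ directory) below the Java home; the function name java_home_from_bin is made up for this example.

```shell
# Derive a JAVA_HOME candidate from the full path of the java binary by
# stripping the trailing /bin/java and, if present, a /jre component.
java_home_from_bin() {
    path=$1
    path=${path%/bin/java}
    path=${path%/jre}
    printf '%s\n' "$path"
}

# Example with a path in the OpenJDK 6 directory used on this page:
java_home_from_bin /usr/lib/jvm/java-6-openjdk/jre/bin/java
# prints /usr/lib/jvm/java-6-openjdk
```

On a system with GNU readlink, the real binary path can be obtained with readlink -f "$(command -v java)" and passed to the function.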

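Extra setting we added to get Tomcat working (note that the two jar paths point at different directories, tomcat6 and tomcat; adjust both to your own Tomcat location):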

export CLASSPATH=/usr/local/tomcat6/common/lib/jsp-api.jar:/usr/local/tomcat/common/lib/servlet-api.jar

Nutch


Deploy nutch-1.2.war in Tomcat

NB: we tried to index local files, but this did not work. The cause may be security settings, which I prefer to leave as they are.

Base directory of the extracted Nutch archive in this example:

cd /home/wilbert/Downloads/nutch-1.2

Give http.agent.name a value like "YOURNAME Spider":

gedit ./conf/nutch-default.xml

Edit the URL filter in crawl-urlfilter.txt:

gedit ./conf/crawl-urlfilter.txt
# skip ftp: and mailto: urls - removed "file" to allow files
-^(ftp|mailto):

# accept hosts in MY.DOMAIN.NAME - wjv5 (with optional port), local files
+^http://([a-z0-9]*\.)*wjv5(:[0-9]+)?/
+^file:///media
Make the same change in regex-urlfilter.txt (not sure whether this file is used):

gedit ./conf/regex-urlfilter.txt
# skip file: ftp: and mailto: urls - removed "file" to allow files
-^(ftp|mailto):

By default the file plugin is disabled. In nutch-site.xml, add "protocol-file" to plugin.includes along with all the other plugins you need, and set searcher.dir to the location of the crawl within the deployment:

gedit ./conf/nutch-site.xml
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|tika|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>webapps/nutch/crawl</value>
  </property>
</configuration>

Set JAVA_HOME, needed for indexing:

update-alternatives --config java  
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

Create the seed directory "urls" to hold the seed files, which list the starting points for the crawl:

cd /home/wilbert/Downloads/nutch-1.2
mkdir -p ./urls
rm -f ./urls/*

Local files:

find /media/DataSda6/opslag -type f -name '*' | grep -i '\.\(htm\|html\|doc\|pdf\|txt\)$' | sed 's/.*/file:\/\/&/' > ./urls/local
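
The pipeline above can be wrapped in a small function so that it is easy to rerun for other directories. This is a sketch with a made-up function name, make_file_seeds; it assumes the directory is given as an absolute path, since the file:// prefix is simply put in front of whatever find prints.

```shell
# Print a file:// URL for every document under DIR whose extension is
# htm, html, doc, pdf or txt (case-insensitive), one URL per line.
# DIR must be an absolute path so the result has the form file:///...
make_file_seeds() {
    find "$1" -type f \
        | grep -Ei '\.(htm|html|doc|pdf|txt)$' \
        | sed 's|^|file://|'
}
```

Usage: make_file_seeds /media/DataSda6/opslag > ./urls/local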

Tomcat docs as example:

echo "http://wjv5:8080/docs/index.html" > ./urls/tomcat

Remove the old crawl and make a new one:

rm -r ./crawl
bin/nutch crawl urls -dir crawl -depth 10 -topN 100000
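
Before using the result it may be worth checking that the crawl actually produced its output directories. The sketch below is a generic helper with a made-up name, has_subdirs; the subdirectory names in the usage line (crawldb, segments, index) are an assumption about what this Nutch version writes, so compare them with your own crawl directory.

```shell
# Return success only if DIR contains every subdirectory named after it.
# Prints the first missing one to stderr.
has_subdirs() {
    dir=$1
    shift
    for sub in "$@"; do
        [ -d "$dir/$sub" ] || { echo "missing: $dir/$sub" >&2; return 1; }
    done
}
```

Usage: has_subdirs ./crawl crawldb segments index && echo "crawl looks complete"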

Search for "apache":

bin/nutch org.apache.nutch.searcher.NutchBean apache

Copy the crawl to the Tomcat base of the deployed Nutch, replacing the old copy:

sudo rm -r /var/lib/tomcat6/webapps/nutch/crawl
sudo cp -r ./crawl /var/lib/tomcat6/webapps/nutch/

The deployment also needs the modified conf files:

from=/home/wilbert/Downloads/nutch-1.2/conf/
to=/var/lib/tomcat6/webapps/nutch/WEB-INF/classes
sudo cp ${from}crawl-urlfilter.txt ${to}
sudo cp ${from}regex-urlfilter.txt ${to}
sudo cp ${from}nutch-default.xml ${to}
sudo cp ${from}nutch-site.xml ${to}
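
The four copies can also be written as a loop that stops at the first failure. This is a sketch with a made-up function name, copy_nutch_conf; it does not call sudo itself, so when the target is owned by root (as under /var/lib/tomcat6) it has to be run from a root shell.

```shell
# Copy the four Nutch config files edited on this page from FROM to TO.
# Stops at the first file that cannot be copied.
copy_nutch_conf() {
    from=$1
    to=$2
    for f in crawl-urlfilter.txt regex-urlfilter.txt nutch-default.xml nutch-site.xml; do
        cp "$from/$f" "$to/" || return 1
    done
}
```

Usage: copy_nutch_conf /home/wilbert/Downloads/nutch-1.2/conf /var/lib/tomcat6/webapps/nutch/WEB-INF/classes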

Restart Tomcat:

sudo /etc/init.d/tomcat6 restart