The fundamental idea of web searching is through web crawling, critical information about the website and the embedded links in the crawled website can be obtained. Based on the crawling results, a ranking algorithm could be applied, based on which one can search the website using keywords. Nowadays, we get so much into Google, which also takes the same way of performing the ranking of millions or probably billions of website for searching. Google, as a centralized searching service provider, they need to deal with so many websites in the market and as an end user, if we want our website to be seen by Google, we may need to submit the crawling request to Google – if not, Google still is able to find our website and perform crawling for it but that will probably take forever, since the reality is, as said above, Google needs to deal with those tons of websites on the market and our own website has little chance to be seen if we are not submitting the crawling request to Google. However, even though we submit the crawling request, still it may take some time and sometimes, for whatever reason, the crawling may not give back a full crawling for our site, etc. As an end user, if we want to have a dedicated search engine just for our website, we can go to the market and search for alternative search engines.
Yacy is the one that I found useful and here in this blog, I will note down the setup of
Yacy on my own VPS and its implementation in my own blog website (actually, the current blog website that you are visiting now).
Yacy provides docker image for the service, and to run the docker container to fire up the server, we can simply run the following command,
sudo docker run -d --name yacy_search_server -p 8092:8090 -p 8443:8443 -v yacy_search_server_data:/opt/yacy_search_server/DATA --restart unless-stopped --log-opt max-size=200m --log-opt max-file=2 yacy/yacy_search_server:aarch64-latest
8092 is my selected port number.
Here, concerning the proxy setup to make the
Yacyserver secure, we can refer to my another blog here.
The next step is to implement the
Yacy engine into our own website. In my case, I am using
jekyll to host the current blog on
GitHub. We can refer to the YAML configuration file here about the overall configuration of the whole site to be generated with
jekyll. Specifically, the search page link in the top menu is configured here. This further points to the
search.html file here, which further contains the inclusion of the
search-google.html file. The
search-google.html file then contains the HTML codes for hosting the search box which will lead up to the
Yacy engine. The file can be found here. The file was created by combining multiple pieces. The first piece is from the
Yacy server. Once we have our
Yacy server set up, we can visit the page, e.g.,
ya.iris-home.net in my case. Further, we go to
Administration (needs login) =>
Search Portal Integration (left-side menu) =>
Portal Configuration =>
Search Box Anywhere (top selection menu in the right-side panel). There, we can see the code snippet that we need to grab for the implementation of
Yacy. Here, I did not take them all but instead only the form part. Next, it seems that
Yacy does not provide the API interface with which we can directly provide our own website as the parameters to tell
Yacy to guide it to search our own website for the provided keywords.
Yacyis able to search our website, we need to crawl it first. To do this, we go to
Load Web Pages, Crawler, Then, we only need to fill in our website URL and hit the
Start New Crawland
Yacywill start the crawling. That’s all. Once the crawling is done, our website information will be available to
Yacyand the search over our website should be doable with