stopwords in exact title searches

stopwords in exact title searches

Charles McGrath, Katie

We’ve been struggling to configure title searches that contain stopwords (e.g. “The Help”). I found an old post in which Demian indicated that you can get some weird results when mixing stopworded and non-stopworded fields in searchspecs.yaml (https://sourceforge.net/p/vufind/mailman/message/34143745/), but I’m not sure how to find out which fields are stopworded and which aren’t. Is this configured somewhere? Are there other settings that affect how stopwords are handled? Right now it makes no difference whether the search terms are in quotes or not: the stopwords are ignored either way.

I thought that uncommenting the “ExactSettings” section of the searchspecs.yaml file might help with this issue, but doing so causes the search to return zero results.

Thanks in advance,

Katie McGrath
eiNetwork
Pittsburgh, PA

The information contained in this e-mail, and any attachment, is confidential and is intended solely for the use of the intended recipient. Access, copying or re-use of the e-mail or any attachment, or any information contained therein, by any other person is not authorized. If you are not the intended recipient please return the e-mail to the sender and delete it from your computer. Although we attempt to sweep e-mail and attachments for viruses, we do not guarantee that either are virus-free and accept no liability for any damage sustained as a result of viruses.
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech
Re: stopwords in exact title searches

Demian Katz

The way to figure out whether or not a field is filtering stopwords is to look at the Solr schema:

https://github.com/vufind-org/vufind/blob/master/solr/vufind/biblio/conf/schema.xml

Note that in the fieldType definitions at the top, some contain solr.StopFilterFactory and some do not. Logically enough, every field type containing solr.StopFilterFactory deletes stopwords. You can then look at the type attribute of each field definition to figure out whether or not a given field deletes stopwords.
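
If the schema is too long to check by eye, the lookup described above can be scripted. The following is only a minimal sketch: it assumes the schema uses plain fieldType and field elements with filter children, and the sample schema embedded in it (including the names text_stop, text_plain, title and title_exact) is invented for illustration rather than copied from VuFind.

```python
import xml.etree.ElementTree as ET

STOP_FILTER = "solr.StopFilterFactory"


def stopworded_fields(schema_xml):
    """Map each field name to True if its field type uses a stop filter."""
    root = ET.fromstring(schema_xml)
    # Names of all field types whose analyzer chain contains the stop filter.
    stop_types = {
        field_type.get("name")
        for field_type in root.iter("fieldType")
        if any(f.get("class") == STOP_FILTER for f in field_type.iter("filter"))
    }
    # A field deletes stopwords when its type attribute names one of those types.
    return {
        field.get("name"): field.get("type") in stop_types
        for field in root.iter("field")
    }


# Hypothetical, heavily shortened schema used only to exercise the function.
SAMPLE = """
<schema name="demo" version="1.6">
  <fieldType name="text_stop" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_plain" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="title" type="text_stop" indexed="true" stored="true"/>
  <field name="title_exact" type="text_plain" indexed="true" stored="true"/>
</schema>
"""

print(stopworded_fields(SAMPLE))  # {'title': True, 'title_exact': False}
```

To try this against a real schema, read the schema.xml file into a string and pass it to stopworded_fields; dynamicField definitions are not covered by this sketch.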

When you use solr.StopFilterFactory, the only thing you can really configure is which words are considered to be stopwords. There is no way to temporarily turn it off, because the filter literally deletes the stopwords before they can be stored in your index. The data can’t be found because it is simply not there; all that remains is a placeholder that indicates “some deleted word was here” for the purposes of determining word positions in phrases.
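
As a rough illustration of that behavior (a toy model in Python, not Solr's actual analysis code), the sketch below drops stopwords from a token stream but keeps each surviving word's original position, so the gap left by a deleted word is still visible. The stopword set is an assumption made up for the example.

```python
def analyze(text, stopwords):
    """Return (position, token) pairs, dropping stopwords but keeping gaps."""
    kept = []
    for position, token in enumerate(text.lower().split()):
        if token in stopwords:
            continue  # the word itself is never written to the index
        kept.append((position, token))
    return kept


STOPWORDS = {"the", "a", "of"}  # illustrative list only

print(analyze("The Help", STOPWORDS))  # [(1, 'help')]
print(analyze("Help", STOPWORDS))      # [(0, 'help')]
```

In this model both titles are indexed as the single word "help"; only the position differs, which is why a search cannot tell whether "the" was really there.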

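The reason that mixing stopworded and non-stopworded fields in a Dismax search causes problems is that each field analyzes the query with its own rules: a stopword is dropped from the query for the stopworded fields but kept as a search term for the others. The system then has to reconcile deleted and non-deleted words, and it doesn’t always bring back complete or appropriate result sets.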
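
One way to picture the mixed-field problem is the toy model below. It is a deliberate simplification, assuming a search where every query word must match in at least one field; real Dismax scoring and minimum-match handling are more involved. The field names title and author and the sample record are made up for the example.

```python
def analyze(word, stopwords):
    """Lowercase a word; return None if this field's analyzer deletes it."""
    word = word.lower()
    return None if word in stopwords else word


def index(record, field_stopwords):
    """Index each field of a record using that field's own stopword list."""
    indexed = {}
    for field, text in record.items():
        tokens = (analyze(word, field_stopwords[field]) for word in text.split())
        indexed[field] = {token for token in tokens if token is not None}
    return indexed


def matches(query, indexed, field_stopwords):
    """Toy rule: every query word must match in at least one searched field."""
    for word in query.split():
        # Build one clause per field; fields that delete the word contribute none.
        clauses = [(f, analyze(word, stops)) for f, stops in field_stopwords.items()]
        clauses = [(f, token) for f, token in clauses if token is not None]
        if not clauses:
            continue  # deleted by every field, so the word vanishes from the query
        if not any(token in indexed.get(f, set()) for f, token in clauses):
            return False  # some field still demands a word that no field contains
    return True


record = {"title": "The Help", "author": "Kathryn Stockett"}
title_only = {"title": {"the"}}               # one stopworded field
mixed = {"title": {"the"}, "author": set()}   # stopworded plus non-stopworded

print(matches("the help", index({"title": "The Help"}, title_only), title_only))  # True
print(matches("the help", index(record, mixed), mixed))                          # False
```

In the mixed case the word "the" survives only as a clause against the non-stopworded field, where the record does not contain it, so the record is missed even though its title is an exact match.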

As far as a solution is concerned, you might want to try reducing or completely removing the list of stopwords in your biblio/conf/stopwords.txt and then reindexing your records. Ideally, you would set up a second test instance of VuFind that you could compare side by side against the current instance. You might find that the search results are just as good or better if you leave the stopwords alone, and if so that may be an easier solution. Here at Villanova we ended up greatly reducing our stopword list from the defaults shipped with VuFind, and it solved some problems (though we do still consider “the” to be a stopword). The only cost, apart from the inevitable changes in relevance ranking, is that your index will be a little bit larger since it will be storing more words.
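
For anyone who wants to script the trimming step, here is a small sketch. It assumes the usual stopwords.txt layout of one word per line with "#" comment lines; the sample list is invented and is not VuFind's shipped default. The records still have to be reindexed afterwards, as noted above.

```python
def trim_stopwords(lines, keep):
    """Return the lines of a stopword file with every word outside `keep` removed.

    Blank lines and '#' comment lines are passed through unchanged.
    """
    trimmed = []
    for line in lines:
        word = line.strip()
        if not word or word.startswith("#") or word.lower() in keep:
            trimmed.append(line)
    return trimmed


# Made-up sample contents; a real file would be read from biblio/conf/stopwords.txt.
SAMPLE_LINES = ["# sample stopword list", "a", "an", "the", "of", "with"]

print(trim_stopwords(SAMPLE_LINES, keep={"the"}))  # ['# sample stopword list', 'the']
```

Passing an empty keep set removes every stopword and leaves only the comments, which corresponds to the "completely removing" option.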

I hope this is helpful, but please let me know if I can do anything more to help! You might also find it useful to check the archives of the solr-user list (or post a question there) to see if there have been any innovations in dealing with stopwords. With VuFind 4.0, we will be upgrading to Solr 6, and I haven’t yet had time to look at all of the new features of the last couple of Solr releases. Perhaps there are some new options I’m not yet aware of.

- Demian

From: Charles McGrath, Katie [mailto:[hidden email]]
Sent: Tuesday, May 02, 2017 4:35 PM
To: [hidden email]
Subject: [VuFind-Tech] stopwords in exact title searches