Optimization for alphabetic browse


Optimization for alphabetic browse

Tod Olson
Hi everyone,

We're seeing some slow responses and sometimes timeouts in our alphabetical browses, and I think there is some potential for optimization.

The basic symptom is that when we pull up a browse list where some of the list entries have large numbers of hits, VuFind is slow to return that list and sometimes even times out.

In our catalog, one example is a subject browse for "economics", which is slow; one of the entries has over 14K titles. "Finance", "mathematics", and "science" should also show the problem. I could also find this symptom at the National Library of Australia: a subject browse for "mathematics" is also slow, and the results have one entry with over 4K titles and another with over 2K titles. This is not a problem when the lists only have small numbers of titles, and therefore not a problem for sites with smaller collections.

Looking at the vufind-browse-handler code, I think I see the problem. It's all contained in BrowseRequestHandler.java. The starting point is Browse::populateItem. When populating a BrowseItem, the code searches the BibDB for the heading, loops over ALL of the matching record IDs from the Lucene indexes and copies them into the ids field, type List<String>. The title count is then set from ids.size().

I think that the ids field is used to build the link in the browse listing. When there are only a few titles (< 5? or set by a param?) then the browse list uses the record IDs to build the link for that browse listing, but if there are more than just a few it uses the heading to build the link. It seems to me like copying all of those record IDs into the ids field is a waste of effort for entries with large sets of matching titles, and might be a place to optimize the code.

Perhaps the first question is, is this a concern for any other sites? Anyone else interested in trying to optimize this? (I ask partly because it's best to do this with the community, and partly because I am low on cycles.)


The next bit is a possible optimization. IF dragging all of the record IDs out of the Lucene indexes and copying them into a List<String> is the likely source of the slowness, then maybe we can take a shortcut. The meat of the action is in BibDB::matchingIDs, which creates a SimpleCollector to iterate over all results and also collect any extra fields needed to populate the Browse display.

So the second question is, assuming this analysis is correct: under what conditions can we dispense with collecting all of the record IDs and just use the BibDB::recordCount method?


So there's a fun little project. :-) Is anyone else out there interested in this?


-Tod


Tod Olson <[hidden email]>
Systems Librarian
Interim Director for Integrated Library Systems
University of Chicago Library


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: Optimization for alphabetic browse

Demian Katz

Tod,

 

Thanks for the detailed analysis. I’m copying Mark Triggs on this in case he has any immediate thoughts, since he knows the code best.

 

It seems to me that one possible solution might be to add a setting to the handler (which could be configured through solrconfig.xml) for “maximum IDs to return.” Then this could be configured to match the threshold that VuFind currently uses for linking to IDs rather than to searches, which would probably solve the problem for 99.9% of users, while also leaving the ability to change the threshold without modifying code if there is a reason to do that in the future.
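A minimal sketch of what a "maximum IDs to return" cap could look like, assuming the handler visits each matching document once and that loading the record ID is the costly step. The class name CappedIdCollector, the maxIds parameter, and the idLoader callback are all hypothetical; none of them come from vufind-browse-handler, and reading the value from solrconfig.xml is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/**
 * Hypothetical sketch: counts every matching document but loads at most
 * maxIds record IDs. Not taken from vufind-browse-handler.
 */
class CappedIdCollector {
    private final int maxIds;                   // assumed to come from configuration
    private final IntFunction<String> idLoader; // stands in for the stored-field lookup
    private final List<String> ids = new ArrayList<>();
    private int count = 0;

    CappedIdCollector(int maxIds, IntFunction<String> idLoader) {
        this.maxIds = maxIds;
        this.idLoader = idLoader;
    }

    /** Called once per matching document number. */
    void collect(int docNumber) {
        count++; // counting stays cheap for every hit
        if (ids.size() < maxIds) {
            ids.add(idLoader.apply(docNumber)); // the expensive part, now bounded
        }
    }

    int getCount() {
        return count;
    }

    List<String> getIds() {
        return ids;
    }

    public static void main(String[] args) {
        CappedIdCollector collector = new CappedIdCollector(5, n -> "rec" + n);
        for (int doc = 0; doc < 14000; doc++) {
            collector.collect(doc);
        }
        System.out.println(collector.getCount() + " titles, "
                + collector.getIds().size() + " IDs kept");
    }
}
```

With a cap like this the title count can no longer be taken from ids.size(), so the count has to be tracked separately, as the sketch does.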

 

For what it’s worth, Villanova has over 2,000 entries associated with our “economics” heading, but I’m not seeing any lag when I browse to that point in the index. Perhaps that’s too low a number to have any effect, or perhaps it’s because our overall index is not that large… but in any case, we don’t seem to be suffering at the moment. That doesn’t mean I’m unwilling to help with the development of this fix, of course, but it also means that I can’t make it my own top priority right now. :-)

 

- Demian

 


Re: Optimization for alphabetic browse

Tod Olson
On implementing an optimization, I'm also thinking that BibDB::matchingIDs could check whether the extras parameter is empty and, if so, immediately call BibDB::recordCount; if the result is greater than some threshold, don't even bother pulling IDs. Not a perfect solution, but maybe okay.
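Roughly, as a sketch only: the two suppliers below stand in for BibDB::recordCount and for the full BibDB::matchingIDs walk, and the class, method, and threshold names are made up for illustration. I have not checked the real signatures.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch of the count-first shortcut; names are illustrative only. */
class CountFirstSketch {

    /** Result for one browse entry: total count plus whatever IDs were collected. */
    static final class Match {
        final int count;
        final List<String> ids;

        Match(int count, List<String> ids) {
            this.count = count;
            this.ids = ids;
        }
    }

    /**
     * @param extras      extra field names requested for display (may be empty)
     * @param threshold   hypothetical cutoff; above it no IDs are collected
     * @param recordCount stands in for BibDB::recordCount
     * @param collectIds  stands in for the full BibDB::matchingIDs walk
     */
    static Match lookup(List<String> extras, int threshold,
                        IntSupplier recordCount,
                        Supplier<List<String>> collectIds) {
        if (extras.isEmpty()) {
            int count = recordCount.getAsInt();
            if (count > threshold) {
                // Large entry, no extras wanted: report the count and skip the walk.
                return new Match(count, Collections.<String>emptyList());
            }
        }
        // Small entry, or extras requested: collect everything as before.
        List<String> ids = collectIds.get();
        return new Match(ids.size(), ids);
    }

    public static void main(String[] args) {
        Match big = lookup(Collections.<String>emptyList(), 5,
                () -> 14000, () -> Arrays.asList("unused"));
        System.out.println(big.count + " titles, " + big.ids.size() + " IDs");
    }
}
```

One cost of this shape: small entries without extras now pay for a count and then a collection, which is part of why it is not a perfect solution.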

And will be interested to see what Mark has to say!

Thanks,

-Tod


Re: Optimization for alphabetic browse

Mark Triggs-4

This may well disappoint, but I'm scratching my head as to why I ever bothered pulling out the list of IDs like that. Firing the search and getting the count should be cheap enough, but collecting all the IDs does seem like a waste of time.

I found this comment in the AlphaBrowse.php that I think I wrote:

 // Linking using bib ids is generally more reliable than doing
 // searches for headings, but headings give shorter queries and
 // don't look as strange.

But current me doesn't find that very compelling. Seems like it would be better to forget about collecting the IDs and just build the link based on the search heading. Like you say, should be much faster!

Cheers,

Mark


-- 
Mark Triggs
<[hidden email]>


Re: Optimization for alphabetic browse

Tod Olson
Thanks, Mark!

You were probably thinking about the majority of cases (especially titles) where there are only one or very few matches. I'd be really interested to find cases where the heading is less reliable; I rather doubt you were making that up.

In any case, I may try to think a little more flexibly about how to address the issue.

Best,

-Tod


Re: Optimization for alphabetic browse

Demian Katz

It’s possible that the search-by-exact-ID thing is a hold-over from earlier days of VuFind when search query construction was less consistent and reliable, or was simply a sign of lack of confidence in Solr. It may also help address edge cases where a record has changed in Solr but the browse index has not yet been updated – i.e. if a heading is removed or changed on a record but the record is not deleted.

 

In general, though, I’m inclined to agree with Mark that the whole thing may no longer be necessary… and indeed, if there are problems, having two different ways of resolving browse headings just creates more cases that need to be tested, potentially complicating troubleshooting and introducing weird behaviors. So it may make the most sense to switch to always doing things one consistent way.

 

I’d still be inclined to make this a configurable behavior (if it can be done at little cost) simply because there might be some use cases where it would be helpful for somebody to use the browse handler to fetch IDs – but that’s probably very conservative of me, and might not really be worth the effort.

 

- Demian

 

 


Re: Optimization for alphabetic browse

Tod Olson
It is also the case that BibDB::matchingIDs allows an "extras" parameter which names extra Solr fields to return for use in the browse listing. We use this in our Title Browse to return author, format, and year of publication, so when a title exists in multiple formats or the like, you see that in the list. See a title browse for "hobbit" in our VuFind installation. This is useful in title browse and call number browse, but we do not use the extras in the authority browses. So we must continue to support that.

So yes, probably best to devise a general solution. *think, think, think*

-Tod


Re: Optimization for alphabetic browse

Tod Olson
While I'm making changes to the browse handler, shall I also increment the Java version to 1.8?

-Tod


Re: Optimization for alphabetic browse

Demian Katz

No objections from me -- VuFind's base Java requirement was raised to 1.8 with release 4.0 due to the upgrade to Solr 6.


- Demian



From: Tod Olson <[hidden email]>
Sent: Friday, July 14, 2017 9:54 AM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse
 
While I'm making changes to the browse handler, shall I also increment the Java version to 1.8?

-Tod

On Jul 13, 2017, at 8:54 AM, Tod Olson <[hidden email]> wrote:

It is also the case that BibDB::matchingIDs allows an "extras" parameter which names extra Solr fields to return for use in the browse listing. We use this in our Title Browse to return author, format, and year of publication. When there is a title with multiple formats or whatever, you see that. See a title browse for "hobbit" in our VuFind installation. Useful in Title browse and call number browse, but we do not use the extras in the authority browses. So must continue to support that.

So yes, probably best to devise a general solution. *think, think, think*

-Tod

On Jul 13, 2017, at 7:22 AM, Demian Katz <[hidden email]> wrote:

It’s possible that the search-by-exact-ID thing is a hold-over from earlier days of VuFind when search query construction was less consistent and reliable, or was simply a sign of lack of confidence in Solr. It may also help address edge cases where a record has changed in Solr but the browse index has not yet been updated – i.e. if a heading is removed or changed on a record but the record is not deleted.
 
In general, though, I’m inclined to agree with Mark that the whole thing may no longer be necessary… and indeed, if there are problems, having two different ways of resolving browse headings just creates more cases that need to be tested, potentially complicating troubleshooting and introducing weird behaviors. So it may make the most sense to switch to always doing things one consistent way.
 
I’d still be inclined to make this a configurable behavior (if it can be done at little cost) simply because there might be some use cases where it would be helpful for somebody to use the browse handler to fetch IDs – but that’s probably very conservative of me, and might not really be worth the effort.
 
- Demian
 
 
From: Tod Olson [[hidden email]] 
Sent: Thursday, July 13, 2017 7:43 AM
To: Mark Triggs
Cc: Tod Olson; Demian Katz; vufind-tech
Subject: Re: Optimization for alphabetic browse
 
Thanks, Mark!
 
You were probably thinking about the majority of cases (especially titles) where there are only one or very few matches. I'd be really interested to find cases where the heading is less reliable, I rather doubt you were making that up.
 
In any case, I may try to think a little more flexibly about how to address the issue.
 
Best,
 
-Tod
 
On Jul 13, 2017, at 6:03 AM, Mark Triggs <[hidden email]> wrote:
 
This may well disappoint, but I'm scratching my head as to why I ever bothered pulling out the list of IDs like that. Firing the search and getting the count should be cheap enough, but collecting all the IDs does seem like a waste of time.
I found this comment in the AlphaBrowse.php that I think I wrote:
 // Linking using bib ids is generally more reliable than doing
 // searches for headings, but headings give shorter queries and
 // don't look as strange.
But current me doesn't find that very compelling. Seems like it would be better to forget about collecting the IDs and just build the link based on the search heading. Like you say, should be much faster!
Cheers,
Mark
Tod Olson <[hidden email]> writes:
On implementing an optimization, I'm also thinking in BibDB::matchingIDs, to check if the extras parameter is empty and just immediately check BibDB:: recordCount, if the result is greater than some threshold don't even bother with pulling IDs. Not a perfect solution, but maybe okay.
And will be interested to see what Mark has to say!
Thanks,
-Tod
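
The count-first check described above could look roughly like the following. This is only a sketch: `HeadingLookup` is a hypothetical stand-in for BibDB, and the method signatures and the `threshold` parameter are assumptions, not the real vufind-browse-handler API.

```java
import java.util.List;

final class CountFirstSketch {
    /** Hypothetical stand-in for BibDB; the real signatures may differ. */
    interface HeadingLookup {
        int recordCount(String heading);
        List<String> matchingIDs(String heading);
    }

    /**
     * Skip the expensive ID collection when no extras are requested and
     * the heading matches more records than the (assumed) threshold.
     */
    static List<String> idsUnlessTooMany(HeadingLookup db, String heading,
                                         List<String> extras, int threshold) {
        if (extras.isEmpty() && db.recordCount(heading) > threshold) {
            return List.of(); // the count alone is enough to build a heading link
        }
        return db.matchingIDs(heading);
    }

    public static void main(String[] args) {
        HeadingLookup db = new HeadingLookup() {
            public int recordCount(String heading) { return 14000; }
            public List<String> matchingIDs(String heading) { return List.of("b1"); }
        };
        // No extras requested and a large count: the ID lookup is skipped.
        System.out.println(idsUnlessTooMany(db, "Economics", List.of(), 5));
    }
}
```

The count query still runs for every entry, which matches the observation that firing the search and getting the count should be cheap compared with collecting the IDs.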
-- 
Mark Triggs
<[hidden email]>



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: Optimization for alphabetic browse

Tod Olson
In reply to this post by Tod Olson
If we want to preserve the search-by-exact-ID thing, there must be a way to set a threshold. The easiest way I can think of is to put it in solrconfig.xml. We'll come back to this in a bit. I want to start with the UI and work back to the browse-handler.

There are two things we want from the BibDB::matchingIDs() method:

(1) bib Ids for exact-match searching from the alphabetical browse results list, and
(2) extra field display info for the alphabetical browse results list.

These are two very different functions, but must be intertwined for fairly practical reasons, and that creates some complication.

For (1), the exact-match searching, the UI has some threshold on the number of exact bib Ids it will use in a browse list entry. The browse-handler response for each list item includes a list of bib ids and a count of matching records, and the UI uses that threshold to decide how to construct the search link. Let's say it's 5: if there are 5 ids or fewer, then search by bib id, otherwise search by heading. Where is that threshold set? And what is the behavior if the ids list is empty but the count is non-zero? Is it safe to return an empty list, or do we always need a minimal number of ids for the UI to work correctly?

For (2), the extra fields for list entries, if there is a request for any extra fields in a call to BibDB::matchingIDs(), then setting a threshold may risk not returning some values, unless we ignore the threshold when extra fields are requested. But I think that sort of exception is confusing.

So I think we want a mechanism to set the maximum number of records/bib Ids to retrieve, regardless of hit count on the entry or whether we are asking for extra fields.

How to implement such a threshold?

The existing requestHandler stanza for "/browse" already contains per-index configuration: DBpath, field (for exact-match search), and normalizer. We could add another setting, maxBibIds or maxBibListSize, something like that. Semantics could be:

-1: get all of the bibIds (maybe good for title browse)
0: get no bibIds and don't bother checking matching Ids or extras at all (probably good for subject browse)
N: List the first N bibIds that are returned (for sites or indexes wanting to hedge their bets) 

BrowseRequestHandler then stores the threshold in a new field in the BrowseSource object for each browse index.
The threshold is then somehow passed in to Browse::populateItem, but I've not gotten that far yet.
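
As a rough illustration of the -1 / 0 / N semantics, here is a minimal Java sketch. The class and method names (`BibIdLimit`, `collect`) are hypothetical and not part of vufind-browse-handler; the iterator simply stands in for whatever yields matching record IDs from the index.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class BibIdLimit {
    /**
     * Collect bib IDs under the proposed maxBibListSize semantics:
     * a negative value (e.g. -1) collects everything, 0 collects nothing,
     * and N stops after the first N IDs.
     */
    static List<String> collect(Iterator<String> matches, int maxBibListSize) {
        List<String> ids = new ArrayList<>();
        if (maxBibListSize == 0) {
            return ids; // never touch the matches at all
        }
        while (matches.hasNext()
                && (maxBibListSize < 0 || ids.size() < maxBibListSize)) {
            ids.add(matches.next());
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> all = List.of("b1", "b2", "b3");
        System.out.println(collect(all.iterator(), -1));
        System.out.println(collect(all.iterator(), 0));
        System.out.println(collect(all.iterator(), 2));
    }
}
```

The point of stopping inside the loop, rather than trimming the list afterwards, is that entries with thousands of matches no longer pay for IDs that will never be used.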

So that's the outline in my head. Sound reasonable so far, or too complicated?


Side question: do we need to add versioning to the browse-handler? BrowseRequestHandler::getVersion returns a fixed string. Maybe we should do something about that, but I don't know what.

-Tod

On Jul 13, 2017, at 8:54 AM, Tod Olson <[hidden email]> wrote:

It is also the case that BibDB::matchingIDs allows an "extras" parameter which names extra Solr fields to return for use in the browse listing. We use this in our Title Browse to return author, format, and year of publication. When there is a title with multiple formats or whatever, you see that. See a title browse for "hobbit" in our VuFind installation. This is useful in Title browse and call number browse, but we do not use the extras in the authority browses. So we must continue to support that.

So yes, probably best to devise a general solution. *think, think, think*

-Tod


Re: Optimization for alphabetic browse

Demian Katz

Tod,

 

To answer your first question, the logic for linking to AlphaBrowse results is found here:

 

https://github.com/vufind-org/vufind/blob/master/module/VuFind/src/VuFind/View/Helper/Root/AlphaBrowse.php#L67

 

From: Tod Olson [mailto:[hidden email]]
Sent: Tuesday, July 25, 2017 4:30 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse

 


Re: Optimization for alphabetic browse

Demian Katz
In reply to this post by Tod Olson

Sometimes, if I double-tap enter too quickly, it seems to send my emails prematurely. Stupid Outlook. Anyway, it appears that the logic for linking to headings assumes that IDs will be present when count <= 5, so simply omitting them from the response will not work without code changes to VuFind.
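
To make the dependency concrete, here is a hedged sketch of a more defensive link decision. The real logic lives in VuFind's PHP AlphaBrowse view helper; this Java version is only an illustration, and `BrowseLinkSketch`, `LinkType`, and `chooseLink` are hypothetical names. The cut-off of 5 is the one described above.

```java
import java.util.List;

final class BrowseLinkSketch {
    /** The "count <= 5" cut-off described above. */
    static final int ID_LINK_THRESHOLD = 5;

    enum LinkType { BY_IDS, BY_HEADING }

    /**
     * Link by bib IDs only when the count is small AND the handler actually
     * returned one ID per matching record; otherwise fall back to a heading
     * search. The fallback is what would let the handler omit IDs safely.
     */
    static LinkType chooseLink(int count, List<String> ids) {
        boolean fewEnough = count > 0 && count <= ID_LINK_THRESHOLD;
        boolean idsComplete = ids != null && ids.size() == count;
        return (fewEnough && idsComplete) ? LinkType.BY_IDS : LinkType.BY_HEADING;
    }

    public static void main(String[] args) {
        System.out.println(chooseLink(3, List.of("b1", "b2", "b3")));
        System.out.println(chooseLink(3, List.of()));
        System.out.println(chooseLink(14000, List.of()));
    }
}
```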

 

As far as configuration goes, if we’re interested in backward compatibility, perhaps we should actually add two settings:

 

retrieveBibId = true/false (default = true)

maxBibListSize = as you described (default = -1)

 

Essentially, retrieveBibId controls whether or not to fetch bib IDs while retrieving the extras list, and maxBibListSize controls how many records to examine when we need IDs and/or extras. We might eventually consider deprecating retrieveBibId and making “id” just another element that can be included in the extras list, but it seems simpler to start here.

 

As far as versioning goes, I certainly wouldn’t object to doing a better job of that. Right now the closest thing we have to versioning is that I tag the GitHub repo every time there is a VuFind release, to show which version of the browse handler that release includes… but having an internal, separate browse handler versioning mechanism wouldn’t be a bad thing, and would enable us to use semantic versioning for backward-incompatible changes, etc.

 

Hopefully that covers everything; if I missed anything I should have addressed, feel free to ask again. :-)

 

- Demian

 

From: Demian Katz
Sent: Wednesday, July 26, 2017 8:31 AM
To: 'Tod Olson'
Cc: Mark Triggs; vufind-tech
Subject: RE: Optimization for alphabetic browse

 


Re: Optimization for alphabetic browse

Tod Olson
I like that two-setting idea, and have worked it into my code changes. I hope to have a pull request later today or tomorrow to show how I'm approaching it, and to have others chime in on where to be more defensive in the programming.

The idea that id just becomes another field is also good. That would reduce backward compatibility, but clean up some code. To implement that, I would change some property names and create a new method to replace matchingIDs, just to make the break with backward compatibility more clear. (Of course, if we want to toggle between old and new behavior, that likely would be another property.)

-Tod

On Jul 26, 2017, at 7:38 AM, Demian Katz <[hidden email]> wrote:

Sometimes, if I double-tap enter too quickly, it seems to send my emails prematurely. Stupid Outlook. Anyway, it appears that the logic for linking to headings assumes that IDs will be present when count <= 5, so simply omitting them from the response will not work without code changes to VuFind.
 
As far as how configuration goes, if we’re interested in backward compatibility, perhaps we should actually add two settings:
 
retrieveBibId = true/false (default = true)
maxBibListSize = as you described (default = -1)
 
Essentially, retrieveBibId controls whether or not to fetch bib IDs while retrieving the extras list, and maxBibListSize controls how many records to examine when we need IDs and/or extras. We might eventually consider deprecating retrieveBibId and making “id” just another element that can be included in the extras list, but it seems simpler to start here.
 
As far as versioning goes, I certainly wouldn’t object to doing a better job of that. Right now the closest thing we have to versioning is the fact that I add a tag to the GitHub repo every time there is a VuFind release showing which version of the browse handler is included in each VuFind release… but having an internal, separate browse handler versioning mechanism wouldn’t be a bad thing, and would enable us to use semantic versioning for backward breaks, etc.
 
Hopefully that covers everything; if I missed anything I should have addressed, feel free to ask again. J
 
- Demian
 
From: Demian Katz 
Sent: Wednesday, July 26, 2017 8:31 AM
To: 'Tod Olson'
Cc: Mark Triggs; vufind-tech
Subject: RE: Optimization for alphabetic browse
 
Tod,
 
To answer your first question, the logic for linking to AlphaBrowse results is found here:
 
 
From: Tod Olson [[hidden email]] 
Sent: Tuesday, July 25, 2017 4:30 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse
 
If we want to preserve the search-by-exact-ID thing, there must be a way to set a threshold. The easiest way I can think of is to put it in solrconfig.xml. We'll come back to this in a bit. I want to start with the UI and work back to the browse-handler. 
 
There are two things we want from the BibDB::matchingIDs() method:
 
(1) bib Ids for exact-match searching from the alphabetical browse results list, and
(2) extra field display info for the alphabetical browse results list.
 
These are two very different functions, but must be intertwined for fairly practical reasons, and that creates some complication.
 
For (1), the exact-match searching, the UI has some threshold on the number of exact bib Ids it will use in a browse list entry. The browse-handler response for each list item includes a list of bib ids and a count of matching records. Somehow the UI has a threshold for how many ids to use in constructing the search link from the list. Let's say it's 5: if there are 5 ids or fewer, the search by bib id, otherwise search by heading. Where is that threshold set? And what is the behavior if the ids list is empty, but with a non-zero count? Is it safe to return an empty list, or do we always need a minimal number for the UI to work correctly?
 
For (2), the extra fields for list entries, if there is a request for any extra fields in a call to BibDB::matchingIDs(), then setting a threshold may risk not returning some values, unless we ignore the threshold when extra fields are requested. But I think that sort of exception is confusing.
 
So I think we want a mechanism to set the maximum number of records/bib Ids to retrieve, regardless of hit count on the entry or whether we are asking for extra fields.
 
How to implement such a threshold?
 
The existing requestHanders stanza for "/browse" already contain per-index configuration: DBpath, field (for exact-match search), and normalizer. We could add another setting, maxBibIds or maxBibListSize, something like that. Semantics could be:
 
-1: get all of the bibIds (maybe good for title browse)
0: get no bibIds and don't bother checking matching Ids or extras at all (probably good for subject browse)
N: List the first N bibIds that are returned (for sites or indexes wanting to hedge their bets) 
 
BrowseRequestHandler then stores the threshold for each browse index in a new field in the BrowseSource object for each browse index.
The the threshold is somehow passed in to Browse::populateItem, but I've not gotten that far yet.
 
So that's the outline in my head. Sound reasonable so far, or too complicated?
 
 
Side question: do we need to add versioning to the browse-handler? BrowseRequestHandler::getVersion returns a fixed string, maybe we should do something about that, but I don't know what.
 
-Tod
 
On Jul 13, 2017, at 8:54 AM, Tod Olson <[hidden email]> wrote:
 
It is also the case that BibDB::matchingIDs allows an "extras" parameter which names extra Solr fields to return for use in the browse listing. We use this in our Title Browse to return author, format, and year of publication. When there is a title with multiple formats or whatever, you see that. See a title browse for "hobbit" in our VuFind installation. Useful in Title browse and call number browse, but we do not use the extras in the authority browses. So must continue to support that.
 
So yes, probably best to devise a general solution. *think, think, think*
 
-Tod
 
On Jul 13, 2017, at 7:22 AM, Demian Katz <[hidden email]> wrote:
 
It’s possible that the search-by-exact-ID thing is a hold-over from earlier days of VuFind when search query construction was less consistent and reliable, or was simply a sign of lack of confidence in Solr. It may also help address edge cases where a record has changed in Solr but the browse index has not yet been updated – i.e. if a heading is removed or changed on a record but the record is not deleted.
 
In general, though, I’m inclined to agree with Mark that the whole thing may no longer be necessary… and indeed, if there are problems, having two different ways of resolving browse headings just creates more cases that need to be tested, potentially complicating troubleshooting and introducing weird behaviors. So it may make the most sense to switch to always doing things one consistent way.
 
I’d still be inclined to make this a configurable behavior (if it can be done at little cost) simply because there might be some use cases where it would be helpful for somebody to use the browse handler to fetch IDs – but that’s probably very conservative of me, and might not really be worth the effort.
 
- Demian
 
 
From: Tod Olson [[hidden email]] 
Sent: Thursday, July 13, 2017 7:43 AM
To: Mark Triggs
Cc: Tod Olson; Demian Katz; vufind-tech
Subject: Re: Optimization for alphabetic browse
 
Thanks, Mark!
 
You were probably thinking about the majority of cases (especially titles) where there are only one or very few matches. I'd be really interested to find cases where the heading is less reliable, I rather doubt you were making that up.
 
In any case, I may try to think a little more flexibly about how to address the issue.
 
Best,
 
-Tod
 
On Jul 13, 2017, at 6:03 AM, Mark Triggs <[hidden email]> wrote:
 
This may well disappoint, but I'm scratching my head as to why I ever bothered pulling out the list of IDs like that. Firing the search and getting the count should be cheap enough, but collecting all the IDs does seem like a waste of time.
I found this comment in the AlphaBrowse.php that I think I wrote:
 // Linking using bib ids is generally more reliable than doing
 // searches for headings, but headings give shorter queries and
 // don't look as strange.
But current me doesn't find that very compelling. Seems like it would be better to forget about collecting the IDs and just build the link based on the search heading. Like you say, should be much faster!
Cheers,
Mark
Tod Olson <[hidden email]> writes:
On implementing an optimization, I'm also thinking that BibDB::matchingIDs could check whether the extras parameter is empty and, if so, immediately check BibDB::recordCount; if the result is greater than some threshold, don't even bother with pulling IDs. Not a perfect solution, but maybe okay.
And will be interested to see what Mark has to say!
Thanks,
-Tod
-- 
Mark Triggs
<[hidden email]>
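
To make the count-first idea concrete, here is a minimal Java sketch. It is not the browse handler's code: `CountFirstLookup`, `lookup`, the `threshold` parameter, and the two supplier arguments are hypothetical stand-ins for a cheap count (in the spirit of BibDB::recordCount) and for the expensive ID collection in BibDB::matchingIDs.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch: use a cheap count to decide whether to collect IDs. */
final class CountFirstLookup {
    /** Result for one browse heading: total hit count plus any collected IDs. */
    static final class Result {
        final int count;
        final List<String> ids;

        Result(int count, List<String> ids) {
            this.count = count;
            this.ids = ids;
        }
    }

    /**
     * @param extras    extra field names requested (empty means none)
     * @param threshold largest count for which IDs are still collected (assumed setting)
     * @param counter   stand-in for a cheap count query
     * @param idFetcher stand-in for the expensive ID collection
     */
    static Result lookup(List<String> extras, int threshold,
                         IntSupplier counter, Supplier<List<String>> idFetcher) {
        int count = counter.getAsInt();
        // Only skip the expensive step when no extras are needed and the set is large.
        if (extras.isEmpty() && count > threshold) {
            return new Result(count, Collections.emptyList());
        }
        return new Result(count, idFetcher.get());
    }
}
```

With a threshold of 5, a heading with 14,000 hits would return only its count, while a heading with two hits would still get its IDs.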


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: Optimization for alphabetic browse

Demian Katz

Thanks, Tod, that makes sense to me.

 

As far as maintaining backward compatibility goes, if we want to get this into VuFind 4.1, it would be nice if we could implement it in a way that doesn’t break the interface. If we’re okay with waiting until 5.0 next year, I’d feel more comfortable with breaking changes.

 

One thought is that it might make sense to code this in such a way that the deep code that’s processing everything only expects one extras list… then higher-level code could process the retrieveBibId setting and use that to add the “id” field to the extras list as needed. This would then give us a place to put a deprecation warning, and later to simply remove obsolete functionality, without having to go deeper into the logic. Of course, I’m making this suggestion without looking at the code, so it may be completely inappropriate to the reality of the situation. ;-)
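
A rough sketch of that layering, assuming nothing about the real code: the hypothetical `FieldListBuilder.fieldsToFetch` folds the retrieveBibId setting into a single field list, so the deeper logic only ever sees one list, and the logged warning marks the spot where a deprecation notice could go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical sketch: treat "id" as just another entry in the extras list. */
final class FieldListBuilder {
    private static final Logger LOG = Logger.getLogger(FieldListBuilder.class.getName());

    /** Merge the legacy retrieveBibId flag into one list of fields to fetch. */
    static List<String> fieldsToFetch(List<String> extras, boolean retrieveBibId) {
        List<String> fields = new ArrayList<>(extras);
        if (retrieveBibId && !fields.contains("id")) {
            // Assumed placement for a future deprecation warning.
            LOG.warning("retrieveBibId may be deprecated; request \"id\" via extras instead");
            fields.add("id");
        }
        return fields;
    }
}
```

Once the setting is retired, only this one method would need to change.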

 

- Demian

 

From: Tod Olson [mailto:[hidden email]]
Sent: Wednesday, July 26, 2017 4:48 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse

 

I like that two-setting idea, and have worked that into my code changes. I hope to have a pull request later today or tomorrow to show how I'm approaching it, and to have others chime in on where to be more defensive in the programming.

 

The idea that id just becomes another field is also good. That would reduce backward compatibility, but clean up some code. To implement that, I would change some property names, and create a new method to replace matchingIDs, just to make the break with backward compatibility more clear. (Of course, if we want to toggle between old and new behavior, that likely would be another property.)

 

-Tod

 

On Jul 26, 2017, at 7:38 AM, Demian Katz <[hidden email]> wrote:

 

Sometimes, if I double-tap enter too quickly, it seems to send my emails prematurely. Stupid Outlook. Anyway, it appears that the logic for linking to headings assumes that IDs will be present when count <= 5, so simply omitting them from the response will not work without code changes to VuFind.

 

As far as how configuration goes, if we’re interested in backward compatibility, perhaps we should actually add two settings:

 

retrieveBibId = true/false (default = true)

maxBibListSize = as you described (default = -1)

 

Essentially, retrieveBibId controls whether or not to fetch bib IDs while retrieving the extras list, and maxBibListSize controls how many records to examine when we need IDs and/or extras. We might eventually consider deprecating retrieveBibId and making “id” just another element that can be included in the extras list, but it seems simpler to start here.
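
One way to read the two settings with backward-compatible defaults is sketched below. The `BrowseSettings` class and the idea of receiving the settings as a string map are assumptions for illustration; how the handler actually reads its configuration is not shown in this thread.

```java
import java.util.Map;

/** Hypothetical holder for the two proposed settings and their defaults. */
final class BrowseSettings {
    final boolean retrieveBibId;
    final int maxBibListSize;

    BrowseSettings(boolean retrieveBibId, int maxBibListSize) {
        this.retrieveBibId = retrieveBibId;
        this.maxBibListSize = maxBibListSize;
    }

    /** Missing keys fall back to the proposed defaults (true and -1). */
    static BrowseSettings fromMap(Map<String, String> raw) {
        boolean retrieve = Boolean.parseBoolean(raw.getOrDefault("retrieveBibId", "true"));
        int max = Integer.parseInt(raw.getOrDefault("maxBibListSize", "-1"));
        return new BrowseSettings(retrieve, max);
    }
}
```

An index configured without either key would behave exactly as it does today, which is the point of choosing these defaults.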

 

As far as versioning goes, I certainly wouldn’t object to doing a better job of that. Right now the closest thing we have to versioning is the fact that I add a tag to the GitHub repo every time there is a VuFind release showing which version of the browse handler is included in each VuFind release… but having an internal, separate browse handler versioning mechanism wouldn’t be a bad thing, and would enable us to use semantic versioning for backward breaks, etc.

 

Hopefully that covers everything; if I missed anything I should have addressed, feel free to ask again. :-)

 

- Demian

 

From: Demian Katz 
Sent: Wednesday, July 26, 2017 8:31 AM
To: 'Tod Olson'
Cc: Mark Triggs; vufind-tech
Subject: RE: Optimization for alphabetic browse

 

Tod,

 

To answer your first question, the logic for linking to AlphaBrowse results is found here:

 

 

From: Tod Olson [[hidden email]] 
Sent: Tuesday, July 25, 2017 4:30 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse

 

If we want to preserve the search-by-exact-ID thing, there must be a way to set a threshold. The easiest way I can think of is to put it in solrconfig.xml. We'll come back to this in a bit. I want to start with the UI and work back to the browse-handler. 

 

There are two things we want from the BibDB::matchingIDs() method:

 

(1) bib Ids for exact-match searching from the alphabetical browse results list, and

(2) extra field display info for the alphabetical browse results list.

 

These are two very different functions, but must be intertwined for fairly practical reasons, and that creates some complication.

 

For (1), the exact-match searching, the UI has some threshold on the number of exact bib Ids it will use in a browse list entry. The browse-handler response for each list item includes a list of bib ids and a count of matching records. Somehow the UI has a threshold for how many ids to use in constructing the search link from the list. Let's say it's 5: if there are 5 ids or fewer, then search by bib id, otherwise search by heading. Where is that threshold set? And what is the behavior if the ids list is empty, but with a non-zero count? Is it safe to return an empty list, or do we always need a minimal number for the UI to work correctly?
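
The empty-list question can be illustrated with a defensive version of the link decision. This is a hypothetical Java sketch, not VuFind's actual linking logic (which is PHP and is not reproduced in this thread); `BrowseLink`, `linkMode`, and the constant are invented names, and the value 5 is taken from the "count <= 5" remark elsewhere in this thread.

```java
import java.util.List;

/** Hypothetical sketch: pick a link style that tolerates a missing ID list. */
final class BrowseLink {
    /** Assumed UI threshold for linking by record ID. */
    static final int ID_LINK_THRESHOLD = 5;

    /** Returns "ids" when a by-ID link can safely be built, otherwise "heading". */
    static String linkMode(int count, List<String> ids) {
        boolean fewEnough = count > 0 && count <= ID_LINK_THRESHOLD;
        // If the handler omitted or truncated the IDs, fall back to the heading.
        boolean idsComplete = ids != null && ids.size() == count;
        return (fewEnough && idsComplete) ? "ids" : "heading";
    }
}
```

With a check like this, an empty ids list alongside a non-zero count degrades to a heading search instead of producing a broken link.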

 

For (2), the extra fields for list entries, if there is a request for any extra fields in a call to BibDB::matchingIDs(), then setting a threshold may risk not returning some values, unless we ignore the threshold when extra fields are requested. But I think that sort of exception is confusing.

 

So I think we want a mechanism to set the maximum number of records/bib Ids to retrieve, regardless of hit count on the entry or whether we are asking for extra fields.

 

How to implement such a threshold?

 

The existing requestHandler stanza for "/browse" already contains per-index configuration: DBpath, field (for exact-match search), and normalizer. We could add another setting, maxBibIds or maxBibListSize, something like that. Semantics could be:

 

-1: get all of the bibIds (maybe good for title browse)

0: get no bibIds and don't bother checking matching Ids or extras at all (probably good for subject browse)

N: List the first N bibIds that are returned (for sites or indexes wanting to hedge their bets) 

 

BrowseRequestHandler then stores the threshold for each browse index in a new field in the BrowseSource object for each browse index.

The threshold is then somehow passed in to Browse::populateItem, but I've not gotten that far yet.
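
The three cases (-1, 0, N) can be sketched in a few lines of Java. This is only an illustration of the proposed semantics: `BibListLimit` and `collect` are hypothetical names, and a plain `Iterator` stands in for whatever actually yields the matching record IDs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the proposed maxBibListSize semantics. */
final class BibListLimit {
    /**
     * -1 (any negative value): collect every ID;
     *  0: collect nothing and never touch the matches;
     *  N: stop pulling matches once N IDs have been collected.
     */
    static List<String> collect(Iterator<String> matches, int maxBibListSize) {
        List<String> ids = new ArrayList<>();
        if (maxBibListSize == 0) {
            return ids;
        }
        while (matches.hasNext()
                && (maxBibListSize < 0 || ids.size() < maxBibListSize)) {
            ids.add(matches.next());
        }
        return ids;
    }
}
```

Because the loop stops pulling from the iterator at the limit, the cost for a heading with thousands of matches stays bounded by N rather than by the hit count.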

 

So that's the outline in my head. Sound reasonable so far, or too complicated?

 

 

Side question: do we need to add versioning to the browse-handler? BrowseRequestHandler::getVersion returns a fixed string; maybe we should do something about that, but I don't know what.

 

-Tod

 

On Jul 13, 2017, at 8:54 AM, Tod Olson <[hidden email]> wrote:

 

It is also the case that BibDB::matchingIDs allows an "extras" parameter which names extra Solr fields to return for use in the browse listing. We use this in our Title Browse to return author, format, and year of publication. When there is a title with multiple formats or whatever, you see that. See a title browse for "hobbit" in our VuFind installation. Useful in Title browse and call number browse, but we do not use the extras in the authority browses. So we must continue to support that.

 

So yes, probably best to devise a general solution. *think, think, think*

 

-Tod

 


 



Re: Optimization for alphabetic browse

Tod Olson
Yes, that's a good thought. I'll put out a pull request soon, in advance of trying to implement that thought.

I also will not have implemented the maxBibListSize logic. The current code uses a Lucene SimpleCollector object to collect the entire result set, and it's unclear how best to put a limit on the records to collect, since the collect() method is called in a loop that we do not control, inside of IndexSearcher.search(Query q, Collector results). So this will be interesting. There are a couple of other search() methods defined; I may use one that returns N docs and then have to parse the docs.
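
The difficulty of stopping a callback-driven loop early can be shown without Lucene. The following is a plain-Java simulation, not Lucene code: `searchAll` stands in for a search loop the caller does not control, and `LimitReached` is an invented exception used to break out of it. Lucene itself offers `IndexSearcher.search(Query, int)`, which returns a `TopDocs`, and a `CollectionTerminatedException` that a collector can throw to stop collecting the current segment; whether either fits the browse handler is not settled in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Simulation of limiting a collector that is driven by someone else's loop. */
final class LimitedCollectDemo {
    /** Signals that enough documents were collected; carries no stack trace. */
    static final class LimitReached extends RuntimeException {
        LimitReached() {
            super(null, null, false, false);
        }
    }

    /** Stand-in for a search loop the caller cannot modify. */
    static void searchAll(int totalHits, IntConsumer collector) {
        for (int doc = 0; doc < totalHits; doc++) {
            collector.accept(doc);
        }
    }

    /** Collects at most {@code limit} doc numbers; a negative limit means no limit. */
    static List<Integer> collectAtMost(int totalHits, int limit) {
        List<Integer> docs = new ArrayList<>();
        if (limit == 0) {
            return docs;
        }
        try {
            searchAll(totalHits, doc -> {
                docs.add(doc);
                if (limit > 0 && docs.size() >= limit) {
                    throw new LimitReached(); // abort the loop we do not control
                }
            });
        } catch (LimitReached done) {
            // Expected: the limit was hit before the loop finished.
        }
        return docs;
    }
}
```

The trade-off is that control flow by exception is awkward to read, which is one reason a search method that takes the limit directly may be the simpler route.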

But best to get a pull request out before I go much further.

-Tod

On Jul 26, 2017, at 3:53 PM, Demian Katz <[hidden email]> wrote:

Thanks, Tod, that makes sense to me.
 
As far as maintaining backward compatibility goes, if we want to get this into VuFind 4.1, it would be nice if we could implement it in a way that doesn’t break the interface. If we’re okay with waiting until 5.0 next year, I’d feel more comfortable with breaking changes.
 
One thought is that it might make sense to code this in such a way that the deep code that’s processing everything only expects one extras list… then higher-level code could process the retrieveBibId setting and use that to add the “id” field to the extras list as needed. This would then give us a place to put a deprecation warning, and later to simply remove obsolete functionality, without having to go deeper into the logic. Of course, I’m making this suggestion without looking at the code, so it may be completely inappropriate to the reality of the situation. ;-)
 
- Demian
 



Re: Optimization for alphabetic browse

Demian Katz

Thanks, Tod!

 

I would encourage you to create a TODO list in the PR description so we can keep track of progress and know what we are reviewing… e.g.

 

TODO

 

- [ ] Implement maxBibListSize logic

- [ ] Test X, Y and Z

 

Also, let me know if you think it would be helpful to merge some of your cleanup work to master now (like the syntax adjustments and comment corrections) in order to reduce the size of the PR and make it easier to review the core logic. Some of this looks pretty obviously safe to go ahead with.

 

thanks,

Demian

 

From: Tod Olson [mailto:[hidden email]]
Sent: Wednesday, July 26, 2017 5:22 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse

 

Yes, that's a good thought. I'll put out a pull request soon, in advance of trying to implement that thought.

 

I also will not have implemented the maxBibListSize logic. The current code uses a Lucene SimpleCollector object to collect the entire result set, and it's unclear how best to put a limit on the records to collect, since the collect() method is called in a loop that we do not control, inside of IndexSearcher.search(Query q, Collector results). So this will be interesting. There are a couple of other search() methods defined; I may use one that returns N docs and then have to parse the docs.

 

But best to get a pull request out before I go much further.

 

-Tod

 

On Jul 26, 2017, at 3:53 PM, Demian Katz <[hidden email]> wrote:

 

Thanks, Tod, that makes sense to me.

 

As far as maintaining backward compatibility goes, if we want to get this into VuFind 4.1, it would be nice if we could implement in a way that doesn’t break the interface. If we’re okay with waiting until 5.0 next year, I’d feel more comfortable with breaking changes.

 

One thought is that it might make sense to code this in such a way that the deep code that’s processing everything only expects one extras list… then higher-level code could process the retrieveBibId setting and use that to add the “id” field to the extras list as needed. This would then give us a place to put a deprecation warning, and later to simply remove obsolete functionality, without having to go deeper into the logic. Of course, I’m making this suggestion without looking at the code, so it may be completely inappropriate to the reality of the situation. ;-)

 

- Demian

 

From: Tod Olson [[hidden email]] 
Sent: Wednesday, July 26, 2017 4:48 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse

 

I like that two setting idea, and have worked that into my code changes. I hope to have a pull request later today or tomorrow to show how I'm approaching it. And have others chime in on where to be more defensive in the programming.

 

The idea that id just become another field is also good. That would reduce backward compatibility, but clean up some code. To implement that, I would change some property names, and create a new method to replace matchingIDs, just to make the break with back-comparability more clear. (of course, if we want to toggle between old and new behavior that likely would be another property.)

 

-Tod

 

On Jul 26, 2017, at 7:38 AM, Demian Katz <[hidden email]> wrote:

 

Sometimes, if I double-tap enter too quickly, it seems to send my emails prematurely. Stupid Outlook. Anyway, it appears that the logic for linking to headings assumes that IDs will be present when count <= 5, so simply omitting them from the response will not work without code changes to VuFind.

 

As far as how configuration goes, if we’re interested in backward compatibility, perhaps we should actually add two settings:

 

retrieveBibId = true/false (default = true)

maxBibListSize = as you described (default = -1)

 

Essentially, retrieveBibId controls whether or not to fetch bib IDs while retrieving the extras list, and maxBibListSize controls how many records to examine when we need IDs and/or extras. We might eventually consider deprecating retrieveBibId and making “id” just another element that can be included in the extras list, but it seems simpler to start here.

 

As far as versioning goes, I certainly wouldn’t object to doing a better job of that. Right now the closest thing we have to versioning is the fact that I add a tag to the GitHub repo every time there is a VuFind release showing which version of the browse handler is included in each VuFind release… but having an internal, separate browse handler versioning mechanism wouldn’t be a bad thing, and would enable us to use semantic versioning for backward breaks, etc.

 

Hopefully that covers everything; if I missed anything I should have addressed, feel free to ask again. J

 

- Demian

 

From: Demian Katz 
Sent: Wednesday, July 26, 2017 8:31 AM
To: 'Tod Olson'
Cc: Mark Triggs; vufind-tech
Subject: RE: Optimization for alphabetic browse

 

Tod,

 

To answer your first question, the logic for linking to AlphaBrowse results is found here:

 

 

From: Tod Olson [[hidden email]] 
Sent: Tuesday, July 25, 2017 4:30 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse

 

If we want to preserve the search-by-exact-ID thing, there must be a way to set a threshold. The easiest way I can think of is to put it in solrconfig.xml. We'll come back to this in a bit. I want to start with the UI and work back to the browse-handler. 

 

There are two things we want from the BibDB::matchingIDs() method:

 

(1) bib Ids for exact-match searching from the alphabetical browse results list, and

(2) extra field display info for the alphabetical browse results list.

 

These are two very different functions, but must be intertwined for fairly practical reasons, and that creates some complication.

 

For (1), the exact-match searching, the UI has some threshold on the number of exact bib Ids it will use in a browse list entry. The browse-handler response for each list item includes a list of bib ids and a count of matching records. Somehow the UI has a threshold for how many ids to use in constructing the search link from the list. Let's say it's 5: if there are 5 ids or fewer, the search by bib id, otherwise search by heading. Where is that threshold set? And what is the behavior if the ids list is empty, but with a non-zero count? Is it safe to return an empty list, or do we always need a minimal number for the UI to work correctly?

 

For (2), the extra fields for list entries, if there is a request for any extra fields in a call to BibDB::matchingIDs(), then setting a threshold may risk not returning some values, unless we ignore the threshold when extra fields are requested. But I think that sort of exception is confusing.

 

So I think we want a mechanism to set the maximum number of records/bib Ids to retrieve, regardless of hit count on the entry or whether we are asking for extra fields.

 

How to implement such a threshold?

 

The existing requestHanders stanza for "/browse" already contain per-index configuration: DBpath, field (for exact-match search), and normalizer. We could add another setting, maxBibIds or maxBibListSize, something like that. Semantics could be:

 

-1: get all of the bibIds (maybe good for title browse)

0: get no bibIds and don't bother checking matching Ids or extras at all (probably good for subject browse)

N: List the first N bibIds that are returned (for sites or indexes wanting to hedge their bets) 

 

BrowseRequestHandler then stores the threshold for each browse index in a new field in the BrowseSource object for each browse index.

The the threshold is somehow passed in to Browse::populateItem, but I've not gotten that far yet.

 

So that's the outline in my head. Sound reasonable so far, or too complicated?

 

 

Side question: do we need to add versioning to the browse-handler? BrowseRequestHandler::getVersion returns a fixed string, maybe we should do something about that, but I don't know what.

 

-Tod

 

On Jul 13, 2017, at 8:54 AM, Tod Olson <[hidden email]> wrote:

 

It is also the case that BibDB::matchingIDs allows an "extras" parameter which names extra Solr fields to return for use in the browse listing. We use this in our Title Browse to return author, format, and year of publication. When there is a title with multiple formats or whatever, you see that. See a title browse for "hobbit" in our VuFind installation. Useful in Title browse and call number browse, but we do not use the extras in the authority browses. So must continue to support that.

 

So yes, probably best to devise a general solution. *think, think, think*

 

-Tod

 

On Jul 13, 2017, at 7:22 AM, Demian Katz <[hidden email]> wrote:

It’s possible that the search-by-exact-ID thing is a hold-over from earlier days of VuFind when search query construction was less consistent and reliable, or was simply a sign of lack of confidence in Solr. It may also help address edge cases where a record has changed in Solr but the browse index has not yet been updated – i.e. if a heading is removed or changed on a record but the record is not deleted.

In general, though, I’m inclined to agree with Mark that the whole thing may no longer be necessary… and indeed, if there are problems, having two different ways of resolving browse headings just creates more cases that need to be tested, potentially complicating troubleshooting and introducing weird behaviors. So it may make the most sense to switch to always doing things one consistent way.

I’d still be inclined to make this a configurable behavior (if it can be done at little cost) simply because there might be some use cases where it would be helpful for somebody to use the browse handler to fetch IDs – but that’s probably very conservative of me, and might not really be worth the effort.

- Demian

From: Tod Olson [[hidden email]]
Sent: Thursday, July 13, 2017 7:43 AM
To: Mark Triggs
Cc: Tod Olson; Demian Katz; vufind-tech
Subject: Re: Optimization for alphabetic browse

Thanks, Mark!

You were probably thinking about the majority of cases (especially titles) where there are only one or very few matches. I'd be really interested to find cases where the heading is less reliable, I rather doubt you were making that up.

In any case, I may try to think a little more flexibly about how to address the issue.

Best,

-Tod

On Jul 13, 2017, at 6:03 AM, Mark Triggs <[hidden email]> wrote:

This may well disappoint, but I'm scratching my head as to why I ever bothered pulling out the list of IDs like that. Firing the search and getting the count should be cheap enough, but collecting all the IDs does seem like a waste of time.

I found this comment in the AlphaBrowse.php that I think I wrote:

 // Linking using bib ids is generally more reliable than doing
 // searches for headings, but headings give shorter queries and
 // don't look as strange.

But current me doesn't find that very compelling. Seems like it would be better to forget about collecting the IDs and just build the link based on the search heading. Like you say, should be much faster!

Cheers,

Mark

Tod Olson <[hidden email]> writes:

On implementing an optimization, I'm also thinking of having BibDB::matchingIDs check whether the extras parameter is empty and, if so, immediately check BibDB::recordCount; if the result is greater than some threshold, don't even bother with pulling IDs. Not a perfect solution, but maybe okay.
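
A minimal sketch of that short-circuit, assuming the count is already known and the expensive ID lookup can be deferred. The class name, the method signature, and the Supplier standing in for the Lucene lookup are all hypothetical; the real BibDB::matchingIDs is shaped differently.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper, not part of the browse handler. */
final class IdShortCircuit {
    private IdShortCircuit() {}

    /**
     * Skips the ID lookup when no extras are requested and the heading
     * has more matching records than the threshold.
     *
     * @param extras      extra Solr fields requested for the listing
     * @param recordCount number of records matching the heading
     * @param threshold   above this count, IDs are not worth collecting
     * @param fetchIds    stand-in for the expensive ID collection
     */
    static List<String> idsOrEmpty(List<String> extras, int recordCount,
                                   int threshold, Supplier<List<String>> fetchIds) {
        if (extras.isEmpty() && recordCount > threshold) {
            // Large heading and nothing else to display: the count alone suffices.
            return Collections.emptyList();
        }
        return fetchIds.get();
    }
}
```

When extras are requested the lookup still runs in full, which is the gap the "not a perfect solution" remark points at.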

And will be interested to see what Mark has to say!

Thanks,

-Tod

-- 
Mark Triggs
<[hidden email]>

 


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: Optimization for alphabetic browse

Tod Olson
Thanks for the reminder, will do.

Also, I do think that those style changes are quite safe.

-Tod

On Jul 27, 2017, at 7:09 AM, Demian Katz <[hidden email]> wrote:

Thanks, Tod!
 
I would encourage you to create a TODO list in the PR description so we can keep track of progress and know what we are reviewing… i.e.
 
TODO
 
- [ ] Implement maxBibListSize logic
- [ ] Test X, Y and Z
 
Also, let me know if you think it would be helpful to merge some of your cleanup work to master now (like the syntax adjustments and comment corrections) in order to reduce the size of the PR and make it easier to review the core logic. Some of this looks pretty obviously safe to go ahead with.
 
thanks,
Demian
 
From: Tod Olson [[hidden email]] 
Sent: Wednesday, July 26, 2017 5:22 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse
 
Yes, that's a good thought. I'll put out a pull request soon, in advance of trying to implement that thought.
 
I also will not have implemented the maxBibListSize logic. The current code uses a Lucene SimpleCollector object to collect the entire result set, and it's unclear how best to put a limit on the records to collect, since the collect() method is called in a loop that we do not control, inside of IndexSearcher.search(Query q, Collector results). So this will be interesting. There are a couple of other search() methods defined; I may use one that returns N docs and then have to parse the docs.
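
One way to stop a collection loop you do not control is to throw from the callback once enough documents have been seen. The sketch below shows the pattern in plain Java with stand-in types; DocCollector, EnoughDocs, search() and firstN() are all hypothetical and are not Lucene classes. Lucene does provide a CollectionTerminatedException for a similar purpose, but as far as I know it ends collection only for the current index segment, so that behavior should be checked against the Lucene version in use. The other route mentioned above, IndexSearcher.search(Query, int), returns a TopDocs with at most N hits.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of early termination; all types are stand-ins. */
final class LimitedCollection {
    private LimitedCollection() {}

    /** Stand-in for a Lucene-style collector callback. */
    interface DocCollector {
        void collect(int docId);
    }

    /** Thrown by the collector once it has seen enough documents. */
    static final class EnoughDocs extends RuntimeException {
        EnoughDocs() {
            super(null, null, false, false); // control flow only: no stack trace
        }
    }

    /** Stand-in for the search loop we do not control. */
    static void search(int[] matchingDocs, DocCollector collector) {
        for (int doc : matchingDocs) {
            collector.collect(doc);
        }
    }

    /** Collects at most {@code limit} doc IDs, aborting the loop early. */
    static List<Integer> firstN(int[] matchingDocs, int limit) {
        List<Integer> docIds = new ArrayList<>();
        try {
            search(matchingDocs, docId -> {
                if (docIds.size() < limit) {
                    docIds.add(docId);
                }
                if (docIds.size() >= limit) {
                    throw new EnoughDocs(); // stop as soon as the limit is reached
                }
            });
        } catch (EnoughDocs stop) {
            // Expected: the collector has everything it needs.
        }
        return docIds;
    }
}
```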
 
But best to get a pull request out before I go much further.
 
-Tod
 
On Jul 26, 2017, at 3:53 PM, Demian Katz <[hidden email]> wrote:
 
Thanks, Tod, that makes sense to me.
 
As far as maintaining backward compatibility goes, if we want to get this into VuFind 4.1, it would be nice if we could implement in a way that doesn’t break the interface. If we’re okay with waiting until 5.0 next year, I’d feel more comfortable with breaking changes.
 
One thought is that it might make sense to code this in such a way that the deep code that’s processing everything only expects one extras list… then higher-level code could process the retrieveBibId setting and use that to add the “id” field to the extras list as needed. This would then give us a place to put a deprecation warning, and later to simply remove obsolete functionality, without having to go deeper into the logic. Of course, I’m making this suggestion without looking at the code, so it may be completely inappropriate to the reality of the situation. ;-)
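
A rough sketch of that layering, assuming the deep code takes a single list of field names. ExtrasResolver and effectiveFields are hypothetical names, and the deprecation warning is only marked by a comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical higher-level step that folds retrieveBibId into the extras list. */
final class ExtrasResolver {
    private ExtrasResolver() {}

    /**
     * Returns the single field list handed to the deep code.
     *
     * @param extras        extra Solr fields configured for the browse index
     * @param retrieveBibId legacy switch asking for bib IDs
     */
    static List<String> effectiveFields(List<String> extras, boolean retrieveBibId) {
        List<String> fields = new ArrayList<>(extras);
        if (retrieveBibId && !fields.contains("id")) {
            // A deprecation warning for retrieveBibId could be logged here.
            fields.add(0, "id");
        }
        return fields;
    }
}
```

With this shape, removing retrieveBibId later would only touch this one method.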
 
- Demian
 
From: Tod Olson [[hidden email]] 
Sent: Wednesday, July 26, 2017 4:48 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse
 
I like that two-setting idea, and have worked that into my code changes. I hope to have a pull request later today or tomorrow to show how I'm approaching it, and to have others chime in on where to be more defensive in the programming.
 
The idea that id just becomes another field is also good. That would reduce backward compatibility, but clean up some code. To implement that, I would change some property names, and create a new method to replace matchingIDs, just to make the break with backward compatibility more clear. (Of course, if we want to toggle between old and new behavior that likely would be another property.)
 
-Tod
 
On Jul 26, 2017, at 7:38 AM, Demian Katz <[hidden email]> wrote:
 
Sometimes, if I double-tap enter too quickly, it seems to send my emails prematurely. Stupid Outlook. Anyway, it appears that the logic for linking to headings assumes that IDs will be present when count <= 5, so simply omitting them from the response will not work without code changes to VuFind.
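
To make the needed change concrete, here is a sketch of link construction that falls back to the heading whenever the ID list is missing or empty. VuFind's actual logic is PHP and is not reproduced here; the class, the method, and the query syntax are illustrative assumptions, and only the threshold of 5 comes from the discussion above.

```java
import java.util.List;

/** Hypothetical link builder; not VuFind's implementation. */
final class BrowseLink {
    /** Per the thread, IDs are expected to be present when count <= 5. */
    static final int ID_LINK_THRESHOLD = 5;

    private BrowseLink() {}

    /**
     * Builds a search query for one browse entry.
     *
     * @param field   Solr field used for heading searches
     * @param heading the browse heading
     * @param count   number of records under the heading
     * @param ids     bib IDs returned by the browse handler (may be empty or null)
     */
    static String searchQuery(String field, String heading, int count, List<String> ids) {
        boolean useIds = count <= ID_LINK_THRESHOLD && ids != null && !ids.isEmpty();
        if (useIds) {
            StringBuilder query = new StringBuilder();
            for (String id : ids) {
                if (query.length() > 0) {
                    query.append(" OR ");
                }
                query.append("id:\"").append(id).append('"');
            }
            return query.toString();
        }
        // Fall back to the heading, escaping embedded double quotes.
        return field + ":\"" + heading.replace("\"", "\\\"") + "\"";
    }
}
```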
 
As far as how configuration goes, if we’re interested in backward compatibility, perhaps we should actually add two settings:
 
retrieveBibId = true/false (default = true)
maxBibListSize = as you described (default = -1)
 
Essentially, retrieveBibId controls whether or not to fetch bib IDs while retrieving the extras list, and maxBibListSize controls how many records to examine when we need IDs and/or extras. We might eventually consider deprecating retrieveBibId and making “id” just another element that can be included in the extras list, but it seems simpler to start here.
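
A small sketch of reading the two settings with those defaults. The Map is a stand-in for whatever per-index parameters the handler gets from solrconfig.xml, and BrowseIndexSettings is a hypothetical class name.

```java
import java.util.Map;

/** Hypothetical holder for the two proposed per-index settings. */
final class BrowseIndexSettings {
    final boolean retrieveBibId;
    final int maxBibListSize;

    BrowseIndexSettings(boolean retrieveBibId, int maxBibListSize) {
        this.retrieveBibId = retrieveBibId;
        this.maxBibListSize = maxBibListSize;
    }

    /**
     * Parses the settings, applying the backward-compatible defaults
     * retrieveBibId = true and maxBibListSize = -1.
     */
    static BrowseIndexSettings fromParams(Map<String, String> params) {
        String retrieve = params.get("retrieveBibId");
        String max = params.get("maxBibListSize");
        boolean retrieveBibId = (retrieve == null) || Boolean.parseBoolean(retrieve.trim());
        int maxBibListSize = (max == null) ? -1 : Integer.parseInt(max.trim());
        if (maxBibListSize < -1) {
            throw new IllegalArgumentException("maxBibListSize must be -1, 0, or positive");
        }
        return new BrowseIndexSettings(retrieveBibId, maxBibListSize);
    }
}
```

An index with neither setting configured then behaves as it does today.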
 
As far as versioning goes, I certainly wouldn’t object to doing a better job of that. Right now the closest thing we have to versioning is the fact that I add a tag to the GitHub repo every time there is a VuFind release showing which version of the browse handler is included in each VuFind release… but having an internal, separate browse handler versioning mechanism wouldn’t be a bad thing, and would enable us to use semantic versioning for backward breaks, etc.
 
Hopefully that covers everything; if I missed anything I should have addressed, feel free to ask again. :-)
 
- Demian
 
From: Demian Katz 
Sent: Wednesday, July 26, 2017 8:31 AM
To: 'Tod Olson'
Cc: Mark Triggs; vufind-tech
Subject: RE: Optimization for alphabetic browse
 
Tod,
 
To answer your first question, the logic for linking to AlphaBrowse results is found here:
 
 
From: Tod Olson [[hidden email]] 
Sent: Tuesday, July 25, 2017 4:30 PM
To: Demian Katz
Cc: Tod Olson; Mark Triggs; vufind-tech
Subject: Re: Optimization for alphabetic browse
 
If we want to preserve the search-by-exact-ID thing, there must be a way to set a threshold. The easiest way I can think of is to put it in solrconfig.xml. We'll come back to this in a bit. I want to start with the UI and work back to the browse-handler. 
 
There are two things we want from the BibDB::matchingIDs() method:
 
(1) bib Ids for exact-match searching from the alphabetical browse results list, and
(2) extra field display info for the alphabetical browse results list.
 
These are two very different functions, but must be intertwined for fairly practical reasons, and that creates some complication.
 
For (1), the exact-match searching, the UI has some threshold on the number of exact bib Ids it will use in a browse list entry. The browse-handler response for each list item includes a list of bib ids and a count of matching records. Somehow the UI has a threshold for how many ids to use in constructing the search link from the list. Let's say it's 5: if there are 5 ids or fewer, then search by bib id, otherwise search by heading. Where is that threshold set? And what is the behavior if the ids list is empty, but with a non-zero count? Is it safe to return an empty list, or do we always need a minimal number for the UI to work correctly?
 
For (2), the extra fields for list entries, if there is a request for any extra fields in a call to BibDB::matchingIDs(), then setting a threshold may risk not returning some values, unless we ignore the threshold when extra fields are requested. But I think that sort of exception is confusing.
 
So I think we want a mechanism to set the maximum number of records/bib Ids to retrieve, regardless of hit count on the entry or whether we are asking for extra fields.
 
How to implement such a threshold?
 
The existing requestHandler stanza for "/browse" already contains per-index configuration: DBpath, field (for exact-match search), and normalizer. We could add another setting, maxBibIds or maxBibListSize, something like that. Semantics could be:
 
-1: get all of the bibIds (maybe good for title browse)
0: get no bibIds and don't bother checking matching Ids or extras at all (probably good for subject browse)
N: List the first N bibIds that are returned (for sites or indexes wanting to hedge their bets) 
 
BrowseRequestHandler then stores the threshold for each browse index in a new field in the BrowseSource object for each browse index.
The threshold is somehow passed in to Browse::populateItem, but I've not gotten that far yet.
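
The -1 / 0 / N semantics above can be sketched as a helper that stops pulling IDs as soon as the limit is reached. BibListLimit is a hypothetical name, and the Iterator is a stand-in for however the matching IDs are actually produced; in the proposal the limit itself would live on the BrowseSource object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper applying the proposed maxBibListSize semantics. */
final class BibListLimit {
    private BibListLimit() {}

    /**
     * Pulls bib IDs from the source until the limit is reached.
     *
     * @param ids            lazily produced matching bib IDs
     * @param maxBibListSize -1 = all IDs, 0 = no IDs, N = the first N IDs
     */
    static List<String> collect(Iterator<String> ids, int maxBibListSize) {
        List<String> collected = new ArrayList<>();
        if (maxBibListSize == 0) {
            return collected; // 0: do not touch the ID source at all
        }
        while (ids.hasNext() && (maxBibListSize < 0 || collected.size() < maxBibListSize)) {
            collected.add(ids.next());
        }
        return collected;
    }
}
```

Because the loop stops consuming the iterator at the limit, a heading with 14K titles costs no more than one with N.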
 
So that's the outline in my head. Sound reasonable so far, or too complicated?
 
 
Side question: do we need to add versioning to the browse-handler? BrowseRequestHandler::getVersion returns a fixed string, maybe we should do something about that, but I don't know what.
 
-Tod
 
On Jul 13, 2017, at 8:54 AM, Tod Olson <[hidden email]> wrote:
 
It is also the case that BibDB::matchingIDs allows an "extras" parameter which names extra Solr fields to return for use in the browse listing. We use this in our Title Browse to return author, format, and year of publication. When there is a title with multiple formats or whatever, you see that. See a title browse for "hobbit" in our VuFind installation. Useful in Title browse and call number browse, but we do not use the extras in the authority browses. So must continue to support that.
 
So yes, probably best to devise a general solution. *think, think, think*
 
-Tod
 
On Jul 13, 2017, at 7:22 AM, Demian Katz <[hidden email]> wrote:
 
It’s possible that the search-by-exact-ID thing is a hold-over from earlier days of VuFind when search query construction was less consistent and reliable, or was simply a sign of lack of confidence in Solr. It may also help address edge cases where a record has changed in Solr but the browse index has not yet been updated – i.e. if a heading is removed or changed on a record but the record is not deleted.
 
In general, though, I’m inclined to agree with Mark that the whole thing may no longer be necessary… and indeed, if there are problems, having two different ways of resolving browse headings just creates more cases that need to be tested, potentially complicating troubleshooting and introducing weird behaviors. So it may make the most sense to switch to always doing things one consistent way.
 
I’d still be inclined to make this a configurable behavior (if it can be done at little cost) simply because there might be some use cases where it would be helpful for somebody to use the browse handler to fetch IDs – but that’s probably very conservative of me, and might not really be worth the effort.
 
- Demian
 
 
From: Tod Olson [[hidden email]] 
Sent: Thursday, July 13, 2017 7:43 AM
To: Mark Triggs
Cc: Tod Olson; Demian Katz; vufind-tech
Subject: Re: Optimization for alphabetic browse
 
Thanks, Mark!
 
You were probably thinking about the majority of cases (especially titles) where there are only one or very few matches. I'd be really interested to find cases where the heading is less reliable, I rather doubt you were making that up.
 
In any case, I may try to think a little more flexibly about how to address the issue.
 
Best,
 
-Tod
 
On Jul 13, 2017, at 6:03 AM, Mark Triggs <[hidden email]> wrote:
 
This may well disappoint, but I'm scratching my head as to why I ever bothered pulling out the list of IDs like that. Firing the search and getting the count should be cheap enough, but collecting all the IDs does seem like a waste of time.
I found this comment in the AlphaBrowse.php that I think I wrote:
 // Linking using bib ids is generally more reliable than doing
 // searches for headings, but headings give shorter queries and
 // don't look as strange.
But current me doesn't find that very compelling. Seems like it would be better to forget about collecting the IDs and just build the link based on the search heading. Like you say, should be much faster!
Cheers,
Mark
Tod Olson <[hidden email]> writes:
On implementing an optimization, I'm also thinking of having BibDB::matchingIDs check whether the extras parameter is empty and, if so, immediately check BibDB::recordCount; if the result is greater than some threshold, don't even bother with pulling IDs. Not a perfect solution, but maybe okay.
And will be interested to see what Mark has to say!
Thanks,
-Tod
-- 
Mark Triggs
<[hidden email]>



Re: Optimization for alphabetic browse

Demian Katz

Thanks, Tod, I’ve merged some low-hanging fruit to master, then merged master back into your PR. I think this makes everything a bit more readable now! Let me know if I’ve messed anything up (though I’ve tried to be careful). :-)

 

- Demian

 
