Developers Call Agenda - 2/2/16

Developers Call Agenda - 2/2/16

Demian Katz

The next developers call will be Tuesday, February 2, 2016 at 9am Eastern Standard Time (14:00 GMT).

AGENDA

1. Development Updates
2. Development Planning
    a. Improved Author Indexing
    b. Delimited Facets
    c. Eliminate "VuFind" Source in Database
    d. Solr Upgrade
    e. Javascript Reorganization
    f. Cover Issues
    g. API
    h. Modularization
    i. Improved Use of Permissions
    j. Session Performance Improvement
3. Other Topics?

More information on the free online call can be found at https://vufind.org/wiki/developers_call -- all are welcome!

- Demian

Re: Developers Call Agenda - 2/2/16

Günter Hipler
Hi Demian,

unfortunately nobody from the swissbib team is able to take part today because of an internal meeting.

Therefore, some input from our side in advance, related to the agenda:

d) Solr Upgrade
- As already discussed on the list, the 5.4.1 release fixes the more-like-this NullPointerException bug,
e.g. https://testvf.swissbib.ch/Record/336203004 (similar items)
- At the moment we don't use docValues for facet and sort fields, because on some facet fields we still perform text processing on the Solr side (e.g. [1], which uses no String types). But this is really our own problem, since we use a heavily modified index schema.
- In the next few days I want to run some performance tests. At the moment the performance is significantly slower compared to the productive 4.10 version. I guess the main reason could be that in production we are using SSD disks (hopefully...).

g) API
As I mentioned in the past, we want to start a project with Markus Maechler to implement a REST API for our linked-data project (in the context of his university education). Unfortunately the start is a little delayed. We want to present all our ideas, goals and design principles once there is enough to tell. At the moment most of the preparation is written only in German.

h) Modularization
I don't know if you follow the current efforts around ZF3. There are some resources which are quite valuable for keeping up with what's going on:
http://framework.zend.com/blog/
https://mwop.net/blog/2016-01-28-expressive-stable.html

From my point of view, the latest stable release of Expressive, the Zend microframework, is a major step towards encapsulating components (as part of a modularization process). These components could be used even outside of the VuFind application and could replace (where useful) some current VuFind use cases with a smaller resource footprint.

We are going to follow the current development, and it would be nice if more people took an interest in this. It will definitely take time, and it should be a topic for after the 3.0 release.


Günter
 


[1] https://github.com/swissbib/searchconf/blob/update/solr5x/solr/bib/solr.5.4/SOLR_HOME/sb-biblio/conf/schema.xml#L39



-- 
UNIVERSITÄT BASEL
Universitätsbibliothek
Günter Hipler
Projekt swissbib
Schönbeinstrasse 18-20
4056 Basel, Schweiz
Tel.: +41 61 267 31 12 
Fax: +41 61 267 31 03
E-Mail [hidden email]
URL www.swissbib.org


Re: Developers Call Agenda - 2/2/16

Uwe Reh
On 02.02.2016 11:48, Günter Hipler wrote:
> d) Solr Upgrade ...
> - In the next few days I want to run some performance tests. At the moment
> the performance is significantly slower compared to the productive 4.10
> version. I guess the main reason could be that in production we are
> using SSD disks (hopefully...)

Hi Günter,

don't waste your time.
With Solr 5.0, the fieldValueCache was deactivated at the Lucene level.
This means faceting isn't cached anymore, and no SSD can compensate for
this performance downgrade.
In our installation (10 facets) the factor is 600! (100 ms/query -->
60,000 ms/query)

You have just two options:
* Solr 5.4.x with docValues
* Stay with Solr 4.10.x

Uwe

Re: Developers Call Agenda - 2/2/16

Demian Katz
In reply to this post by Günter Hipler

Thanks for all of the valuable input – I’ll definitely mention all of these points on today’s call (and will do some catch-up reading on ZF3/Expressive very soon so we can discuss that in more detail next time around).

 

The Solr performance issue is very unfortunate – it seems that docValues are the solution, but this doesn’t really feel like a solution so much as a workaround! I’m not seeing much activity on the JIRA ticket about the performance problems (https://issues.apache.org/jira/browse/SOLR-8096), so we may be stuck having to work around it on our side.

 

I would be interested to hear what level of performance improvements you achieve by switching to docValues on your test instance – I think that it would be useful to at least confirm the order of magnitude of improvement on a large-scale index. I haven’t had time to do that sort of performance testing on this end yet.

 

Thanks again!

 

- Demian

 


Re: Developers Call Agenda - 2/2/16

Demian Katz
In reply to this post by Günter Hipler

Speaking of “switching to docValues,” I’ve just opened a pull request with a first attempt at implementing docValues in VuFind without changing the current behavior:

 

https://github.com/vufind-org/vufind/pull/588

 

This moves the facet normalization logic from the Solr field type into the SolrMarc indexer. If you want to do performance testing, you might find it helpful to try it with these changes in place. (Note that right now, this branch was built against master – I haven’t checked if it has any conflicts with the Solr5 branch… but if it does, I suspect that they are minor).
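
To give a rough idea of the schema side of this (purely illustrative – the field and type names below are examples, not copied from the pull request): facet fields end up as plain docValues-backed strings, so any normalization has to happen at index time rather than in a Solr analyzer. Something along these lines:

<!-- illustrative only: a docValues-backed string field for faceting/sorting -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="format" type="string" indexed="true" stored="true"
       multiValued="true" docValues="true"/>

Since solr.StrField has no analyzer, whatever normalization used to live in the field type now has to be applied by the indexer.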

 

Feedback/results welcomed, as always!

 

thanks,

Demian

 


solr 5.4.1 performance tests with facets

Günter Hipler
In reply to this post by Demian Katz
Hi

Sorry for the delayed answer. Last week I ran some performance tests on our fresh Solr 5.4.1 index, without docValues for facets.

My results:
- It makes a huge difference whether you are running on SSDs or not. With SSDs the performance is roughly the same as our productive index using version 4.10.
- In detail:
-- I used only queries collected from our logs on the productive servers, and picked only facet queries with at least one facet field.
-- total queries: 76656
   qTime < 100 milliseconds:  73270
   qTime > 500 milliseconds:  511
   qTime > 1000 milliseconds: 325
   qTime > 1500 milliseconds: 232
   qTime > 2500 milliseconds: 197
   qTime > 4000 milliseconds: 30
   longest qTime: 5724 milliseconds

For me these results are reasonable and more or less comparable with our 4.10 index.

But I don't think this is a reason to set the change to docValues aside. The reason I tested without docValues: I would be happy to postpone those adaptations for the moment, because there is a lot of other work to be done.

Demian, thanks for the link to https://issues.apache.org/jira/browse/SOLR-8096.
I wasn't aware that the Solr team has this severe problem, which still doesn't seem to be solved. From my point of view, they are losing touch with the development in the underlying Lucene layer.

I hope we (swissbib) can take part in today's dev call.

Günter



Re: solr 5.4.1 performance tests with facets

Demian Katz

Günter,

 

Thanks for sharing these results (and for your participation in today’s call). Out of curiosity, what process did you use to run these tests? Is there any possibility that you might be able to share data/scripts so that I can run the same tests on this end against various configurations of Solr, or is your schema so heavily customized that the queries/facet values would be meaningless to a “stock” VuFind instance?

 

In any case, I understand if you are unable to share the data for whatever reason – but it seemed worth asking in case it could save a bit of time with my own testing over here!

 

thanks,

Demian

 


Re: solr 5.4.1 performance tests with facets

Günter Hipler
Demian,

Of course you can take a look at what I have done. I just pushed the scripts I used [1]. You can find a short description of the ideas in the README file.

But there are some notes you should keep in mind, and a little background information:
- First I wanted to use the ELK stack (especially Logstash and Elasticsearch) for this kind of log analysis, which I have had in mind for quite some time, and I thought this might be a good moment to use it.
(We already use Elasticsearch for our linked-data project.)
- Unfortunately I couldn't find any preconfigured pipelines for Solr logs that I could use out of the box (which surprised me a lot). I stumbled upon something [2], but it didn't work out for my purposes.
- Spending half a day on this is more than enough, and since I needed results for my questions about the performance of Solr 5.4 servers, I made just a quick hack with Python and Mongo (see the description in the README). Perhaps you can use it in a similar way yourself.
- As I already mentioned in the dev call: the swissbib index schema [3] is quite different from the VF2 schema, and we don't use the SolrMarc processing pipeline (which is why I have to adapt the string normalization for docValues myself). What we call our "documentProcessing" is XSLT and Java based [4], and in the future I'm planning to combine the stream-based Metafacture framework [5] (initially created by the German National Library) with our current procedures. Metafacture-based workflows are already part of our linked-data project [6].

So it's not that we don't want to share things, but sometimes it is difficult to put them together for people who aren't really familiar with our work - I'm sorry for that.

Very best wishes from Basel!

Günter



[1] https://github.com/linked-swissbib/utilities/commit/8cadbe26271ee5803c73f7b42efed17ffef061a6
[2] http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/
[3] https://github.com/swissbib/searchconf/blob/master/solr/bib/conf-solr.4.10.2/configs/solr.home/bib/conf/schema.xml
[4] https://github.com/swissbib/content2SearchDocs
[5] https://github.com/culturegraph/metafacture-core
[6] https://github.com/linked-swissbib/mfWorkflows



Re: solr 5.4.1 performance tests with facets

Demian Katz

Thanks for sharing all of this (and the relevant background information). I’m not sure whether or not it will save me any time, but I’ll at least take a closer look and find out. Whatever happens, once I’ve had time to run some tests of my own, I’ll share my procedure with the group for future reference.

 

- Demian

 


Re: solr 5.4.1 performance tests with facets

Demian Katz
In reply to this post by Günter Hipler

Günter,

 

Thanks again for sharing this. As it happens, I wasn’t able to use these tools as-is, but they did help spark my thinking on the easiest approach for my own testing.

 

I’ll share the details in case anyone is interested…


I decided to take more of a Unix pipeline approach – create one tool that extracts parameters from Solr logs, and another tool that takes parameters as input and produces CSV output containing key statistics. Thus, I can do something like this:

 

php extractQueries.php < solr.log | grep "facet.field=" | sort | uniq | php runQueries.php > output.csv

 

Simple but flexible!

 

Here are my scripts:

 

extractQueries.php:

<?php
/**
 * Given a Solr log file (sent through STDIN), extract all parameters to STDOUT.
 */
while ($line = fgets(STDIN)) {
    $parts = explode(' ', $line);
    // params={...} is expected to be the tenth space-delimited token of the
    // log line; strip the "params={" prefix and the trailing "}".
    $params = substr($parts[9], 8, strlen($parts[9]) - 9);
    echo "$params\n";
}

runQueries.php:

<?php
/**
 * Given the output of extractQueries.php (sent to STDIN), create a CSV file
 * of results (sent to STDOUT).
 */
$base = "http://localhost:8082/solr/biblio/select?";
fputcsv(STDOUT, ['input', 'success', 'time', 'matches']);

while ($line = fgets(STDIN)) {
    // trim the trailing newline so the URL stays valid
    $url = $base . trim($line);
    $result = json_decode(file_get_contents($url));
    $success = isset($result->responseHeader->QTime) ? true : false;
    $csv = $success
        ? [$line, 'true', $result->responseHeader->QTime, $result->response->numFound]
        : [$line, 'false'];
    fputcsv(STDOUT, $csv);
}

 

Right now I’m still in the process of setting up my first test index, so perhaps these will be refined a little as I start using them… but in any case, here’s my intended procedure:

 

1.) Spin up standard VuFind test instance on solr5 branch
2.) Add an extra million records
3.) Run queries (I have a random sampling of about 600… hopefully that's reasonable) and save .csv file
4.) Shut down test instance
5.) Spin up standard VuFind test instance on solr5 branch with docValues changes merged in
6.) Repeat steps 2-4.
7.) Analyze .csv files

 

Hopefully that will show us whether or not there’s a significant measurable difference between the two configurations, even if some of the details aren’t as perfectly scientific as they might be.
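
For step 7, I haven't written anything yet – as a rough sketch, a small helper along these lines could summarize the CSV (the script name and the exact statistics here are just placeholders at this point):

<?php
/**
 * Rough sketch: read the CSV produced by runQueries.php from STDIN and
 * print a few simple QTime statistics to STDOUT.
 */
$times = [];
fgetcsv(STDIN); // skip the header row (input, success, time, matches)
while (($row = fgetcsv(STDIN)) !== false) {
    if (isset($row[1]) && $row[1] === 'true') {
        $times[] = (int)$row[2];
    }
}
sort($times);
$count = count($times);
if ($count > 0) {
    $median = $times[(int)floor($count / 2)];
    $p95 = $times[(int)floor($count * 0.95)];
    $max = $times[$count - 1];
    echo "successful queries: $count, median QTime: {$median}ms, "
        . "95th percentile: {$p95}ms, max: {$max}ms\n";
}

It could be run as something like: php summarizeResults.php < output.csv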

 

I’ll post results as soon as I have them. In the meantime, I’m open to suggestions for refining the process!

 

Thanks,

Demian

 

From: Günter Hipler [mailto:[hidden email]]
Sent: Wednesday, February 17, 2016 5:53 AM
To: Demian Katz; [hidden email]
Subject: Re: solr 5.4.1 performance tests with facets

 

Demian,

Of course you can take a look at what I have done. I just pushed the scripts I used [1]; you can find a short description of the ideas in the README file.

But here are some notes you should keep in mind, plus a little background information:
- At first I wanted to use the ELK stack (especially Logstash and Elasticsearch) for this kind of log analysis, something I have had in mind for quite some time, and I thought this might be a good moment to try it. (We already use Elasticsearch for our linked-data project.)
- Unfortunately I didn't find any ready-made pipelines for Solr logs that I could use out of the box (which surprised me a lot). I stumbled upon something [2], but it didn't work out for my purposes.
- Spending half a day on this is more than enough, and since I needed answers to my questions about the performance of Solr 5.4 servers, I just made a quick hack with Python and Mongo (see the description in the README). Perhaps you can use it in a similar way yourself.
- As I already mentioned on the dev call: the swissbib index schema is quite different from the VF2 schema, and we don't use the SolrMarc processing pipeline (which is why I have to adapt the string normalization for docValues myself). What we call "documentProcessing" is XSLT and Java based [4], and in the future I plan to combine the stream-based MetaFacture framework [5] (initially created by the German National Library) with our current procedures. MetaFacture-based workflows are already part of our linked-data project [6].

So it's not that we don't want to share; it's just that it can be difficult to package things for people who aren't familiar with our work. I'm sorry for that.

Very best wishes from Basel!

Günter



[1] https://github.com/linked-swissbib/utilities/commit/8cadbe26271ee5803c73f7b42efed17ffef061a6
[2] http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/
[3] https://github.com/swissbib/searchconf/blob/master/solr/bib/conf-solr.4.10.2/configs/solr.home/bib/conf/schema.xml
[4] https://github.com/swissbib/content2SearchDocs
[5] https://github.com/culturegraph/metafacture-core
[6] https://github.com/linked-swissbib/mfWorkflows

On 02/16/2016 05:36 PM, Demian Katz wrote:

Günter,

 

Thanks for sharing these results (and for your participation in today’s call). Out of curiosity, what process did you use to run these tests? Is there any possibility that you might be able to share data/scripts so that I can run the same tests on this end against various configurations of Solr, or is your schema so heavily customized that the queries/facet values would be meaningless to a “stock” VuFind instance?

 

In any case, I understand if you are unable to share the data for whatever reason – but it seemed worth asking in case it could save a bit of time with my own testing over here!

 

thanks,

Demian

 

From: Günter Hipler [[hidden email]]
Sent: Tuesday, February 16, 2016 8:25 AM
To: Demian Katz; [hidden email]
Cc: [hidden email]
Subject: solr 5.4.1 performance tests with facets

 

Hi

Sorry for the delayed answer. Last week I ran some performance tests on our fresh Solr 5.4.1 index, without docValues for facets.

My results:
- It makes a huge difference whether or not you are running on SSDs. With SSDs the performance is about the same as our production index running version 4.10.
- In detail:
-- I used only queries collected from the logs of our production servers, and only facet queries with at least one facet field.
-- total queries: 76656
qTime < 100 milliseconds:  73270
qTime > 500 milliseconds:  511
qTime > 1000 milliseconds: 325
qTime > 1500 milliseconds: 232
qTime > 2500 milliseconds: 197
qTime > 4000 milliseconds: 30
longest qTime: 5724 milliseconds

For me these results are reasonable and more or less comparable with our 4.10 index.

But I don't think this is a reason to set the change to docValues aside. The reason I tested without docValues: I would be happy to postpone those adaptations for the moment, because there is a lot of other work to be done.

Demian, thanks for the link to https://issues.apache.org/jira/browse/SOLR-8096.
I wasn't aware that the Solr team has this severe problem, which still doesn't seem to be solved. From my point of view they are losing touch with development in the underlying Lucene foundation.

I hope we (swissbib) can take part in today's dev call.

Günter


On 02/02/2016 02:55 PM, Demian Katz wrote:

Thanks for all of the valuable input – I’ll definitely mention all of these points on today’s call (and will do some catch-up reading on ZF3/Expressive very soon so we can discuss that in more detail next time around).

 

The Solr performance issue is very unfortunate – it seems that docValues are the solution, but this doesn’t really feel like a solution so much as a workaround! I’m not seeing much activity on the JIRA ticket about the performance problems (https://issues.apache.org/jira/browse/SOLR-8096), so we may be stuck having to do something.

 

I would be interested to hear what level of performance improvements you achieve by switching to docValues on your test instance – I think that it would be useful to at least confirm the order of magnitude of improvement on a large-scale index. I haven’t had time to do that sort of performance testing on this end yet.

 

Thanks again!

 

- Demian

 



Re: solr 5.4.1 performance tests with facets

Demian Katz

My first round of testing is complete, and the results are puzzling.

 

First of all, there’s one small bug in the code I posted earlier… the while loop in runQueries.php should include “$line = trim($line);” at the top to prevent trailing carriage returns from causing problems with executing queries and building well-formed CSV data. (If anybody else really wants this code, I can post the final form somewhere – just let me know).
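For reference, the corrected loop looks like this (only the trim line is new; everything else is as in the version posted earlier):

while ($line = fgets(STDIN)) {
    $line = trim($line); // strip the trailing newline/carriage return before building the URL
    $url = $base . $line;
    $result = json_decode(file_get_contents($url));
    $success = isset($result->responseHeader->QTime) ? true : false;
    $csv = $success
        ? [$line, 'true', $result->responseHeader->QTime, $result->response->numFound]
        : [$line, 'false'];
    fputcsv(STDOUT, $csv);
}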

 

But more importantly, the first round of results seems to suggest that adding docValues makes things worse!

 

Without docvalues, my first time running my 610 sample queries took 48.412 seconds, an average of ~79.364ms per query, with a maximum query time of 4.053s.

 

With docvalues, my first time running the same queries took 55.847 seconds, an average of ~91.552ms per query, with a maximum query time of 7.352s.

 

That’s obviously not the result I expected to see, since the index that was supposed to be faster was actually significantly slower. However, strangely, if I repeat the test, subsequent runs with the docvalues index are much faster (and when I say “repeat the test,” that includes restarting Solr to clear out any in-memory caching… so Solr caches don’t explain the speed increase, unless they are more persistent than I thought; perhaps this reflects OS-level file caching making index file loading faster). Unfortunately, I didn’t capture multiple runs with the non-docvalues index yet, since my time today was limited.

 

Bottom line: my results are inconclusive and confusing. I think I need to try this again with a bigger data set and with more runs under more circumstances. I should try multiple runs both with and without restarting Solr. Perhaps I should also reboot my entire server between tests for a cleaner environment. I should also run the tests against the current Solr4 code for further comparison points. I’ll try to do some of this tomorrow or next week. Suggestions for a better procedure would be welcomed!

 

thanks,

Demian

 


Re: solr 5.4.1 performance tests with facets

Demian Katz

One more data point: I’ve confirmed that if I reboot my entire server and then repeat my test on the docvalues branch, I get very slow results, but subsequent results are quite fast. So this may be related to OS-level loading of docValues data files into cache. That might also suggest that with these changes in place, we’re going to want to devote less memory to Java heap for Solr in order to free up more room for the OS to do its part. So I think for my next round of testing, I should capture data for at least four scenarios:

 

1.)    Server just rebooted, first run.

2.)    Second run without restarting Solr.

3.)    First run after restarting Solr, but not rebooting server.

4.)    Second run after that, without restarting Solr again.
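To keep the four scenarios consistent, I could drive them with something like the following rough sketch (it reuses runQueries.php from earlier in this thread; queries.txt and the run*.csv names are purely illustrative):

<?php
/**
 * Sketch of a test driver: run the same query file once per scenario,
 * pausing so Solr (or the whole server) can be restarted manually in between.
 */
$scenarios = [
    1 => 'first run after a full server reboot',
    2 => 'second run, no restarts',
    3 => 'first run after restarting Solr only',
    4 => 'second run after that, no restarts',
];
foreach ($scenarios as $run => $description) {
    echo "Prepare the environment for run $run ($description), then press ENTER: ";
    fgets(STDIN);
    // queries.txt = deduplicated output of extractQueries.php; one CSV per run.
    passthru("php runQueries.php < queries.txt > run{$run}.csv");
}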

 

I would expect runs 2 and 4 to have similar characteristics… run 1 to be the slowest, and run 3 to be a little bit slower than 2 and 4. It will be interesting to see how all of those scenarios compare across solr 5 with docValues, solr 5 without docValues, and solr 4.

 

More as soon as I manage to collect the relevant data!

 

- Demian

 


Re: solr 5.4.1 performance tests with facets

Demian Katz
Okay, I have finished my first round of testing:


                        Solr 4 (no docvalues)   Solr 5 (no docvalues)   Solr 5 (docvalues)
Run 1: total time (ms)  78852                   71449                   56496
Run 1: max time (ms)    9759                    8811                    7463
Run 1: avg time (ms)    129.27                  117.13                  92.62
Run 2: total time (ms)  6396                    7505                    7272
Run 2: max time (ms)    431                     477                     489
Run 2: avg time (ms)    10.49                   12.3                    11.92
Run 3: total time (ms)  13637                   14110                   12187
Run 3: max time (ms)    2198                    2386                    827
Run 3: avg time (ms)    22.36                   23.13                   19.98
Run 4: total time (ms)  6643                    8058                    7493
Run 4: max time (ms)    438                     542                     433
Run 4: avg time (ms)    10.89                   13.21                   12.28

This data is for running 610 different queries, all including facet parameters, against an index of over 1,000,000 items. Run 1 was executed immediately after rebooting the server, with no other processes running. Run 2 was executed immediately after Run 1, with no environmental changes made. Run 3 was executed immediately after restarting Solr. Run 4 was executed immediately after Run 3, with no environmental changes made. All times are in ms. I confirmed that all runs yielded identical numbers of results, so I'm confident that the only difference between each scenario was the execution time, not the output.

As you can see, every configuration shows the same basic pattern: the first run is very slow, since both Solr and OS-level caches have to be populated. The remaining runs are much faster, with Run 3 slower than Runs 2 and 4, since after a Solr restart Solr has to rebuild its own caches. Runs 2 and 4 are very similar to one another in terms of performance.

I've saved all of my indexes, so I can repeat these tests a few times if people would like to see how things average out across multiple executions... but for now I decided not to spend more time until after we've discussed this initial data set. I'm also happy to share more detailed spreadsheets if anyone wishes to study the results in a more granular fashion. But the bottom line is that I'm not seeing the order-of-magnitude performance changes described in SOLR-8096. I expected that the "solr5" results would be much worse than the "solr4" results, with "solr5-docvalues" results being somewhere in the middle. As you can see, that's not really the case. You might be able to extrapolate some patterns from this data, but they're not nearly as extreme as I had feared. Maybe that's due to a flaw in my test set, and I would welcome more data from anyone else willing to run similar experiments... but right now I'm not seeing anything too conclusive. On the one hand, this is a relief -- it seems to suggest that the Solr 5 upgrade is not going to be the performance game-changer I had feared; on the other hand, it's frustrating, because I'd really like to see clearer cause-and-effect here!

- Demian


From: Demian Katz [[hidden email]]
Sent: Thursday, February 18, 2016 3:46 PM
To: Günter Hipler; [hidden email]
Subject: Re: [VuFind-Tech] solr 5.4.1 performance tests with facets

One more data point: I’ve confirmed that if I reboot my entire server and then repeat my test on the docvalues branch, I get very slow results, but subsequent results are quite fast. So this may be related to OS-level loading of docValues data files into cache. That might also suggest that with these changes in place, we’re going to want to devote less memory to Java heap for Solr in order to free up more room for the OS to do its part. So I think for my next round of testing, I should capture data for at least four scenarios:

 

1.)    Server just rebooted, first run.

2.)    Second run without restarting Solr.

3.)    First run after restarting Solr, but not rebooting server.

4.)    Second run without restarting Solr.

 

I would expect runs 2 and 4 to have similar characteristics… run 1 to be the slowest, and run 3 to be a little bit slower than 2 and 4. It will be interesting to see how all of those scenarios compare across solr 5 with docValues, solr 5 without docValues, and solr 4.

 

More as soon as I manage to collect the relevant data!

 

- Demian

 

From: Demian Katz
Sent: Thursday, February 18, 2016 3:38 PM
To: Demian Katz; Günter Hipler; [hidden email]
Subject: RE: solr 5.4.1 performance tests with facets

 

My first round of testing is complete, and with puzzling results.

 

First of all, there’s one small bug in the code I posted earlier… the while loop in runQueries.php should include “$line = trim($line);” at the top to prevent trailing carriage returns from causing problems with executing queries and building well-formed CSV data. (If anybody else really wants this code, I can post the final form somewhere – just let me know).

 

But more importantly, the first round of results seems to suggest that adding docValues makes things worse!

 

Without docvalues, my first time running my 610 sample queries took 48.412 seconds, an average of ~79.364ms per query, with a maximum query time of 4.053s.

 

With docvalues, my first time running the same queries took 55.847 seconds, an average of ~91.552ms per query, with a maximum query time of 7.352s.

 

That’s obviously not the result I expected to see, since the index that was supposed to be faster was actually significantly slower. However, strangely, if I repeat my test, subsequent runs with the docvalues index are much faster (and of course, when I say “repeat the test,” that includes restarting Solr to clear out any in-memory caching… so Solr caches don’t explain the speed increase, unless they are more persistent than I thought they were; perhaps this is actually reflective of some OS-level file caching making the index file loading faster). Unfortunately, I didn’t capture multiple runs with the non-docvalues index yet, since my time today was limited.

 

Bottom line: my results are inconclusive and confusing. I think I need to try this again with a bigger data set and with more runs under more circumstances. I should try multiple runs both with and without restarting Solr. Perhaps I should also reboot my entire server between tests for a cleaner environment. I should also run the tests against the current Solr4 code for further comparison points. I’ll try to do some of this tomorrow or next week. Suggestions for a better procedure would be welcomed!

 

thanks,

Demian

 

From: Demian Katz [[hidden email]]
Sent: Thursday, February 18, 2016 9:55 AM
To: Günter Hipler; [hidden email]
Subject: Re: [VuFind-Tech] solr 5.4.1 performance tests with facets

 

Günter,

 

Thanks again for sharing this. As it happens, I wasn’t able to use these tools as-is, but they did help spark my thinking on the easiest approach for my own testing.

 

I’ll share the details in case anyone is interested…


I decided to take more of a Unix pipeline approach – create one tool that extracts parameters from Solr logs, and another tool that takes parameters as input and produces CSV output containing key statistics. Thus, I can do something like this:

 

php extractQueries.php < solr.log | grep "facet.field=" | sort | uniq | php runQueries.php > output.csv

 

Simple but flexible!

 

Here are my scripts:

 

extractQueries.php:

<?php

/**

* Given a Solr log file (sent through STDIN), extract all parameters to STDOUT.

*/

while ($line = fgets(STDIN)) {

    $parts = explode(' ', $line);

    $params = substr($parts[9], 8, strlen($parts[9]) - 9);

    echo "$params\n";

}

 

runQueries.php:

<?php

/**

* Given the output of extractQueries.php (sent to STDIN), create a CSV file of results (sent to STDOUT).

*/

$base = "http://localhost:8082/solr/biblio/select?";

fputcsv(STDOUT, ['input', 'success', 'time', 'matches']);

 

while ($line = fgets(STDIN)) {

    $url = $base . $line;

    $result = json_decode(file_get_contents($url));

    $success = isset($result->responseHeader->QTime) ? true : false;

    $csv = $success

        ? [$line, 'true', $result->responseHeader->QTime, $result->response->numFound]

        : [$line, 'false'];

    fputcsv(STDOUT, $csv);

}

 

Right now I’m still in the process of setting up my first test index, so perhaps these will be refined a little as I start using them… but in any case, here’s my intended procedure:

 

1.)    Spin up standard VuFind test instance on solr5 branch

2.)    Add an extra million records

3.)    Run queries (I have a random sampling of about 600… hopefully that’s reasonable) and save .csv file

4.)    Shut down test instance

5.)    Spin up standard VuFind test instance on solr5 branch with docValues changes merged in

6.)    Repeat steps 2-4.

7.)    Analyze .csv files

 

Hopefully that will show us whether or not there’s a significant measurable difference between the two configurations, even if some of the details aren’t as perfectly scientific as they might be.

 

I’ll post results as soon as I have them. In the meantime, I’m open to suggestions for refining the process!

 

Thanks,

Demian

 

From: Günter Hipler [[hidden email]]
Sent: Wednesday, February 17, 2016 5:53 AM
To: Demian Katz; [hidden email]
Subject: Re: solr 5.4.1 performance tests with facets

 

Demian,

of course you can take a look into what I have done. I just pushed the scripts I used [1]. You can find a short description of the ideas in the README  file

But some notes you should keep in mind and a little background information:
- First I wanted to use the ELK stack (especially Logstash and Elasticsearch) for such log analysis which I have in mind since quite some time and I thought this might be a good moment in time to use it.
(We already use ElasticSearch for our linked project)
- Unfortunately I haven't found just already configured pipelines for Solr logs  I could use out of the box (which surprised me a lot). I stumbled upon something [2]  but this didn't work out for my ideas
- Spending half a day for such purposes is more than enough and  I needed results for my questions about performance of Solr 5.4 servers I made just a quick hack with python and Mongo ... (see the description in README) Perhaps you can use it in a similar way for yourself
- As I already mentioned in the dev-call: The swissbib index schema is quite different compared to the VF2 schema and we don't use the SolrMarc process pipeline (which is the reason why I have to adapt the String normalization stuff for doc-values by myself). Our what we call "documentProcessing" is XSLT and Java based [4]  and in the future I'm planning to use a combination of the stream-based MetaFacture - Framework [5] (initially created by the German National Library) in combination with our current procedures. MetaFacture based workflows are already part of our linked  project [6]

So it's not we don't want to share something but sometimes it is difficult to put things together for people not really familiar with our work - I'm sorry for this.

Very best wishes from Basel!

Günter



[1] https://github.com/linked-swissbib/utilities/commit/8cadbe26271ee5803c73f7b42efed17ffef061a6
[2] http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/
[3] https://github.com/swissbib/searchconf/blob/master/solr/bib/conf-solr.4.10.2/configs/solr.home/bib/conf/schema.xml
[4] https://github.com/swissbib/content2SearchDocs
[5] https://github.com/culturegraph/metafacture-core
[6] https://github.com/linked-swissbib/mfWorkflows

On 02/16/2016 05:36 PM, Demian Katz wrote:

Günter,

 

Thanks for sharing these results (and for your participation in today’s call). Out of curiosity, what process did you do to run these tests? Is there any possibility that you might be able to share data/scripts so that I can run the same tests on this end against various configurations of Solr, or is your schema so heavily customized that the queries/facet values would be meaningless to a “stock” VuFind instance?

 

In any case, I understand if you are unable to share the data for whatever reason – but it seemed worth asking in case it could save a bit of time with my own testing over here!

 

thanks,

Demian

 

From: Günter Hipler [[hidden email]]
Sent: Tuesday, February 16, 2016 8:25 AM
To: Demian Katz;
[hidden email]
Cc:
[hidden email]
Subject: solr 5.4.1 performance tests with facets

 

Hi

sorry for the delayed answer. Last week I made some performance tests on our fresh Solr 5.4.1 index without doc-values for facets.

My results:
- it makes a huge difference if you are running on SSD or not. With SSD's the performance is quite the same compared to our productive index using version 4.10
- in detail:
-- I used only queries I collected from our logs on the productive servers and picked up only facet queries with at least one facet field
-- total queries: 76656
qTime < 100 milliseconds:  73270
qTime > 500 milliseconds: 511
qTime > 1000 milliseconds: 325
qTime > 1500 milliseconds: 232
qTime > 2500 milliseconds: 197
qTime > 4000 milliseconds: 30
qTime longest 5724

For me these results are reasonable and more or less comparable with our 4.10 Index.

But I think it's no reason to set change to doc-values aside. The reason why I tested it without doc-values: I would be lucky to postpone the adaptations for the moment because there is a lot of other work to be done.

 Demian, thanks for the link to https://issues.apache.org/jira/browse/SOLR-8096.
I wasn't aware, the Solr team has this severe problem which seems not to be solved until now. From my point of view: they are loosing connection to the development in the underlying Lucene building.

Hope we (swissbib) can take part at todays dev - call.

Günter

On 02/02/2016 02:55 PM, Demian Katz wrote:

Thanks for all of the valuable input – I’ll definitely mention all of these points on today’s call (and will do some catch-up reading on ZF3/Expressive very soon so we can discuss that in more detail next time around).

 

The Solr performance issue is very unfortunate – it seems that docValues are the solution, but this doesn’t really feel like a solution so much as a workaround! I’m not seeing much activity on the JIRA ticket about the performance problems (https://issues.apache.org/jira/browse/SOLR-8096), so we may be stuck having to do something.

 

I would be interested to hear what level of performance improvements you achieve by switching to docValues on your test instance – I think that it would be useful to at least confirm the order of magnitude of improvement on a large-scale index. I haven’t had time to do that sort of performance testing on this end yet.

 

Thanks again!

 

- Demian

 




-- 
UNIVERSITÄT BASEL
Universitätsbibliothek
Günter Hipler
Projekt swissbib
Schönbeinstrasse 18-20
4056 Basel, Schweiz
Tel.: +41 61 267 31 12 
Fax: +41 61 267 31 03
E-Mail [hidden email]
URL www.swissbib.org

 

Reply | Threaded
Open this post in threaded view
|

Re: solr 5.4.1 performance tests with facets

Ere Maijala
...and now there's Solr 5.5.0 with at least one interesting item in the
changes list:

* Uninverted field faceting is re-enabled, for higher performance on
    rarely changing indices

This might warrant a new round of testing.
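
For a quick spot check, something along these lines could compare the re-enabled method against the default on a single facet query. This is only a sketch: the core URL is borrowed from Demian's scripts below, the "format" field is just a typical VuFind facet field, and whether facet.method=uif (the parameter that, as far as I can tell, exposes the UnInvertedField implementation) actually takes effect depends on the field and the other request parameters:

<?php
// Compare default facet processing with the uif method on one facet query.
// Assumptions: a local "biblio" core on port 8082 and a "format" facet field.
$base = 'http://localhost:8082/solr/biblio/select?q=*:*&rows=0&wt=json'
    . '&facet=true&facet.field=format&facet.limit=30';
foreach (['default' => '', 'uif' => '&facet.method=uif'] as $label => $extra) {
    $result = json_decode(file_get_contents($base . $extra));
    printf("%-8s QTime=%d ms\n", $label, $result->responseHeader->QTime);
}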

--Ere

19.2.2016, 17.38, Demian Katz kirjoitti:

> Okay, I have finished my first round of testing:
>
>
>                      Solr 4 (no docvalues)   Solr 5 (no docvalues)   Solr 5 (docvalues)
> Run 1: total time    78852                   71449                   56496
> Run 1: max time      9759                    8811                    7463
> Run 1: avg time      129.27                  117.13                  92.62
> Run 2: total time    6396                    7505                    7272
> Run 2: max time      431                     477                     489
> Run 2: avg time      10.49                   12.3                    11.92
> Run 3: total time    13637                   14110                   12187
> Run 3: max time      2198                    2386                    827
> Run 3: avg time      22.36                   23.13                   19.98
> Run 4: total time    6643                    8058                    7493
> Run 4: max time      438                     542                     433
> Run 4: avg time      10.89                   13.21                   12.28
>
>
> This data is for running 610 different queries, all including facet
> parameters, against an index of over 1,000,000 items. Run 1 was executed
> immediately after rebooting the server, with no other processes running.
> Run 2 was executed immediately after Run 1, with no environmental
> changes made. Run 3 was executed immediately after restarting Solr. Run
> 4 was executed immediately after Run 3, with no environmental changes
> made. All times are in ms. I confirmed that all runs yielded identical
> numbers of results, so I'm confident that the only difference between
> each scenario was the execution time, not the output.
>
> As you can see, every scenario is showing the same basic pattern: the
> first run is very slow, since both Solr and OS-level caches have to be
> populated. The remaining runs are much faster, with Run 3 being slower
> than 2 and 4 since in this scenario, Solr has to rebuild its own caches.
> Runs 2 and 4 are very similar to one another in terms of performance.
>
> I've saved all of my indexes, so I can repeat these tests a few times if
> people would like to see how things average out across multiple
> executions... but for now I decided not to spend more time until after
> we've discussed this initial data set. I'm also happy to share more
> detailed spreadsheets if anyone wishes to study the results in a more
> granular fashion. But the bottom line is that I'm not seeing the
> order-of-magnitude performance changes described in SOLR-8096. I
> expected that the "solr5" results would be much worse than the "solr4"
> results, with "solr5-docvalues" results being somewhere in the middle.
> As you can see, that's not really the case. You might be able to
> extrapolate some patterns from this data, but they're not nearly as
> extreme as I had feared. Maybe that's due to a flaw in my test set, and
> I would welcome more data from anyone else willing to run similar
> experiments... but right now I'm not seeing anything too conclusive. On
> the one hand, this is a relief -- it seems to suggest that the Solr 5
> upgrade is not going to be the performance game-changer I had feared;
> on the other hand, it's frustrating, because I'd really like to see
> clearer cause-and-effect here!
>
> - Demian
>
> ------------------------------------------------------------------------
> *From:* Demian Katz [[hidden email]]
> *Sent:* Thursday, February 18, 2016 3:46 PM
> *To:* Günter Hipler; [hidden email]
> *Subject:* Re: [VuFind-Tech] solr 5.4.1 performance tests with facets
>
> One more data point: I’ve confirmed that if I reboot my entire server
> and then repeat my test on the docvalues branch, I get very slow
> results, but subsequent results are quite fast. So this may be related
> to OS-level loading of docValues data files into cache. That might also
> suggest that with these changes in place, we’re going to want to devote
> less memory to Java heap for Solr in order to free up more room for the
> OS to do its part. So I think for my next round of testing, I should
> capture data for at least four scenarios:
>
> 1.) Server just rebooted, first run.
>
> 2.) Second run without restarting Solr.
>
> 3.) First run after restarting Solr, but not rebooting server.
>
> 4.) Second run without restarting Solr.
>
> I would expect runs 2 and 4 to have similar characteristics… run 1 to be
> the slowest, and run 3 to be a little bit slower than 2 and 4. It will
> be interesting to see how all of those scenarios compare across solr 5
> with docValues, solr 5 without docValues, and solr 4.
>
> More as soon as I manage to collect the relevant data!
>
> - Demian
>
> *From:* Demian Katz
> *Sent:* Thursday, February 18, 2016 3:38 PM
> *To:* Demian Katz; Günter Hipler; [hidden email]
> *Subject:* RE: solr 5.4.1 performance tests with facets
>
> My first round of testing is complete, and with puzzling results.
>
> First of all, there’s one small bug in the code I posted earlier… the
> while loop in runQueries.php should include “$line = trim($line);” at
> the top to prevent trailing carriage returns from causing problems with
> executing queries and building well-formed CSV data. (If anybody else
> really wants this code, I can post the final form somewhere – just let
> me know).
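>
> (For reference, a minimal sketch of the loop with that trim() fix applied;
> apart from the added line it is the same loop as in the script further down:)
>
> while ($line = fgets(STDIN)) {
>     $line = trim($line); // strip the trailing newline/carriage return first
>     $url = $base . $line;
>     $result = json_decode(file_get_contents($url));
>     $success = isset($result->responseHeader->QTime) ? true : false;
>     $csv = $success
>         ? [$line, 'true', $result->responseHeader->QTime, $result->response->numFound]
>         : [$line, 'false'];
>     fputcsv(STDOUT, $csv);
> }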
>
> But more importantly, the first round of results seems to suggest that
> adding docValues makes things worse!
>
> Without docvalues, my first time running my 610 sample queries took
> 48.412 seconds, an average of ~79.364ms per query, with a maximum query
> time of 4.053s.
>
> With docvalues, my first time running the same queries took 55.847
> seconds, an average of ~91.552ms per query, with a maximum query time of
> 7.352s.
>
> That’s obviously not the result I expected to see, since the index that
> was supposed to be faster was actually significantly slower. However,
> strangely, if I repeat my test, subsequent runs with the docvalues index
> are much faster (and of course, when I say “repeat the test,” that
> includes restarting Solr to clear out any in-memory caching… so Solr
> caches don’t explain the speed increase, unless they are more persistent
> than I thought they were; perhaps this is actually reflective of some
> OS-level file caching making the index file loading faster).
> Unfortunately, I didn’t capture multiple runs with the non-docvalues
> index yet, since my time today was limited.
>
> Bottom line: my results are inconclusive and confusing. I think I need
> to try this again with a bigger data set and with more runs under more
> circumstances. I should try multiple runs both with and without
> restarting Solr. Perhaps I should also reboot my entire server between
> tests for a cleaner environment. I should also run the tests against the
> current Solr4 code for further comparison points. I’ll try to do some of
> this tomorrow or next week. Suggestions for a better procedure would be
> welcomed!
>
> thanks,
>
> Demian
>
> *From:* Demian Katz [mailto:[hidden email]]
> *Sent:* Thursday, February 18, 2016 9:55 AM
> *To:* Günter Hipler; [hidden email]
> <mailto:[hidden email]>
> *Subject:* Re: [VuFind-Tech] solr 5.4.1 performance tests with facets
>
> Günter,
>
> Thanks again for sharing this. As it happens, I wasn’t able to use these
> tools as-is, but they did help spark my thinking on the easiest approach
> for my own testing.
>
> I’ll share the details in case anyone is interested…
>
>
> I decided to take more of a Unix pipeline approach – create one tool
> that extracts parameters from Solr logs, and another tool that takes
> parameters as input and produces CSV output containing key statistics.
> Thus, I can do something like this:
>
> php extractQueries.php < solr.log | grep "facet.field=" | sort | uniq |
> php runQueries.php > output.csv
>
> Simple but flexible!
>
> Here are my scripts:
>
> extractQueries.php:
>
> <?php
> /**
>  * Given a Solr log file (sent through STDIN), extract all parameters to STDOUT.
>  */
> while ($line = fgets(STDIN)) {
>     $parts = explode(' ', $line);
>     $params = substr($parts[9], 8, strlen($parts[9]) - 9);
>     echo "$params\n";
> }
>
> runQueries.php:
>
> <?php
> /**
>  * Given the output of extractQueries.php (sent to STDIN), create a CSV
>  * file of results (sent to STDOUT).
>  */
> $base = "http://localhost:8082/solr/biblio/select?";
> fputcsv(STDOUT, ['input', 'success', 'time', 'matches']);
> while ($line = fgets(STDIN)) {
>     $url = $base . $line;
>     $result = json_decode(file_get_contents($url));
>     $success = isset($result->responseHeader->QTime) ? true : false;
>     $csv = $success
>         ? [$line, 'true', $result->responseHeader->QTime, $result->response->numFound]
>         : [$line, 'false'];
>     fputcsv(STDOUT, $csv);
> }
>
> Right now I’m still in the process of setting up my first test index, so
> perhaps these will be refined a little as I start using them… but in any
> case, here’s my intended procedure:
>
> 1.) Spin up standard VuFind test instance on solr5 branch
>
> 2.) Add an extra million records
>
> 3.) Run queries (I have a random sampling of about 600… hopefully that's
> reasonable) and save .csv file
>
> 4.) Shut down test instance
>
> 5.) Spin up standard VuFind test instance on solr5 branch with docValues
> changes merged in
>
> 6.) Repeat steps 2-4.
>
> 7.) Analyze .csv files
>
> Hopefully that will show us whether or not there’s a significant
> measurable difference between the two configurations, even if some of
> the details aren’t as perfectly scientific as they might be.
>
> I’ll post results as soon as I have them. In the meantime, I’m open to
> suggestions for refining the process!
>
> Thanks,
>
> Demian
>
> *From:* Günter Hipler [mailto:[hidden email]]
> *Sent:* Wednesday, February 17, 2016 5:53 AM
> *To:* Demian Katz; [hidden email]
> <mailto:[hidden email]>
> *Subject:* Re: solr 5.4.1 performance tests with facets
>
> Demian,
>
> of course you can take a look at what I have done. I just pushed the
> scripts I used [1]. You can find a short description of the ideas in the
> README file.
>
> But here are some notes to keep in mind, and a little background information:
> - First I wanted to use the ELK stack (especially Logstash and
> Elasticsearch) for this kind of log analysis, something I have had in mind
> for quite some time, and I thought this might be a good moment to use it.
> (We already use Elasticsearch for our linked-data project.)
> - Unfortunately I haven't found any ready-made pipelines for Solr logs
> that I could use out of the box (which surprised me a lot). I stumbled
> upon something [2], but it didn't work out for my purposes.
> - Spending half a day on this was more than enough, and since I needed
> results for my questions about the performance of the Solr 5.4 servers, I
> made just a quick hack with Python and MongoDB (see the description in the
> README). Perhaps you can use it in a similar way yourself.
> - As I already mentioned in the dev call: the swissbib index schema is
> quite different from the VF2 schema, and we don't use the SolrMarc
> processing pipeline (which is why I have to adapt the string normalization
> for doc-values myself). What we call "documentProcessing" is XSLT- and
> Java-based [4], and in the future I'm planning to combine the stream-based
> MetaFacture framework [5] (initially created by the German National
> Library) with our current procedures. MetaFacture-based workflows are
> already part of our linked-data project [6].
>
> So it's not that we don't want to share things; it's just sometimes
> difficult to put them together for people who aren't familiar with our
> work - I'm sorry about that.
>
> Very best wishes from Basel!
>
> Günter
>
>
>
> [1]
> https://github.com/linked-swissbib/utilities/commit/8cadbe26271ee5803c73f7b42efed17ffef061a6
> [2] http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/
> [3]
> https://github.com/swissbib/searchconf/blob/master/solr/bib/conf-solr.4.10.2/configs/solr.home/bib/conf/schema.xml
> [4] https://github.com/swissbib/content2SearchDocs
> [5] https://github.com/culturegraph/metafacture-core
> [6] https://github.com/linked-swissbib/mfWorkflows
>

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland

Reply | Threaded
Open this post in threaded view
|

Re: solr 5.4.1 performance tests with facets

Günter Hipler
Thanks, Ere, for this information - I will definitely run a second round of tests.

Günter

On 02/24/2016 08:10 AM, Ere Maijala wrote:

> ...and now there's Solr 5.5.0 with at least one interesting item in the
> changes list:
>
> * Uninverted field faceting is re-enabled, for higher performance on
>      rarely changing indices
>
> This might warrant a new round of testing.
>
> --Ere
>

--
UNIVERSITÄT BASEL
Universitätsbibliothek
Günter Hipler
Projekt swissbib
Schönbeinstrasse 18-20
4056 Basel, Schweiz
Tel.: +41 61 267 31 12
Fax: +41 61 267 31 03
E-Mail [hidden email]
URL www.swissbib.org


Reply | Threaded
Open this post in threaded view
|

Re: solr 5.4.1 performance tests with facets

Demian Katz
In reply to this post by Ere Maijala
That's great news! I'll work on upgrading/testing the Solr 5 branch today and will post additional performance results (both with and without docValues) unless I run into an unexpected obstacle.

Other exciting news: SOLR-2649 has finally been resolved (thanks to Greg Pendlebury, among others), which means we can revisit using eDismax instead of "plain old" Dismax; see https://vufind.org/jira/browse/VUFIND-935.
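
For anyone who wants to experiment in the meantime, a quick side-by-side comparison on a test core might look something like the sketch below (the core URL, the query string and the qf boosts are placeholders for illustration, not VuFind's actual search specs):

<?php
// Rough comparison of the dismax and edismax query parsers on the same query.
// Assumptions: a local "biblio" core on port 8082; the qf boosts are made up.
$base = 'http://localhost:8082/solr/biblio/select?wt=json&rows=0'
    . '&qf=' . urlencode('title^500 author^300 allfields')
    . '&q=' . urlencode('history of philosophy');
foreach (['dismax', 'edismax'] as $parser) {
    $result = json_decode(file_get_contents($base . '&defType=' . $parser));
    printf("%-8s numFound=%d QTime=%d ms\n",
        $parser, $result->response->numFound, $result->responseHeader->QTime);
}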

- Demian

-----Original Message-----
From: Ere Maijala [mailto:[hidden email]]
Sent: Wednesday, February 24, 2016 2:10 AM
To: [hidden email]
Subject: Re: [VuFind-Tech] solr 5.4.1 performance tests with facets

...and now there's Solr 5.5.0 with at least one interesting item in the changes list:

* Uninverted field faceting is re-enabled, for higher performance on
    rarely changing indices

This might warrant a new round of testing.

--Ere

19.2.2016, 17.38, Demian Katz kirjoitti:

> Okay, I have finished my first round of testing:
>
>
> Solr 4 (no docvalues) Solr 5 (no docvalues) Solr 5 (docvalues)
> Run 1: total time 78852 71449 56496
> Run 1: max time 9759 8811 7463
> Run 1: avg time 129.27 117.13 92.62
> Run 2: total time 6396 7505 7272
> Run 2: max time 431 477 489
> Run 2: avg time 10.49 12.3 11.92
> Run 3: total time 13637 14110 12187
> Run 3: max time 2198 2386 827
> Run 3: avg time 22.36 23.13 19.98
> Run 4: total time 6643 8058 7493
> Run 4: max time 438 542 433
> Run 4: avg time 10.89 13.21 12.28
>
>
> This data is for running 610 different queries, all including facet
> parameters, against an index of over 1,000,000 items. Run 1 was
> executed immediately after rebooting the server, with no other processes running.
> Run 2 was executed immediately after Run 1, with no environmental
> changes made. Run 3 was executed immediately after restarting Solr.
> Run
> 4 was executed immediately after Run 3, with no environmental changes
> made. All times are in ms. I confirmed that all runs yielded identical
> numbers of results, so I'm confident that the only difference between
> each scenario was the execution time, not the output.
>
> As you can see, every scenario is showing the same basic pattern: the
> first run is very slow, since both Solr and OS-level caches have to be
> populated. The remaining runs are much faster, with Run 3 being slower
> than 2 and 4 since in this scenario, Solr has to rebuild its own caches.
> Runs 2 and 4 are very similar to one another in terms of performance.
>
> I've saved all of my indexes, so I can repeat these tests a few times
> if people would like to see how things average out across multiple
> executions... but for now I decided not to spend more time until after
> we've discussed this initial data set. I'm also happy to share more
> detailed spreadsheets if anyone wishes to study the results in a more
> granular fashion. But the bottom line is that I'm not seeing the
> order-of-magnitude performance changes described in SOLR-8096. I
> expected that the "solr5" results would be much worse than the "solr4"
> results, with "solr5-docvalues" results being somewhere in the middle.
> As you can see, that's not really the case. You might be able to
> extrapolate some patterns from this data, but they're not nearly as
> extreme as I had feared. Maybe that's due to a flaw in my test set,
> and I would welcome more data from anyone else willing to run similar
> experiments... but right now I'm not seeing anything too conclusive.
> On the one hand, this is a relief -- it seems to suggest that the Solr
> 5 upgrade is not going to be a the performance game-changer I had
> feared; on the other hand, it's frustrating, because I'd really like
> to see clearer cause-and-effect here!
>
> - Demian
>
> ----------------------------------------------------------------------
> --
> *From:* Demian Katz [[hidden email]]
> *Sent:* Thursday, February 18, 2016 3:46 PM
> *To:* Günter Hipler; [hidden email]
> *Subject:* Re: [VuFind-Tech] solr 5.4.1 performance tests with facets
>
> One more data point: I've confirmed that if I reboot my entire server
> and then repeat my test on the docvalues branch, I get very slow
> results, but subsequent results are quite fast. So this may be related
> to OS-level loading of docValues data files into cache. That might
> also suggest that with these changes in place, we're going to want to
> devote less memory to Java heap for Solr in order to free up more room
> for the OS to do its part. So I think for my next round of testing, I
> should capture data for at least four scenarios:
>
> 1.)Server just rebooted, first run.
>
> 2.)Second run without restarting Solr.
>
> 3.)First run after restarting Solr, but not rebooting server.
>
> 4.)Second run without restarting Solr.
>
> I would expect runs 2 and 4 to have similar characteristics. run 1 to
> be the slowest, and run 3 to be a little bit slower than 2 and 4. It
> will be interesting to see how all of those scenarios compare across
> solr 5 with docValues, solr 5 without docValues, and solr 4.
>
> More as soon as I manage to collect the relevant data!
>
> - Demian
>
> *From:*Demian Katz
> *Sent:* Thursday, February 18, 2016 3:38 PM
> *To:* Demian Katz; Günter Hipler; [hidden email]
> *Subject:* RE: solr 5.4.1 performance tests with facets
>
> My first round of testing is complete, and with puzzling results.
>
> First of all, there's one small bug in the code I posted earlier. the
> while loop in runQueries.php should include "$line = trim($line);" at
> the top to prevent trailing carriage returns from causing problems
> with executing queries and building well-formed CSV data. (If anybody
> else really wants this code, I can post the final form somewhere -
> just let me know).
>
> But more importantly, the first round of results seems to suggest that
> adding docValues makes things worse!
>
> Without docvalues, my first time running my 610 sample queries took
> 48.412 seconds, an average of ~79.364ms per query, with a maximum
> query time of 4.053s.
>
> With docvalues, my first time running the same queries took 55.847
> seconds, an average of ~91.552ms per query, with a maximum query time
> of 7.352s.
>
> That's obviously not the result I expected to see, since the index
> that was supposed to be faster was actually significantly slower.
> However, strangely, if I repeat my test, subsequent runs with the
> docvalues index are much faster (and of course, when I say "repeat the
> test," that includes restarting Solr to clear out any in-memory
> caching, so Solr caches don't explain the speed increase, unless they
> are more persistent than I thought they were; perhaps this is actually
> reflective of some OS-level file caching making the index file loading faster).
> Unfortunately, I didn't capture multiple runs with the non-docvalues
> index yet, since my time today was limited.
>
> Bottom line: my results are inconclusive and confusing. I think I need
> to try this again with a bigger data set and with more runs under more
> circumstances. I should try multiple runs both with and without
> restarting Solr. Perhaps I should also reboot my entire server between
> tests for a cleaner environment. I should also run the tests against
> the current Solr4 code for further comparison points. I'll try to do
> some of this tomorrow or next week. Suggestions for a better procedure
> would be welcomed!
>
> thanks,
>
> Demian
>
> *From:*Demian Katz [mailto:[hidden email]]
> *Sent:* Thursday, February 18, 2016 9:55 AM
> *To:* Günter Hipler; [hidden email]
> <mailto:[hidden email]>
> *Subject:* Re: [VuFind-Tech] solr 5.4.1 performance tests with facets
>
> Günter,
>
> Thanks again for sharing this. As it happens, I wasn't able to use
> these tools as-is, but they did help spark my thinking on the easiest
> approach for my own testing.
>
> I'll share the details in case anyone is interested.
>
>
> I decided to take more of a Unix pipeline approach - create one tool
> that extracts parameters from Solr logs, and another tool that takes
> parameters as input and produces CSV output containing key statistics.
> Thus, I can do something like this:
>
> php extractQueries.php < solr.log | grep "facet.field=" | sort | uniq
> | php runQueries.php > output.csv
>
> Simple but flexible!
>
> Here are my scripts:
>
> extractQueries.php:
>
> <?php
> /**
>  * Given a Solr log file (sent through STDIN), extract all parameters
>  * to STDOUT.
>  */
> while ($line = fgets(STDIN)) {
>     // The tenth whitespace-delimited token of a Solr request log line is
>     // expected to look like "params={...}"; strip the wrapper so only the
>     // raw query string is emitted.
>     $parts = explode(' ', $line);
>     $params = substr($parts[9], 8, strlen($parts[9]) - 9);
>     echo "$params\n";
> }
>
> runQueries.php:
>
> <?php
> /**
>  * Given the output of extractQueries.php (sent to STDIN), create a CSV
>  * file of results (sent to STDOUT).
>  */
> $base = "http://localhost:8082/solr/biblio/select?";
> fputcsv(STDOUT, ['input', 'success', 'time', 'matches']);
> while ($line = fgets(STDIN)) {
>     // Strip the trailing newline so the query URL and the CSV stay
>     // well-formed (the small fix noted above in this thread).
>     $line = trim($line);
>     $url = $base . $line;
>     $result = json_decode(file_get_contents($url));
>     // A response containing a QTime header counts as a successful query.
>     $success = isset($result->responseHeader->QTime);
>     $csv = $success
>         ? [$line, 'true', $result->responseHeader->QTime,
>            $result->response->numFound]
>         : [$line, 'false'];
>     fputcsv(STDOUT, $csv);
> }
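
As a rough illustration of how the totals, averages, and maxima quoted elsewhere in this thread could be derived from output.csv, here is a minimal summarizing sketch; the summarizeCsv.php name and its output format are illustrative assumptions, not part of the scripts above.

summarizeCsv.php (hypothetical):

<?php
/**
 * Illustrative sketch: read runQueries.php output from STDIN and print the
 * total, average, and maximum query time.
 */
fgetcsv(STDIN); // skip the 'input,success,time,matches' header row
$total = 0.0;
$max = 0.0;
$count = 0;
$failed = 0;
while (($row = fgetcsv(STDIN)) !== false) {
    if (!isset($row[2]) || $row[1] !== 'true') {
        $failed++; // blank line or query without a QTime in the response
        continue;
    }
    $time = (float)$row[2]; // QTime in milliseconds
    $total += $time;
    $max = max($max, $time);
    $count++;
}
printf("queries: %d (failed: %d)\n", $count, $failed);
printf("total: %.0f ms, avg: %.3f ms, max: %.0f ms\n", $total, $count ? $total / $count : 0, $max);

It would slot onto the end of the pipeline shown above, e.g. php extractQueries.php < solr.log | grep "facet.field=" | sort | uniq | php runQueries.php | php summarizeCsv.php.
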
>
> Right now I'm still in the process of setting up my first test index,
> so perhaps these will be refined a little as I start using them. but
> in any case, here's my intended procedure:
>
> 1.)Spin up standard VuFind test instance on solr5 branch
>
> 2.)Add an extra million records
>
> 3.)Run queries (I have a random sampling of about 600; hopefully
> that's reasonable) and save .csv file
>
> 4.)Shut down test instance
>
> 5.)Spin up standard VuFind test instance on solr5 branch with
> docValues changes merged in
>
> 6.)Repeat steps 2-4.
>
> 7.)Analyze .csv files
>
> Hopefully that will show us whether or not there's a significant
> measurable difference between the two configurations, even if some of
> the details aren't as perfectly scientific as they might be.
>
> I'll post results as soon as I have them. In the meantime, I'm open to
> suggestions for refining the process!
>
> Thanks,
>
> Demian
>
> *From:*Günter Hipler [mailto:[hidden email]]
> *Sent:* Wednesday, February 17, 2016 5:53 AM
> *To:* Demian Katz; [hidden email]
> <mailto:[hidden email]>
> *Subject:* Re: solr 5.4.1 performance tests with facets
>
> Demian,
>
> of course you can take a look at what I have done. I just pushed the
> scripts I used [1]. You can find a short description of the ideas in
> the README file.
>
> But here are some notes to keep in mind, and a little background information:
> - At first I wanted to use the ELK stack (especially Logstash and
> Elasticsearch) for this kind of log analysis, which I have had in mind
> for quite some time, and I thought this might be a good moment to use
> it. (We already use Elasticsearch for our linked-data project.)
> - Unfortunately I haven't found any preconfigured pipelines for Solr
> logs that I could use out of the box (which surprised me a lot). I
> stumbled upon something [2], but it didn't work out for my purposes.
> - Spending half a day on this is more than enough, and since I needed
> results for my questions about the performance of Solr 5.4 servers, I
> made a quick hack with Python and MongoDB (see the description in the
> README). Perhaps you can use it in a similar way yourself.
> - As I already mentioned in the dev call: the swissbib index schema is
> quite different from the VF2 schema, and we don't use the SolrMarc
> processing pipeline (which is why I have to adapt the string
> normalization for doc-values myself). What we call our
> "documentProcessing" is XSLT and Java based [4], and in the future I'm
> planning to combine the stream-based MetaFacture framework [5]
> (initially created by the German National Library) with our current
> procedures. MetaFacture-based workflows are already part of our
> linked-data project [6].
>
> So it's not that we don't want to share, but sometimes it is difficult
> to put things together for people not familiar with our work - I'm
> sorry for that.
>
> Very best wishes from Basel!
>
> Günter
>
>
>
> [1] https://github.com/linked-swissbib/utilities/commit/8cadbe26271ee5803c73f7b42efed17ffef061a6
> [2] http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/
> [3] https://github.com/swissbib/searchconf/blob/master/solr/bib/conf-solr.4.10.2/configs/solr.home/bib/conf/schema.xml
> [4] https://github.com/swissbib/content2SearchDocs
> [5] https://github.com/culturegraph/metafacture-core
> [6] https://github.com/linked-swissbib/mfWorkflows
>
> On 02/16/2016 05:36 PM, Demian Katz wrote:
>
>     Günter,
>
>     Thanks for sharing these results (and for your participation in
>     today's call). Out of curiosity, what process did you use to run
>     these tests? Is there any possibility that you might be able to
>     share data/scripts so that I can run the same tests on this end
>     against various configurations of Solr, or is your schema so heavily
>     customized that the queries/facet values would be meaningless to a
>     "stock" VuFind instance?
>
>     In any case, I understand if you are unable to share the data for
>     whatever reason - but it seemed worth asking in case it could save a
>     bit of time with my own testing over here!
>
>     thanks,
>
>     Demian
>
>     *From:*Günter Hipler [mailto:[hidden email]]
>     *Sent:* Tuesday, February 16, 2016 8:25 AM
>     *To:* Demian Katz; [hidden email]
>     <mailto:[hidden email]>
>     *Cc:* [hidden email] <mailto:[hidden email]>
>     *Subject:* solr 5.4.1 performance tests with facets
>
>     Hi
>
>     sorry for the delayed answer. Last week I made some performance
>     tests on our fresh Solr 5.4.1 index without doc-values for facets.
>
>     My results:
>     - it makes a huge difference whether or not you are running on SSDs.
>     With SSDs the performance is about the same as our production
>     index running version 4.10
>     - in detail:
>     -- I used only queries collected from the logs of our production
>     servers, and kept only facet queries with at least one facet field
>     -- total queries: 76656
>     qTime < 100 milliseconds:  73270
>     qTime > 500 milliseconds: 511
>     qTime > 1000 milliseconds: 325
>     qTime > 1500 milliseconds: 232
>     qTime > 2500 milliseconds: 197
>     qTime > 4000 milliseconds: 30
>     qTime longest 5724
>
>     For me these results are reasonable and more or less comparable with
>     our 4.10 Index.
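
Günter's figures came from his own Python/MongoDB scripts; purely as an illustration, a comparable breakdown could be computed from a runQueries.php CSV with a short sketch along these lines, where the thresholds and output format are assumptions rather than his actual code.

<?php
/**
 * Illustrative sketch only (not the Python/MongoDB analysis described above):
 * bucket the QTime column of a runQueries.php CSV, read from STDIN, into the
 * same thresholds reported above.
 */
$over = [500 => 0, 1000 => 0, 1500 => 0, 2500 => 0, 4000 => 0];
$under100 = 0;
$longest = 0.0;
$total = 0;
fgetcsv(STDIN); // skip the header row
while (($row = fgetcsv(STDIN)) !== false) {
    if (!isset($row[2]) || $row[1] !== 'true') {
        continue; // skip blank lines and failed queries
    }
    $time = (float)$row[2];
    $total++;
    $longest = max($longest, $time);
    if ($time < 100) {
        $under100++;
    }
    foreach (array_keys($over) as $threshold) {
        if ($time > $threshold) {
            $over[$threshold]++; // the "greater than" buckets overlap, as in the figures above
        }
    }
}
echo "total queries: $total\n";
echo "qTime < 100 milliseconds: $under100\n";
foreach ($over as $threshold => $n) {
    echo "qTime > $threshold milliseconds: $n\n";
}
echo "qTime longest: $longest\n";
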
>
>     But I don't think this is a reason to set the change to doc-values
>     aside. The reason I tested without doc-values: I would be glad to
>     postpone the adaptations for the moment because there is a lot of
>     other work to be done.
>
>     Demian, thanks for the link to
>     https://issues.apache.org/jira/browse/SOLR-8096.
>     I wasn't aware that the Solr team has this severe problem, which
>     still seems to be unsolved. From my point of view, they are losing
>     touch with the development of the underlying Lucene layer.
>
>     Hope we (swissbib) can take part in today's dev call.
>
>     Günter
>
>     On 02/02/2016 02:55 PM, Demian Katz wrote:
>
>         Thanks for all of the valuable input - I'll definitely mention
>         all of these points on today's call (and will do some catch-up
>         reading on ZF3/Expressive very soon so we can discuss that in
>         more detail next time around).
>
>         The Solr performance issue is very unfortunate - it seems that
>         docValues are the solution, but this doesn't really feel like a
>         solution so much as a workaround! I'm not seeing much activity
>         on the JIRA ticket about the performance problems
>         (https://issues.apache.org/jira/browse/SOLR-8096), so we may be
>         stuck having to do something.
>
>         I would be interested to hear what level of performance
>         improvements you achieve by switching to docValues on your test
>         instance - I think that it would be useful to at least confirm
>         the order of magnitude of improvement on a large-scale index. I
>         haven't had time to do that sort of performance testing on this
>         end yet.
>
>         Thanks again!
>
>         - Demian
>
>
>     --
>
>     UNIVERSITÄT BASEL
>
>     Universitätsbibliothek
>
>     Günter Hipler
>
>     Projekt swissbib
>
>     Schönbeinstrasse 18-20
>
>     4056 Basel, Schweiz
>
>     Tel.: +41 61 267 31 12
>
>     Fax: +41 61 267 31 03
>
>     [hidden email] <mailto:[hidden email]>
>
>     URL: www.swissbib.org <http://www.swissbib.org>
>

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland

Reply | Threaded
Open this post in threaded view
|

Re: solr 5.4.1 performance tests with facets

Demian Katz
In reply to this post by Ere Maijala

Okay, here is the table updated with results from test runs on Solr 5.5.0. It might be necessary to repeat each test several times and average the results to be confident that any patterns shown here are meaningful beyond normal server variation, and a larger data set would probably also help, but the bottom line is that the Solr 5 upgrade causes no shocking performance degradation in this test scenario. I'll be interested to hear what Günter finds if he repeats his test using the latest update to the solr5 branch, but I have a feeling we're going to discover that docValues are not crucial to performance at this stage, and that we may be able to proceed with a merge without them. (That is not to say we should drop the issue entirely, but I don't think it's a prerequisite to moving forward; unless we can find data proving that docValues really help us significantly, we may be better off without them in order to keep indexing a little simpler.)

All times are in milliseconds.

                    Solr 4          Solr 5.4.1      Solr 5.4.1      Solr 5.5.0      Solr 5.5.0
                    (no docvalues)  (no docvalues)  (docvalues)     (no docvalues)  (docvalues)
Run 1: total time   78852           71449           56496           61790           77422
Run 1: max time     9759            8811            7463            7840            9402
Run 1: avg time     129.27          117.13          92.62           110.1475        126.9213
Run 2: total time   6396            7505            7272            7564            7530
Run 2: max time     431             477             489             453             417
Run 2: avg time     10.49           12.3            11.92           12.4            12.344
Run 3: total time   13637           14110           12187           13457           11534
Run 3: max time     2198            2386            827             2066            852
Run 3: avg time     22.36           23.13           19.98           22.061          18.908
Run 4: total time   6643            8058            7493            7444            7478
Run 4: max time     438             542             433             435             502
Run 4: avg time     10.89           13.21           12.28           12.203          12.259

- Demian
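
A minimal sketch of the repeat-and-average idea mentioned above, assuming each repeated run of a single scenario has been saved as its own runQueries.php CSV (the run-*.csv file names are hypothetical):

<?php
/**
 * Illustrative sketch only: average total/maximum/mean query time across
 * several repeated runs of one scenario, each saved as a separate
 * runQueries.php CSV. The run-*.csv file names are hypothetical.
 */
$files = ['run-1.csv', 'run-2.csv', 'run-3.csv'];
$totals = [];
$maxima = [];
$averages = [];
foreach ($files as $file) {
    $handle = fopen($file, 'r');
    fgetcsv($handle); // skip the header row
    $total = 0.0;
    $max = 0.0;
    $count = 0;
    while (($row = fgetcsv($handle)) !== false) {
        if (!isset($row[2]) || $row[1] !== 'true') {
            continue; // skip blank lines and failed queries
        }
        $time = (float)$row[2];
        $total += $time;
        $max = max($max, $time);
        $count++;
    }
    fclose($handle);
    $totals[] = $total;
    $maxima[] = $max;
    $averages[] = $count ? $total / $count : 0;
}
printf(
    "over %d runs -- mean total: %.0f ms, mean max: %.0f ms, mean avg: %.3f ms\n",
    count($files),
    array_sum($totals) / count($files),
    array_sum($maxima) / count($files),
    array_sum($averages) / count($files)
);

Averaging the per-run totals, maxima, and means this way would make it easier to judge whether differences like those in the table are larger than normal server variation.
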

 

-----Original Message-----
From: Ere Maijala [mailto:[hidden email]]
Sent: Wednesday, February 24, 2016 2:10 AM
To: [hidden email]
Subject: Re: [VuFind-Tech] solr 5.4.1 performance tests with facets

...and now there's Solr 5.5.0 with at least one interesting item in the changes list:

* Uninverted field faceting is re-enabled, for higher performance on
  rarely changing indices

This might warrant a new round of testing.

--Ere

 
