Setting up VuFind for a library network, RecordManager, deduplication

Albert Kiener
Hello VuFind-tech, Hello Mr. Maijala,
First, a description of my goal and problems:
I have several questions about the setup and modifications needed to
represent a library network of 7 public libraries in a single VuFind
instance.
The end result should be a single Solr core containing the records of
all 7 libraries, about 700,000 records in total.
Because the holdings of public libraries overlap heavily, some form of
deduplication should be used to make searching more usable for the end
user. The RecordManager with its calculated mergeRecords seems like the
perfect solution for this problem.
The individual libraries must still be distinguishable, since each has
its own library cards and user base, and those users only want to use
and search their local library.

Question regarding VuFind:
1. How can I add a library selection to VuFind? Has it been done
before, and are there examples of libraries using a similar system?
The selection should apply to search as well as to account login and
account management. I am using a custom ILS driver for the backend, so
there should be no problem there. My current idea is to use a dropdown
with the 7 libraries and store the currently selected library in the
session.

@EreMaijala: Question regarding the RecordManager, deduplication:
2. Because of the size of the records in MARCXML format (about 1.6 GB),
I have run into memory issues (error message in the PS). The
RecordManager actions I am trying to use are import.php,
manage.php --func=deduplicate and manage.php --func=updatesolr. I was
able to fix the memory problems on import by splitting the data into
chunks of between 5 and 20 MB. This method does not work, however, when
I try to deduplicate or update Solr. Is there a way to reduce the
memory consumption (e.g. by doing those tasks sequentially)? Have I
reached a limit of RecordManager, and should I try to optimize it?

3. Do you think that the deduplication algorithm is suitable for a
network of 7 German public libraries? What changes/optimizations would
you recommend?
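
The chunked import mentioned in question 2 could be sketched roughly as
follows. This is only an illustration in Python, not part of
RecordManager: the function names, the chunk size and the sample data
are made up, and it assumes the export is a MARCXML <collection> in the
standard MARC21 slim namespace.

```python
import io
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize MARC elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", MARC_NS)


def wrap_collection(records):
    """Wrap serialized <record> elements in a MARCXML <collection> element."""
    return f'<collection xmlns="{MARC_NS}">' + "".join(records) + "</collection>"


def split_marcxml(source, records_per_chunk):
    """Yield MARCXML collections holding at most records_per_chunk records.

    source is a file name or a binary file object. The input is streamed
    with iterparse, and each record's fields are released once serialized.
    (The emptied elements stay attached to the root, which is acceptable
    for a sketch but still costs a little memory per record.)
    """
    batch = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{{{MARC_NS}}}record":
            batch.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()
            if len(batch) == records_per_chunk:
                yield wrap_collection(batch)
                batch = []
    if batch:
        yield wrap_collection(batch)


# Tiny made-up sample: five records, split into chunks of two.
sample = (
    f'<collection xmlns="{MARC_NS}">'
    + "".join(
        f'<record><controlfield tag="001">{n}</controlfield></record>'
        for n in range(5)
    )
    + "</collection>"
).encode("utf-8")

chunks = list(split_marcxml(io.BytesIO(sample), records_per_chunk=2))
print([chunk.count("<record") for chunk in chunks])
```

Each yielded string can be written to its own file and imported
separately; splitting by record count rather than by byte size keeps
every chunk a well-formed collection.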

Best regards,
Albert Kiener

PS. For reference on question 2, I have included the error message I
receive when trying to export about 84 MB of deduplicated records from
MongoDB to Solr (manage.php --func=updatesolr):

PHP Fatal error:  Allowed memory size of 2097152 bytes exhausted (tried to allocate 159744 bytes) in C:\RecordManager\classes\SolrUpdater.php on line 2402
PHP Stack trace:
PHP   1. {main}() C:\RecordManager\manage.php:0
PHP   2. main() C:\RecordManager\manage.php:186
PHP   3. RecordManager->updateSolrIndex() C:\RecordManager\manage.php:104
PHP   4. SolrUpdater->updateRecords() C:\RecordManager\classes\RecordManager.php:493
PHP   5. SolrUpdater->processMerged() C:\RecordManager\classes\SolrUpdater.php:445
PHP   6. SolrUpdater->bufferedUpdate() C:\RecordManager\classes\SolrUpdater.php:1054

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: Setting up VuFind for a library network, RecordManager, deduplication

Ere Maijala
Hi Albert,

1. Maybe others can answer this one better, but I'll just say what I
know: I'm not aware of an implementation with a global library
selection dropdown. We have multiple VuFind UIs for different
organisations so that other aspects of them can be customised as well,
and for shared views we offer the building facet and search tabs for
grouping results. We also use MultiILS authentication with the
MultiBackend driver, but it could also be used with your custom driver
if you implement the required methods (getLoginDrivers and
getDefaultLoginDriver, I believe). There's currently a strict check in
MultiILS.php that the ILS driver descends from MultiBackend, but it
could just as well be changed to an interface or method availability
check.
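
To illustrate what a method availability check means compared with a
class ancestry check, here is a small sketch. VuFind itself is PHP, so
this Python version only shows the idea; the two method names come from
the paragraph above, while the classes and the helper function are
hypothetical.

```python
REQUIRED_METHODS = ("getLoginDrivers", "getDefaultLoginDriver")


def supports_multi_login(driver):
    """Return True if the driver offers every method listed above."""
    return all(callable(getattr(driver, name, None)) for name in REQUIRED_METHODS)


class CustomNetworkDriver:
    """Hypothetical custom ILS driver: no special parent class, same methods."""

    def getLoginDrivers(self):
        return ["lib_a", "lib_b"]

    def getDefaultLoginDriver(self):
        return "lib_a"


class PlainDriver:
    """Hypothetical driver without the multi-login methods."""


print(supports_multi_login(CustomNetworkDriver()))
print(supports_multi_login(PlainDriver()))
```

With a check of this kind, a custom driver qualifies by what it can do
rather than by which class it extends.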

2. RecordManager rarely processes all the records as a single chunk. The
admittedly quite lazily written file import process is one exception,
but neither deduplication nor the Solr update process works that way.

Looking at your error message, it seems that you're running PHP with an
extremely low memory limit of 2 MB. That's far lower than any default
I've ever seen (32 MB and up), and for any serious use I'd recommend
increasing memory_limit to at least 512 MB. We run with a 4 GB limit to
be able to import large XML files because of the aforementioned lazy
import implementation. That said, I'm surprised you could import even a
5 MB file with this limit, so there might be something fishy here.
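
The numbers in the error message can be checked with a few lines of
arithmetic. The sketch below is only illustrative: the helper name is
made up, and it assumes PHP's usual shorthand where K, M and G are
powers of 1024.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def php_size_to_bytes(value):
    """Convert PHP shorthand such as '512M' or '4G' to a byte count."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)


reported_limit = 2097152  # "Allowed memory size" from the error message
failed_request = 159744   # "tried to allocate" from the error message

print(reported_limit / UNITS["M"])                   # limit in MB
print(failed_request / UNITS["K"])                   # failed allocation in KB
print(php_size_to_bytes("512M") // reported_limit)   # recommended / current
```

The reported limit works out to exactly 2 MB, so the recommended 512 MB
is 256 times larger. Since the PHP command line can read a different
php.ini than the web server does, running "php --ini" shows which file
is in effect, and "php -d memory_limit=512M manage.php ..." overrides
the value for a single run.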

3. I don't think I can answer that. We use the algorithm across 115
databases, so it should scale pretty well, but since there are
differences in cataloguing rules, you'll need to make the judgement by
checking the results. You can, for instance, run deduplication for a
single record with verbose output to see how it's processed like this:

     php manage.php --func=deduplicate --single=<record_id> --verbose

We currently run Percona Server for MongoDB with the WiredTiger engine
as the data store. It seems to provide a pretty good throughput to
RecordManager tasks.

Hope this helps!

Regards,
Ere

On 25.6.2017 at 17:00, Albert Kiener wrote:

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Setting up VuFind for a library network, RecordManager, deduplication

Demian Katz
In reply to this post by Albert Kiener
Albert,

I think Ere has answered most of your questions pretty well... but regarding the library selection drop-down, one approach I have seen is this:

1.) Using VuFind's multi-site capabilities (see https://vufind.org/wiki/installation:installing_multiple_instances), create several VuFind instances, each with a separate configuration directory and URL.

2.) Using the [Parent_Config] feature of VuFind's config files, all of these instances can inherit from a shared common set of configurations, overriding just a few settings (like using a library-specific theme, connecting to a different ILS driver, applying special default Solr filters, etc.).

3.) With this setup, you can have a separate base URL for each library's VuFind instance (and perhaps a separate URL for a global union catalog). All of these can share the same Solr index and VuFind code, but each can have its own theming and configuration. In each custom theme, you can implement a simple library drop-down that redirects the user to the base URL of the selected library.
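
The inheritance and redirect ideas in the three steps above can be sketched in a few lines. This is not VuFind's actual configuration loader; it is a Python illustration that assumes child settings override parent settings one by one within each section, and every section name, setting name and URL below is a placeholder.

```python
def effective_config(parent, child):
    """Merge child sections over parent sections, setting by setting."""
    merged = {section: dict(settings) for section, settings in parent.items()}
    for section, settings in child.items():
        merged.setdefault(section, {}).update(settings)
    return merged


# Shared settings inherited by every instance (placeholder names).
shared = {
    "Site": {"theme": "network", "url": "https://catalog.example.org"},
    "Catalog": {"driver": "CustomIls"},
}
# A single library overrides only what differs.
lib_a = {
    "Site": {"theme": "lib_a", "url": "https://lib-a.example.org"},
    "HiddenFilters": {"building": "Lib_a"},
}

# The drop-down only needs a mapping from library code to base URL.
LIBRARY_URLS = {
    "union": "https://catalog.example.org",
    "lib_a": "https://lib-a.example.org",
    "lib_b": "https://lib-b.example.org",
}


def redirect_target(library_code):
    """Base URL for the selected library, falling back to the union catalog."""
    return LIBRARY_URLS.get(library_code, LIBRARY_URLS["union"])


config_a = effective_config(shared, lib_a)
print(config_a["Site"]["theme"], config_a["Catalog"]["driver"])
print(redirect_target("lib_b"))
```

Because the selection is carried by the URL rather than by the session, each library's instance can be bookmarked and linked to directly.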

I'm not sure if that's exactly what you had in mind, but it's one general approach to the problem.

Let me know if you have further questions, problems or concerns!

- Demian


Re: Setting up VuFind for a library network, RecordManager, deduplication

Uwe Reh
Hi Albert,

Our HDS (hds.hebis.de) uses just one index for all (~50) member
libraries of our consortium.
The installations that are already running are built as Demian
described: all the same, but with different sets of configuration.
Mainly, each installation contains a Solr filter query that restricts
the results to the library's own holdings.

This basic setup is quite easy to implement. But your question implies
a problem that is considerably harder to solve: searching on items.

Assume:
* two libraries: Lib_a and Lib_b
* two books: Book_1 (title:foo) and Book_2 (title:bar)
* Lib_a has an item of Book_1 (shelfmark:abc)
* Lib_a has an item of Book_2 (shelfmark:def)
* Lib_b has an item of Book_1 (shelfmark:def)
* Lib_b has an item of Book_2 (shelfmark:ghi)

Now assume an installation restricted to books from Lib_a and the
following query: "find title:foo AND shelfmark:def".
Even though there is no real match, VuFind's index will find Book_1,
because the Book_1 document also carries Lib_b's shelfmark:def.
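
The false match can be reproduced with a toy model of the index. This
is only a sketch in Python, not Solr: it assumes one merged document
per title with multi-valued fields, and the field name "building" for
the owning library is an assumption.

```python
# One merged document per title; item data from both libraries ends up
# in the same multi-valued fields.
book_1 = {
    "title": ["foo"],
    "building": ["Lib_a", "Lib_b"],
    "shelfmark": ["abc", "def"],  # abc belongs to Lib_a, def to Lib_b
}
book_2 = {
    "title": ["bar"],
    "building": ["Lib_a", "Lib_b"],
    "shelfmark": ["def", "ghi"],  # def belongs to Lib_a, ghi to Lib_b
}


def matches(doc, **terms):
    """True if every field:value term occurs somewhere in the document."""
    return all(value in doc.get(field, []) for field, value in terms.items())


def search(docs, **terms):
    """Titles of all documents that satisfy every term."""
    return [doc["title"][0] for doc in docs if matches(doc, **terms)]


# Filter on Lib_a, then ask for title:foo AND shelfmark:def.
hits = search([book_1, book_2], building="Lib_a", title="foo", shelfmark="def")
print(hits)
```

Book_1 is returned although Lib_a's copy of it has shelfmark abc: each
term matches somewhere in the document, but not on the same item.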

This is a known problem of search engines like Solr/Lucene, and some
workarounds exist. The best may be nested child documents:
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers?focusedCommentId=64553628#UploadingDatawithIndexHandlers-NestedChildDocuments
We tried to solve this problem with dynamic fields, like 'shelfmark_*'.
This makes it possible to distinguish, within the document of Book_1,
between shelfmark_Lib_a:abc and shelfmark_Lib_b:def. For example, again
in the installation of Lib_a, the query "find title:foo AND
shelfmark_Lib_a:def" will correctly find nothing.
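
The effect of per-library dynamic fields can be shown with the same
kind of toy model. Again this is a Python sketch rather than Solr, and
the helper names are made up; only the 'shelfmark_<library>' field
naming follows the description above.

```python
def build_document(title, items):
    """Build one merged document with a separate shelfmark field per library.

    items is a list of (library, shelfmark) pairs for this title.
    """
    doc = {"title": [title], "building": []}
    for library, shelfmark in items:
        doc["building"].append(library)
        doc.setdefault(f"shelfmark_{library}", []).append(shelfmark)
    return doc


def matches(doc, **terms):
    """True if every field:value term occurs somewhere in the document."""
    return all(value in doc.get(field, []) for field, value in terms.items())


book_1 = build_document("foo", [("Lib_a", "abc"), ("Lib_b", "def")])

# In Lib_a's installation the shelfmark search uses Lib_a's own field.
print(matches(book_1, title="foo", shelfmark_Lib_a="def"))
print(matches(book_1, title="foo", shelfmark_Lib_a="abc"))
```

The first query no longer matches, while the second, which names
Lib_a's real shelfmark, still does. The price is that every
installation has to rewrite item-level searches to its own field names.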


Conclusion:
- Having one index for more than one installation is well supported and
widely used.
- Distinguishing item-level attributes as well, such as shelfmarks,
local topics, provenance information, etc., is not supported. In this
case, you need to create your own index schema.

Uwe


On 26.06.2017 at 16:31, Demian Katz wrote:
