Tika app hangs


Tika app hangs

Library

Hello,

 

We are experiencing the following issue with VuFind 3.0.3 on a Windows 2012 R2 server with Tika app v. 1.14 (latest). When importing records that have full-text documents, Apache Tika hangs often and at random. It simply appears to be frozen (an infinite loop, perhaps?). No error message is given, and the temporary file remains in the /tmp folder until manually deleted.

There is plenty of server memory and CPU available, and the server has never been under stress. I have also raised the PHP settings (file size, memory, timeout…) without it making any difference. That’s why I think the issue must be related to Java/Tika.

 

We have checked whether it was related to specific files (file type, length…), without success. It seems that, while processing, it sometimes freezes for no apparent reason. Has anyone experienced this issue before?

 

Best regards,

 

---------------------------------

Xavier Berdaguer

Information Specialist

NATO STO CMRE

Technical Library

Viale S. Bartolomeo, 400

19126 La Spezia, Italy

Phone: +390187527361

[hidden email]

www.cmre.nato.int

 


------------------------------------------------------------------------------
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi
_______________________________________________
VuFind-General mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-general

Re: Tika app hangs

Demian Katz

I have not heard reports from others of this specific problem.


What kind of records are you importing when this occurs? Are you loading MARC records with links in the 856 field, or are you loading some sort of XML file? Either way, it might be useful to add some debugging logic to the indexing code so you can determine whether the hang is happening inside Tika, or if there is some problem with processing the output of Tika in the subsequent indexing code.


Another thing that might be interesting to try would be collecting a list of URLs and writing a stand-alone script to call Tika against all of them. This might help narrow down whether the problem is confined to Tika itself (in which case reaching out to the Tika community might be helpful) or whether it's some specific interaction between Tika and the indexing logic.
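
A stand-alone test of this kind could be sketched in Python roughly as follows. This is only an illustration, not part of VuFind: the jar path, the urls.txt input file and the 300-second limit are assumptions, and the helper names are hypothetical. It calls the Tika app with its --text option for each URL and kills any run that exceeds the limit, so a hang shows up as a "timeout" line instead of a frozen import.

```python
import subprocess
import sys

TIKA_JAR = "tika-app-1.14.jar"   # assumed path to the Tika app jar
TIMEOUT_SECONDS = 300            # assumed per-document time limit


def run_with_timeout(cmd, timeout):
    """Run cmd and return ("ok" | "error" | "timeout", captured stdout)."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout", b""
    status = "ok" if result.returncode == 0 else "error"
    return status, result.stdout


def tika_command(url, jar=TIKA_JAR):
    """Build the command line that asks the Tika app for plain text."""
    return ["java", "-jar", jar, "--text", url]


def check_urls(urls, timeout=TIMEOUT_SECONDS):
    """Run Tika against every URL and report which ones hang or fail."""
    report = {}
    for url in urls:
        status, _ = run_with_timeout(tika_command(url), timeout)
        report[url] = status
        print(f"{status:8} {url}")
    return report


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_tika.py urls.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        check_urls([line.strip() for line in handle if line.strip()])
```

If every URL finishes under the limit here but the import still freezes, that would point away from Tika itself and toward the interaction with the indexing code.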


Do you have access to a Linux server? Since Windows is a much less frequently used platform for VuFind, it's possible that others have not seen this issue because it is somehow Windows-specific. If you can run a test index on a Linux box, that might also be an interesting test.


Finally, where are the files you are indexing stored? Is it possible that the problem is not related to Tika or the indexer at all, but instead that HTTP connections are intermittently hanging on the content server?
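
One way to test the content server on its own, without Tika involved, is to download each file with a timeout and record how long it takes. The sketch below assumes plain HTTP(S) URLs and a 30-second socket timeout; the function name probe and its opener parameter are hypothetical and exist only to keep the example self-contained.

```python
import socket
import time
import urllib.error
import urllib.request


def probe(url, timeout=30, opener=urllib.request.urlopen):
    """Download url completely; return (status, seconds, bytes_read)."""
    started = time.monotonic()
    size = 0
    try:
        with opener(url, timeout=timeout) as response:
            while True:
                chunk = response.read(1024 * 1024)  # read 1 MB at a time
                if not chunk:
                    break
                size += len(chunk)
        status = "ok"
    except socket.timeout:
        # The server stopped sending data in the middle of the transfer.
        status = "timeout"
    except urllib.error.URLError as exc:
        # A timeout while connecting is wrapped in URLError.
        stalled = isinstance(exc.reason, socket.timeout)
        status = "timeout" if stalled else f"error: {exc.reason}"
    except OSError as exc:
        status = f"error: {exc}"
    return status, time.monotonic() - started, size
```

Running probe over the same list of URLs used for the import and looking for "timeout" results would show whether the connections themselves stall intermittently.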


Anyway, sorry for not having a more precise answer, but I hope that some of these ideas might be useful in tracking down the problem. If you are still stuck, please let me know your findings and I'll see if I can offer some additional ideas.


Good luck!


- Demian




Re: Tika app hangs

Library

Well, all the data is now imported, so let me detail what we found out in the process, for the benefit of the community:

 

-          Objective: import about 95,000 records (XML files) into VuFind. Many of these records have links to full text (some records have multiple links, as our URL field is multivalued). Also, while most of the linked files are in PDF format, many are huge (many are bigger than 50 MB; the biggest is 700 MB…).

-          Server: Windows 2012 R2 with VuFind 3.0.3, PHP 7, Apache 2.4, Tika app 1.14. Hardware is 8 CPUs at 2.13 GHz with 16 GB of RAM.

 

Lessons learnt:

 

-          The original XML file was too big for processing. We split the records into separate files no bigger than 10 MB each, to be sure they do not make Tika hang.

-          The following PHP settings (php.ini) were set to avoid fatal errors during the import, as the PDFs were so big:

memory_limit = 1024M

max_execution_time = 1800

max_input_time = 240

-          Windows certainly has issues managing as many Java processes at once as were running during the import. We monitored server activity during the import with Process Explorer (http://www.sysinternals.com/) and found hangs caused by peaks of Java/Tika activity when indexing the full text. We will increase the RAM and see if it helps. Also note that Process Explorer let us raise the priority of the PHP/Java processes.
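
For readers who want to reproduce the splitting step from the first lesson, a minimal Python sketch follows. It is not the tool used in this thread: it assumes the source is a single wrapper element containing repeated record elements, with no XML namespaces and no attributes on the wrapper that need preserving. The element name "record" and the part file names are placeholders.

```python
import os
import xml.etree.ElementTree as ET


def split_records(source, out_dir, record_tag="record",
                  max_bytes=10 * 1024 * 1024):
    """Split source into files of about max_bytes; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths, batch, batch_size = [], [], 0
    root_tag = None

    def flush():
        # Write the buffered records to the next part file.
        nonlocal batch, batch_size
        if not batch:
            return
        path = os.path.join(out_dir, f"part{len(paths) + 1:04d}.xml")
        with open(path, "wb") as out:
            out.write(f"<{root_tag}>\n".encode("utf-8"))
            out.writelines(batch)
            out.write(f"</{root_tag}>\n".encode("utf-8"))
        paths.append(path)
        batch, batch_size = [], 0

    # iterparse streams the file, so the whole document is never loaded.
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root_tag is None:
                root_tag = elem.tag  # the first element seen is the wrapper
            continue
        if elem.tag == record_tag:
            data = ET.tostring(elem, encoding="unicode").encode("utf-8")
            if batch and batch_size + len(data) > max_bytes:
                flush()
            batch.append(data)
            batch_size += len(data)
            elem.clear()  # release the memory held by the finished record
    flush()
    return paths
```

A single record larger than max_bytes still ends up in a file of its own, so the limit is approximate rather than strict.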

 

Anyway, the total time to import and index everything is huge: leaving out the time we had to stop because of various issues, I would say that about five days of actual processing were needed. We also hope that the upcoming Tika app 1.15 may help, as it is described as improving the indexing of big files.

 

Best regards,

 

Xavier

 


Re: Tika app hangs

Demian Katz

Xavier,

 

I’m glad to hear that you have successfully indexed your records, though sorry that it is proving to take so long. Are you using a custom XSLT with VuFind’s import-xsl.php script? If so, it is certainly possible that the PHP portion of the process is a bottleneck too – this is a fairly simple tool and has never been optimized for speed.

 

Quite some time ago, there was a JIRA ticket where some performance issues were discussed; not sure if much of this remains relevant, but here it is for your reference:

 

https://vufind.org/jira/browse/VUFIND-926

 

Let me know if there’s anything more I can do to help!

 

- Demian

 


Re: Tika app hangs

Andrew Krause
In reply to this post by Library


I'd strongly encourage anyone making use of import-xsl.php to follow Xavier's advice and split the XML files.  I'd say anything below 50MB should be safe from memory issues, other than those that can be overcome with php.ini tweaks.  I may try splitting at 10, 50 and 100MB to see if there is any overall performance difference.  I've never really found the upper limit I can get away with.  I do know anything around 1GB will always fail on my test server.  Then again, my test server is an old Core 2 Quad desktop with 8GB of RAM that another local library was throwing out.


While our indexing situation is much different than Xavier's (just bog standard public library MARC in XML), I am getting rather good performance with import-xsl.php as it is.  Originally it was on par with SolrMarc but it is a little slower now.  The slowdown is really due to my misuse of XSLT as a language.  I have quite a few loops to clean up various strings before they are assigned to the fields.  I am getting about 600,000 records imported in under 10 minutes.  The nightly updates to keep VuFind in sync take 30 seconds at most.



Andrew Krause

Web Developer

[hidden email] • p: (847) 923-3224

[hidden email]



Re: Tika app hangs

Demian Katz

Andrew,


Thanks for sharing your experience. I'm glad to hear that import-xsl.php is working reasonably well for you. However, I'm curious why you chose it over SolrMarc. It might be worth giving SolrMarc another look if you plan to upgrade (or already have upgraded) to VuFind 3.1.x. The new version is faster and more powerful than the old one, and it might be to your advantage to give it another look.


- Demian




Re: Tika app hangs

Andrew Krause

Demian,

 

VuFind 3.0.x SolrMarc: 45+ minutes for a full index

VuFind 3.1.x SolrMarc: under 8 minutes for a full index

 

It was so much faster that I thought something wasn’t working correctly. Now that I have switched to XSLT, my goal is to keep the index time about the same.

 

In another thread, I’ve touched on having a lot of XSLT to draw from and a tool that can push the XML reports from our Symphony ILS to VuFind without any real effort on my part.

 

With the fairly strict formatting for author/title and using GMDs/ItemCategories for format, I was able to get grouping working fairly well without major changes to VuFind itself:  https://s30.postimg.org/rsfv01gbl/grouping.png

All labels are clickable for the various formats.  Not having different formats of the same work repeated over and over really cleans up the search results.  I could see this done with some SolrMarc customizations.

 

Our XSLT is set up to add boosts to every title as it gets imported.  Currently the boost is set up as: [NumberOfCopies].[NumberOfBranchesWithTitle][TotalCirculation]   
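
As a rough illustration of the boost layout described above, here is a minimal Python sketch (the real logic lives in XSLT, which is not shown in this thread). The function name `build_boost` and the plain, unpadded concatenation of the two fractional parts are assumptions; the message does not say whether the components are zero-padded.

```python
def build_boost(copies: int, branches: int, circulation: int) -> str:
    """Return a boost string shaped like [copies].[branches][circulation].

    Assumption: the parts are concatenated as plain decimal digits with
    no zero-padding, which the original message does not specify.
    """
    for name, value in (("copies", copies), ("branches", branches),
                        ("circulation", circulation)):
        if value < 0:
            raise ValueError(f"{name} must not be negative")
    return f"{copies}.{branches}{circulation}"


if __name__ == "__main__":
    # 12 copies held at 3 branches with 457 total checkouts
    print(build_boost(12, 3, 457))
```

With this layout the number of copies dominates the boost, and the branch count and circulation only break ties between titles with the same number of copies.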

 

Using boosts in this way gives substantially better search results than without (fairly old screenshot):  https://s29.postimg.org/bhfh6fvp3/boost.png

 

I think your message from 2012 best explains doing something similar with SolrMarc:  https://sourceforge.net/p/vufind/mailman/message/29805252/

 

Andrew Krause

Web Developer

[hidden email] • p: (847) 923-3224

 

 

 

From: Demian Katz [mailto:[hidden email]]
Sent: Friday, January 27, 2017 7:03 AM
To: Andrew Krause; [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

Andrew,

 

Thanks for sharing your experience. I'm glad to hear that import-xsl.php is working reasonably well for you. However, I'm curious why you chose it over SolrMarc. If you plan to upgrade (or already have upgraded) to VuFind 3.1.x, it might be to your advantage to give SolrMarc another look: the new version is faster and more powerful than the old one.

 

- Demian

 


From: Andrew Krause <[hidden email]>
Sent: Thursday, January 26, 2017 10:20 PM
To: [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

 

I'd strongly encourage anyone making use of import-xsl.php to follow Xavier's advice and split the XML files.  I'd say anything below 50MB should be safe from memory issues, other than those that can be overcome with php.ini tweaks.  I may try splitting at 10, 50 and 100 to see if there is any overall performance difference.  I've never really found the upper limit I can get away with.  I do know anything around 1GB will always fail on my test server.  Then again my test server is an old Core 2 Quad desktop with 8GB of RAM that another local library was throwing out.

 

While our indexing situation is much different than Xavier's (just bog standard public library MARC in XML), I am getting rather good performance with import-xsl.php as it is. Originally it was on par with SolrMarc but is a little slower now. The slowdown is really due to my misuse of XSLT as a language: I have quite a few loops to clean up various strings before they are assigned to the fields. I am getting about 600,000 records imported in under 10 minutes. The nightly updates to keep VuFind in sync take 30 seconds at most.

 

 

Andrew Krause

Web Developer

[hidden email] • p: (847) 923-3224


From: Demian Katz <[hidden email]>
Sent: Thursday, January 26, 2017 7:41:00 AM
To: Library; [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

Xavier,

 

I’m glad to hear that you have successfully indexed your records, though sorry that it is proving to take so long. Are you using a custom XSLT with VuFind’s import-xsl.php script? If so, it is certainly possible that the PHP portion of the process is a bottleneck too – this is a fairly simple tool and has never been optimized for speed.

 

Quite some time ago, there was a JIRA ticket where some performance issues were discussed; not sure if much of this remains relevant, but here it is for your reference:

 

https://vufind.org/jira/browse/VUFIND-926

 

Let me know if there’s anything more I can do to help!

 

- Demian

 

From: Library [[hidden email]]
Sent: Thursday, January 26, 2017 4:55 AM
To: Demian Katz; [hidden email]
Subject: RE: [VuFind-General] Tika app hangs

 

Well, all the data has been imported, so let me detail what we found out in the process for the benefit of the community:

- Objective: import about 95,000 records (XML files) into VuFind. Many of these records have links to full text (some records have multiple links, as our URL field is multivalued). Also, while most of the linked files are PDFs, many are huge (many are bigger than 50 MB; the biggest is 700 MB).

- Server: Windows 2012 R2 with VuFind 3.0.3, PHP 7, Apache 2.4, Tika app 1.14. Hardware is 8 CPUs at 2.13 GHz with 16 GB of RAM.

 

Lessons learnt:

- The original XML file was too big to process. We split the records into separate files no bigger than 10 MB each, to be sure they do not make Tika hang.

- The following PHP settings (php.ini) were set to avoid fatal errors during the import, as the PDFs were so big:

memory_limit = 1024M

max_execution_time = 1800

max_input_time = 240

- Windows certainly has trouble managing the many Java processes that run at once during the import. We monitored the server's activity during the import with Process Explorer (http://www.sysinternals.com/) and found hangs caused by peaks of Java/Tika activity while indexing the full text. We will increase the RAM and see if it helps. Process Explorer also let us raise the priority of the PHP/Java processes.

Anyway, the total time to import and index everything is huge: leaving out the time we had to stop because of various issues, I would say that roughly five days of real processing were needed. We also hope that the upcoming Tika app 1.15 may help, as it is described as improving the indexing of big files.
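
The file-splitting step from the lessons above could be sketched roughly as follows. This is not the script that was actually used (the thread does not show it); it is a minimal Python illustration that assumes every direct child of the root element is one record. The function name `split_xml`, the `chunk_0001.xml` naming scheme and the size estimate are all hypothetical choices.

```python
import os
import xml.etree.ElementTree as ET


def split_xml(source_path, out_dir, max_bytes=10 * 1024 * 1024):
    """Split one large XML file into chunks of roughly max_bytes each.

    Assumption: every direct child of the root element is one record.
    Returns the list of chunk file paths, in the order written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    chunk, size = [], 0
    root, root_tag, root_attrib = None, None, {}
    depth = 0

    def flush():
        nonlocal chunk, size
        if not chunk:
            return
        # Rebuild a root element with the same tag and attributes.
        new_root = ET.Element(root_tag, root_attrib)
        new_root.extend(chunk)
        path = os.path.join(out_dir, f"chunk_{len(written) + 1:04d}.xml")
        ET.ElementTree(new_root).write(
            path, encoding="utf-8", xml_declaration=True)
        written.append(path)
        chunk, size = [], 0

    # iterparse streams the file instead of loading it all at once.
    for event, elem in ET.iterparse(source_path, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root, root_tag = elem, elem.tag
                root_attrib = dict(elem.attrib)
        else:
            depth -= 1
            if depth == 1:  # a direct child of the root just closed
                record_size = len(ET.tostring(elem, encoding="utf-8"))
                if chunk and size + record_size > max_bytes:
                    flush()
                chunk.append(elem)
                size += record_size
                root.clear()  # detach processed records from the source tree
    flush()
    return written
```

A call such as `split_xml("all_records.xml", "chunks")` (both names are placeholders) would write the chunk files into `chunks/`. The size check is only an estimate based on each serialized record, and a single record larger than the limit still ends up in a chunk of its own.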

 

Best regards,

 

Xavier

 

From: Demian Katz [[hidden email]]
Sent: 13 January 2017 14:09
To: Library; [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

I have not heard reports from others of this specific problem.

 

What kind of records are you importing when this occurs? Are you loading MARC records with links in the 856 field, or are you loading some sort of XML file? Either way, it might be useful to add some debugging logic to the indexing code so you can determine whether the hang is happening inside Tika, or if there is some problem with processing the output of Tika in the subsequent indexing code.

 

Another thing that might be interesting to try would be collecting a list of URLs and writing a stand-alone script to call Tika against all of them. This might help narrow down whether the problem is confined to Tika itself (in which case reaching out to the Tika community might be helpful) or whether it's some specific interaction between Tika and the indexing logic.
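
A stand-alone check along these lines might look like the following Python sketch. It is only an illustration under stated assumptions: `java` is on the PATH, the jar path and the URL list file are placeholders you supply, and the 300-second timeout is an arbitrary value chosen to flag a hang. It uses the Tika app's `--text` option with a URL as the document argument.

```python
import subprocess
import sys
import time


def tika_command(url, jar_path="tika-app-1.14.jar"):
    """Build the command line for extracting plain text from one URL.

    jar_path is a placeholder; point it at your local Tika app jar.
    """
    return ["java", "-jar", jar_path, "--text", url]


def run_with_timeout(command, timeout_seconds):
    """Run a command and report (status, seconds_elapsed).

    status is "ok", "error" (non-zero exit) or "timeout" (killed as hung).
    """
    started = time.monotonic()
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        status = "timeout"
    return status, time.monotonic() - started


def check_urls(urls, jar_path, timeout_seconds=300):
    """Yield (url, status, seconds) for each URL, one Tika run at a time."""
    for url in urls:
        status, elapsed = run_with_timeout(
            tika_command(url, jar_path), timeout_seconds)
        yield url, status, elapsed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_tika.py path/to/tika-app.jar urls.txt
    jar, url_file = sys.argv[1], sys.argv[2]
    with open(url_file, encoding="utf-8") as handle:
        url_list = [line.strip() for line in handle if line.strip()]
    for url, status, elapsed in check_urls(url_list, jar):
        print(f"{status}\t{elapsed:.1f}s\t{url}")
```

Any URL reported as `timeout` is a candidate for a Tika-side hang that can be reproduced without VuFind in the loop; if every URL finishes here but the import still freezes, the interaction with the indexing code becomes the more likely suspect.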

 

Do you have access to a Linux server? Since Windows is a much less frequently used platform for VuFind, it's possible that others have not seen this issue because it is somehow Windows-specific. If you can run a test index on a Linux box, that might also be an interesting test.

 

Finally, where are the files you are indexing stored? Is it possible that the problem is not related to Tika or the indexer at all, but instead that HTTP connections are intermittently hanging on the content server?

 

Anyway, sorry for not having a more precise answer, but I hope that some of these ideas might be useful in tracking down the problem. If you are still stuck, please let me know your findings and I'll see if I can offer some additional ideas.

 

Good luck!

 

- Demian

 



Re: Tika app hangs

Demian Katz

The speed improvements really are impressive -- and if your 8-minute number is done with a single thread, you can probably cut it down even further by turning on the optional multi-threading. But, of course, if XSLT fits your workflow better, that's definitely an important consideration. In any case, what you've done looks quite impressive. Congratulations, and let me know if I can be of any assistance with SolrMarc customization if at some point you decide to try to fit it in.


- Demian




Re: Tika app hangs

Luke O'Sullivan
Hi Andrew,

We currently index around 25,000 records via the VuFind XML importer, and it takes around 45 minutes to an hour.

If your routine is quicker than that, could you share any optimizations you currently use?

Kind Regards,
Luke




Re: Tika app hangs

Andrew Krause
In reply to this post by Library

Hi Luke,

 

Our properties file is empty other than specifying the stylesheet.  We are using the batch-import-xsl script in the /vufind/harvest directory.  A full index is about 22 million lines of XML and 1GB for 600,000 records. 

 

Just to be clear the times I give exclude getting the XML data from the various sources.  If I include the ILS side of things, I’m sure our times would be closer to yours.  Then again, I don’t actually do full extracts from the ILS each time I reindex.  I typically do the initial full index files plus all the daily extracts. 

 

If I only grab the data I need from the ILS, I do get a speedup when importing into VuFind.  I think the extracts could be 1/3 larger if I didn’t tune the report that does the extract.

 

Using templates to assign variables can greatly reduce code duplication and complexity.  This does appear to have a performance hit though.  It is minor overall and well worth it.

 

Avoid using templates to create the equivalent of loops we’d see in most languages.  This ends up being very expensive.  If you use them, think of the max number of loops you might experience.  Might be worth testing the slowdown from each one you add as well.  I know mine are costing me a few minutes.

 

Don’t duplicate complex string manipulations if you are going to reuse the same value in multiple fields.  Either create a variable holding the already processed string or, better yet, use a copyField in the Solr schema.  I typically use a variable since the string may need further manipulation for other uses.

 

While it doesn’t help the overall performance of indexing, I don’t do deletions.  All of our weeded titles are put in a temporary location of “Discard” in the ILS.  The nightly index catches this change and will not mark the title as “Visible” in a custom Solr field.  If any copy of a title is in a location that is not in a “ShadowedList” variable, the title is marked as “Visible”.  Any title not marked “Visible” is hidden from the end user.  I can’t seem to find the thread that details this technique.  It saves a lot of extra indexing effort.
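As an illustration of the visibility rule described above, here is a minimal sketch in Python. It only mirrors the logic and is not the XSLT actually used; the function name and the sample location codes are assumptions made for the example.

```python
def is_visible(copy_locations, shadowed_list):
    """Return True if at least one copy sits in a location that is
    not in the shadowed list; otherwise the title stays hidden."""
    return any(loc not in shadowed_list for loc in copy_locations)


# Hypothetical location codes, for illustration only.
shadowed = {"Discard"}
print(is_visible(["Discard", "Adult Fiction"], shadowed))  # True
print(is_visible(["Discard"], shadowed))                   # False
```

A title with no copies at all comes out as hidden under this sketch, which seems to match the intent of never marking it “Visible”.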

 

Once I have the XSLT to my liking, I don’t plan on reindexing for quite some time.  The nightly updates are at a max of 5MB and are keeping things in sync.  I may add weekly extracts to make sure nothing is missed.

 

If you’d like I can take a look at your XSLT. 

 

Andrew Krause

Web Developer

[hidden email] • p: (847) 923-3224

 


 

From: Osullivan L. [mailto:[hidden email]]
Sent: Friday, January 27, 2017 1:28 PM
To: Demian Katz; Andrew Krause; [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

Hi Andrew,

 

We currently index around 25,000 records via the VuFind XML importer and it takes around 45mins to an hour.

 

If your routine is quicker than that, could you share any optimizations you currently use?

 

Kind Regards,

Luke

 

 


 

----- Reply message -----
From: "Demian Katz" <[hidden email]>
To: "Andrew Krause" <[hidden email]>, "[hidden email]" <[hidden email]>
Subject: [VuFind-General] Tika app hangs
Date: Fri, Jan 27, 2017 18:16

 

The speed improvements really are impressive -- and if your 8 minute number is done with a single thread, you can probably cut it even further down by turning on the optional multi-threading. But, of course, if XSLT fits your workflow better, that's definitely an important consideration. In any case, what you've done looks quite impressive. Congratulations, and let me know if I can be of any assistance with SolrMarc customization if at some point you decide to try to fit it in.

 

- Demian

 


From: Andrew Krause <[hidden email]>
Sent: Friday, January 27, 2017 12:21 PM
To: [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

Demian,

 

VuFind 3.0.x SolrMarc:  45+ minutes for a full index

VuFind 3.1.x SolrMarc:  under 8 minutes for a full index

 

It was so much faster I thought something wasn’t working correctly.  Now that I switched to XSLT, my goal is to keep index time about the same.

 

In another thread, I’ve touched on having a lot of XSLT to draw from and a tool that can push the XML reports from our Symphony ILS to VuFind without any real effort on my part.

 

With the fairly strict formatting for author/title and using GMDs/ItemCategories for format, I was able to get grouping working fairly well without major changes to VuFind itself:  https://s30.postimg.org/rsfv01gbl/grouping.png

All labels are clickable for the various formats.  Not having different formats of the same work repeated over and over really cleans up the search results.  I could see this done with some SolrMarc customizations.

 

Our XSLT is set up to add boosts to every title as it gets imported.  Currently the boost is set up as: [NumberOfCopies].[NumberOfBranchesWithTitle][TotalCirculation]   
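To make the boost pattern concrete, here is a small Python sketch. It assumes the pattern is read literally, as the three counts concatenated around a decimal point; that reading, the function name, and the sample numbers are all assumptions, and the real XSLT may build the value differently.

```python
def boost_value(copies, branches, circulation):
    """Join the three counts as [copies].[branches][circulation].
    Literal reading of the pattern; an assumption, not the actual XSLT."""
    return f"{copies}.{branches}{circulation}"


# Hypothetical title: 4 copies, held at 2 branches, 137 total checkouts.
print(boost_value(4, 2, 137))  # prints 4.2137
```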

 

Using boosts in this way gives substantially better search results than without (fairly old screenshot):  https://s29.postimg.org/bhfh6fvp3/boost.png

 

I think your message from 2012 best explains doing something similar with SolrMarc:  https://sourceforge.net/p/vufind/mailman/message/29805252/

 

Andrew Krause

Web Developer

[hidden email] • p: (847) 923-3224

 

 

 

From: Demian Katz [[hidden email]]
Sent: Friday, January 27, 2017 7:03 AM
To: Andrew Krause; [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

Andrew,

 

Thanks for sharing your experience. I'm glad to hear that import-xsl.php is working reasonably well for you. However, I'm curious why you chose it over SolrMarc. It might be worth giving SolrMarc another look if you plan to upgrade (or already have upgraded) to VuFind 3.1.x. The new version is faster and more powerful than the old one, and it might be to your advantage to give it another look.

 

- Demian

 


From: Andrew Krause <[hidden email]>
Sent: Thursday, January 26, 2017 10:20 PM
To: [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

 

I'd strongly encourage anyone making use of import-xsl.php to follow Xavier's advice and split the XML files.  I'd say anything below 50MB should be safe from memory issues, other than those that can be overcome with php.ini tweaks.  I may try splitting at 10, 50 and 100MB to see if there is any overall performance difference.  I've never really found the upper limit I can get away with.  I do know anything around 1GB will always fail on my test server.  Then again, my test server is an old Core 2 Quad desktop with 8GB of RAM that another local library was throwing out.

 

While our indexing situation is much different than Xavier's (just bog standard public library MARC in XML), I am getting rather good performance with import-xsl.php as it is.  Originally it was on par with SolrMarc but it is a little slower now.  The slowdown is really due to my misuse of XSLT as a language.  I have quite a few loops to clean up various strings before they are assigned to the fields.  I am getting about 600,000 records imported in under 10 minutes.  The nightly updates to keep VuFind in sync take 30 seconds at most.

 

 

Andrew Krause

Web Developer

[hidden email] • p: (847) 923-3224


From: Demian Katz <[hidden email]>
Sent: Thursday, January 26, 2017 7:41:00 AM
To: Library; [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

Xavier,

 

I’m glad to hear that you have successfully indexed your records, though sorry that it is proving to take so long. Are you using a custom XSLT with VuFind’s import-xsl.php script? If so, it is certainly possible that the PHP portion of the process is a bottleneck too – this is a fairly simple tool and has never been optimized for speed.

 

Quite some time ago, there was a JIRA ticket where some performance issues were discussed; not sure if much of this remains relevant, but here it is for your reference:

 

https://vufind.org/jira/browse/VUFIND-926

 

Let me know if there’s anything more I can do to help!

 

- Demian

 

From: Library [[hidden email]]
Sent: Thursday, January 26, 2017 4:55 AM
To: Demian Katz; [hidden email]
Subject: RE: [VuFind-General] Tika app hangs

 

Well, all the data has been imported, so let me detail what we found out in the process, for the benefit of the community:

 

- Objective: import about 95,000 records (XML files) into VuFind. Many of these records have links to full text (some records have multiple links, as our URL field is multivalued). Also, while most of the linked files are PDFs, many are huge (many are bigger than 50 MB; the biggest is 700 MB).

- Server: Windows 2012 R2 with VuFind 3.0.3, PHP 7, Apache 2.4, Tika app 1.14. Hardware is 8 CPUs at 2.13 GHz and 16 GB of RAM.

 

Lessons learnt:

 

- The original XML file was too big to process. We split the records into separate files no bigger than 10 MB each, to be sure they do not cause Tika to hang.

- The following PHP settings (php.ini) were set to avoid fatal errors during the import, because the PDFs were so big:

memory_limit = 1024M

max_execution_time = 1800

max_input_time = 240

- Windows certainly has issues managing so many Java processes at once, as happened during the import. We monitored the server's activity during the import with Process Explorer (http://www.sysinternals.com/) and found hangs due to peaks of Java/Tika activity when indexing the full text. We will increase the RAM and see if it helps. Process Explorer also helped by letting us raise the priority of the PHP/Java processes.
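For anyone wanting to reproduce the file-splitting step, a rough sketch in Python follows. It is not the script used for this import: the record element name `record`, the wrapper element `collection`, and the function name are assumptions, and it ignores XML namespaces (so a namespaced format would need the tag names adjusted).

```python
import xml.etree.ElementTree as ET


def split_records(source, record_tag="record", wrapper_tag="collection",
                  max_bytes=10 * 1024 * 1024):
    """Yield XML documents (as bytes) made of whole records, each kept
    under max_bytes unless a single record is larger than that."""
    header = f"<{wrapper_tag}>".encode("utf-8")
    footer = f"</{wrapper_tag}>".encode("utf-8")
    empty_size = len(header) + len(footer)
    chunk, size = [], empty_size
    # iterparse reads the input incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        elem.tail = None  # drop whitespace that followed the record
        data = ET.tostring(elem, encoding="unicode").encode("utf-8")
        elem.clear()  # release the record's children once serialized
        if chunk and size + len(data) > max_bytes:
            yield header + b"".join(chunk) + footer
            chunk, size = [], empty_size
        chunk.append(data)
        size += len(data)
    if chunk:
        yield header + b"".join(chunk) + footer
```

Each yielded chunk can be written to its own numbered file and fed to the importer one at a time.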

 

Anyway, the total time to import and index the stuff is huge: if we take out the time we had to stop due to various issues, I would say that roughly five days of real processing were needed. We also hope that the upcoming Tika app 1.15 may help, as it is described as improving the indexing of big files.

 

Best regards,

 

Xavier

 

From: Demian Katz [[hidden email]]
Sent: 13 January 2017 14:09
To: Library; [hidden email]
Subject: Re: [VuFind-General] Tika app hangs

 

I have not heard reports from others of this specific problem.

 

What kind of records are you importing when this occurs? Are you loading MARC records with links in the 856 field, or are you loading some sort of XML file? Either way, it might be useful to add some debugging logic to the indexing code so you can determine whether the hang is happening inside Tika, or if there is some problem with processing the output of Tika in the subsequent indexing code.

 

Another thing that might be interesting to try would be collecting a list of URLs and writing a stand-alone script to call Tika against all of them. This might help narrow down whether the problem is confined to Tika itself (in which case reaching out to the Tika community might be helpful) or whether it's some specific interaction between Tika and the indexing logic.
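A stand-alone test along those lines might look something like the following Python sketch. The jar file name, the 300-second limit, and the function names are assumptions; it relies on Tika app's `--text` option and on Tika app accepting a URL as its document argument, so adjust the command if your setup calls Tika differently.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run a command and classify the result as 'ok', 'error' or 'timeout'."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung
        # process is not left behind.
        return "timeout"
    except OSError:
        # For example, the executable could not be found.
        return "error"
    return "ok" if result.returncode == 0 else "error"


def check_urls(urls, tika_jar="tika-app-1.14.jar", timeout_s=300):
    """Call Tika once per URL and map each URL to its outcome.
    The jar path and the time limit are placeholder assumptions."""
    return {
        url: run_with_timeout(
            ["java", "-jar", tika_jar, "--text", url], timeout_s)
        for url in urls
    }
```

Feeding `check_urls` the list of full-text links and printing every URL that comes back as `timeout` would show whether the hang reproduces outside the indexer, and on which documents.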

 

Do you have access to a Linux server? Since Windows is a much less frequently used platform for VuFind, it's possible that others have not seen this issue because it is somehow Windows-specific. If you can run a test index on a Linux box, that might also be an interesting test.

 

Finally, where are the files you are indexing stored? Is it possible that the problem is not related to Tika or the indexer at all, but instead that HTTP connections are intermittently hanging on the content server?

 

Anyway, sorry for not having a more precise answer, but I hope that some of these ideas might be useful in tracking down the problem. If you are still stuck, please let me know your findings and I'll see if I can offer some additional ideas.

 

Good luck!

 

- Demian

 


From: Library <[hidden email]>
Sent: Friday, January 13, 2017 5:30 AM
To: [hidden email]
Subject: [VuFind-General] Tika app hangs

 

Hello,

 

We are experiencing the following issue with VuFind 3.0.3 in a Windows 2012 R2 server, with Tika app v. 1.14 (latest). When importing records that have full-text documents, often and randomly Apache Tika hangs. It simply looks like being frozen (an infinite loop, perhaps?). No error message is given and the temporary file remains at the /tmp folder until manually deleted.

There is plenty of server memory and CPU available, and it has never been put at stress. I have also inflated PHP memory settings (file size, memory timeout…) without  making any difference. That’s why I think the issue must be related with Java/Tika.

 

We have checked whether it was related to specific files (file type, length…) with no success. It seems that, while processing, for an unknown reason, sometimes it decides to freeze. Has anyone experienced this issue before?

 

Best regards,

 

---------------------------------

Xavier Berdaguer

Information Specialist

NATO STO CMRE

Technical Library

Viale S. Bartolomeo, 400

19126 La Spezia, Italy

Phone: +390187527361

[hidden email]

www.cmre.nato.int

 



Re: Tika app hangs

Luke O'Sullivan
Hi Andrew,

Thanks for your detailed response.

I'm about to start a project which will see some modifications of our
xsl file. If it's ok with you, I'll send you a copy when it's done so
that you can see if there's anything better we could be doing. It's not
very complicated so hopefully shouldn't take too long.

Kind regards,
Luke

On Fri, 2017-01-27 at 21:24 +0000, Andrew Krause wrote:

