splitOnCaseChange=1 for German words?


Martin Fuchs

Hello all, especially Germans,

The VuFind default in the schema is splitOnCaseChange="1", set in

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

 

This splits CamelCase words, for example:

VuFind -> Vu  Find

GmbH  -> Gmb  H   (GmbH = Gesellschaft mit beschränkter Haftung)

 

Searches for gmbh or GMBH give the same results as each other, but not the same results as "GmbH" (gmbh returns fewer results than GmbH). Our readers would expect the same results for GmbH, GMBH, gmbh and Gmbh, because they are accustomed to case-insensitive searches. We therefore intend to set splitOnCaseChange="0".
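
As a sketch of the change described above: only the splitOnCaseChange attribute differs from the filter line quoted earlier, and all other attributes are left as they are.

```xml
<!-- Same filter as in the default schema, with case-change splitting turned off -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
```

If the field type defines separate index and query analyzers, the attribute presumably needs to be changed in both, and a change to index-time analysis only takes effect for records that are reindexed afterwards.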

Does anyone have experience with this setting? Are there frequently used CamelCase words for which splitting would be better? One example would be "CamelCase": with splitOnCaseChange="1", searches for "Camel Case" and "CamelCase" should return the same results, which is nice but less important than the problem with GmbH.

 

Martin


------------------------------------------------------------------------------
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: splitOnCaseChange=1 for German words?

Demian Katz

This is definitely one of those settings for which there is no perfect answer – and often it is at the root of confusing/unexpected search behavior. As you say, in many cases, true case-insensitivity is a good way to make results more predictable and consistent – but then you potentially lose some search breadth in situations like your CamelCase example. It comes down to weighing costs and benefits, and in your case, it does sound like going for pure case insensitivity may be best. However, it would definitely be interesting to hear from other German libraries for their experiences!

 

- Demian

 

From: Dr. Martin Fuchs [mailto:[hidden email]]
Sent: Wednesday, January 11, 2017 9:52 AM
To: [hidden email]
Subject: [VuFind-Tech] splitOnCaseChange=1 for German words?

 


Re: splitOnCaseChange=1 for German words?

Ere Maijala
For what it's worth, we're lowercasing everything and running with
splitOnCaseChange="0" (see
<https://github.com/NatLibFi/NDL-VuFind-Solr/blob/master/vufind/biblio/conf/schema.xml>).
This seems to work best for us so far.
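
A minimal sketch of what "lowercasing everything" combined with splitOnCaseChange="0" might look like as a field type. This is not copied from the linked schema; the field type name text_ci and the choice of tokenizer are assumptions made purely for illustration.

```xml
<!-- Hypothetical field type; the name and the tokenizer are illustrative assumptions -->
<fieldType name="text_ci" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> element is used for both indexing and querying -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Still splits on intra-word delimiters, but no longer on case changes -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <!-- Lowercase every token so that GmbH, GMBH and gmbh become the same term -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```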

--Ere

On 11.1.2017 at 17.29, Demian Katz wrote:

> This is definitely one of those settings for which there is no perfect
> answer – and often it is at the root of confusing/unexpected search
> behavior. As you say, in many cases, true case-insensitivity is a good
> way to make results more predictable and consistent – but then you
> potentially lose some search breadth in situations like your CamelCase
> example. It comes down to weighing costs and benefits, and in your case,
> it does sound like going for pure case insensitivity may be best.
> However, it would definitely be interesting to hear from other German
> libraries for their experiences!
>
>
>
> - Demian

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: splitOnCaseChange=1 for German words?

Uwe Reh
On 11.01.2017 at 17:15, Ere Maijala wrote:
> For what it's worth, we're lowercasing everything ...
+1

Ere is right. The solr.WordDelimiterFilterFactory (WDF) has more
drawbacks than advantages. In the examples given for the WDF (a shop for
electronic hardware) it is extremely useful. In our domain (bibliographic
data records) I can't see a real advantage in using the WDF
(not even for "Literaturlexikon" vs. "Literatur-Lexikon" vs. "Literatur
Lexikon").

@Martin,
if you think the WDF is needed for your installation, there are two
possible workarounds:

* Use the WDF parameter 'preserveOriginal'
>https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
Preserving the original form may solve your problem.
But take care: during our tests with the WDF (Solr 4.2) we ran into problems.
The 'mm' parameter of the edismax query parser couldn't handle the
additional tokens produced by the WDF.

* Mask known problems
Solr has several filters for masking known problems. We use them to make
terms like "C++" searchable ("C++" -> "cplusplus").
> https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.MappingCharFilterFactory
> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PatternReplaceFilter
Or, a bit more complex:
> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-SynonymFilter
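
The two workarounds above could be sketched roughly as follows. The field type name text_masked, the tokenizer, and the mapping file name mapping-special.txt are hypothetical and chosen only for illustration; the factory classes and their attributes are the ones described on the linked pages.

```xml
<!-- Hypothetical field type combining both workarounds; names are illustrative -->
<fieldType name="text_masked" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Masking: rewrite known problem strings before tokenizing.
         mapping-special.txt (hypothetical file) would contain lines such as:
           "C++" => "cplusplus"
           "c++" => "cplusplus"
         Both spellings are listed because the mapping is applied before lowercasing. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-special.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal: keep the unsplit token (e.g. GmbH) alongside the generated parts -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since preserveOriginal adds further tokens at the same position, it would be worth checking the behaviour of 'mm' on your own Solr version before relying on this.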

Uwe


Re: splitOnCaseChange=1 for German words?

Martin Fuchs
In reply to this post by Ere Maijala
Thank you, Demian and Ere, for your discussion and advice.
We will go with splitOnCaseChange="0".
We would still like to hear about a German VuFind user's experience. Does
anyone have any?
Martin

-----Original Message-----
From: Ere Maijala [mailto:[hidden email]]
Sent: Wednesday, January 11, 2017 17:16
To: [hidden email]
Subject: Re: [VuFind-Tech] splitOnCaseChange=1 for German words?

For what it's worth, we're lowercasing everything and running with
splitOnCaseChange="0" (see
<https://github.com/NatLibFi/NDL-VuFind-Solr/blob/master/vufind/biblio/conf/schema.xml>).
This seems to work best for us so far.

--Ere


Re: splitOnCaseChange=1 for German words?

Uwe Reh
Hi Martin,

It seems you missed my post in this thread from 2017-01-12 09:47:21.

Uwe



On 17.01.2017 at 15:31, Dr. Martin Fuchs wrote:
> Thank you Demian and Ere for your discussion and advice,
> We will take splitOnCaseChange="0".
> We are still waiting for a German VuFind-User's experience. Are there any
> experiences?
> Martin
>
