searching and non-latin diacritics

searching and non-latin diacritics

Naomi Dushay
We have a lot of Russian materials, and searching them is very
sensitive to diacritics (see below).  Has anyone else encountered a
similar problem?  Have you solved it?

I'm guessing other non-Latin-1 diacritics are also affected; we have
a report of a problem with the macron, which is used in Asian languages.

On Aug 26, 2008, at 2:30 PM, Vitus Tang wrote:

>
> I'm a little confused by VuFind's handling of the ligature used in
> Russian transliteration, in indexing and searching. I can search this
> author successfully with or without the ligature:
>
> T͡Salikov, Feliks.
>
> Either way I get two records, which is the way it should be. But
> consider this other name:
>
> T͡Sigelʹman, I͡Akov.
>
> It has both the ligature and the miagkii znak (aka the soft sign, or
> "modifier letter prime" in Unicode). If I include them all in a VuFind
> search, I get 3 records, which is the correct result. If I delete the
> ligature between the T and the S, the search retrieves nothing. But if
> I keep the ligature between the T and the S, and instead delete the
> ligature between the I and the A, the search still retrieves 3
> records! I don't understand why it is required in one word and not
> required in another. If I keep both ligatures but delete the miagkii
> znak, the search also retrieves nothing. So it looks like you have to
> include the miagkii znak.
>
> And here's an example of the tverdyi znak (aka the hard sign, or
> "modifier letter double prime" in Unicode):
>
> Podʺi͡apolʹskiĭ, G.|q(Grigoriĭ),|d1926-
>
> If I search just the last name (Podʺi͡apolʹskiĭ), I retrieve 7 records
> (the correct result). If the ligature between the i and the a is
> deleted, the search retrieves nothing. The same is true if the tverdyi
> znak is deleted, or the miagkii znak is deleted, or the breve is
> deleted. It appears that all of them are required.
>
> -- Vitus
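
One way to see exactly which characters a failing query contains is to
list its non-ASCII code points. The following is a minimal Python
sketch, not anything VuFind provides: the `describe` helper is a
hypothetical name, and the sample string assumes the ligature is stored
as U+0361 and the breve as a separate combining mark. Data converted
from MARC-8 may instead use the half-mark pair U+FE20/U+FE21 for the
ligature, which is the kind of difference this check would expose.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for every non-ASCII character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 0x7F
    ]

# Escapes are used so the exact code points are unambiguous.
# Assumed encoding of the surname from the example above.
surname = "Pod\u02BAi\u0361apol\u02B9skii\u0306"

for code, name in describe(surname):
    print(code, name)
```

Running the same helper on the indexed form of a heading and on the
typed query makes it easy to compare the two character by character.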



Thanks,
Naomi

Naomi Dushay
[hidden email]




-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: searching and non-latin diacritics

Andrew Nagy-2
Naomi, our Unicode filter is not yet working in the current SVN trunk; once it is, it should solve some of your problems.  If you are working with other scripts such as Cyrillic or CJK (Chinese, Japanese, Korean), you will need to do some additional customization in Solr to support them.  Lucene has a CJK filter and, I think, a Cyrillic filter too.
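
To illustrate the kind of folding a Unicode filter could apply at both index and query time, here is a rough Python sketch. It is not VuFind's or Solr's filter; `fold` and `SIGNS` are hypothetical names, and the approach (decompose, drop combining marks, drop the two prime signs, lowercase) is only one possible policy.

```python
import unicodedata

# Spacing modifier letters used for the soft and hard signs.
# They are not combining marks, so they have to be dropped explicitly.
SIGNS = {"\u02B9", "\u02BA"}  # modifier letter prime, double prime

def fold(text):
    """Return a diacritic-insensitive, case-insensitive form of text."""
    # NFD splits precomposed letters (e.g. i-breve) into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        # Category Mn covers combining marks such as the ligature and breve.
        if unicodedata.category(ch) != "Mn" and ch not in SIGNS
    ]
    return "".join(kept).casefold()

# With the same folding on both sides, a query typed with or without
# the marks produces the same search key.
with_marks = fold("T\u0361Sigel\u02B9man")
without_marks = fold("TSigelman")
print(with_marks == without_marks)
```

Under this policy the searches Vitus describes would match regardless of which marks are typed, at the cost of no longer distinguishing headings that differ only by those marks.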

The folks at UVa have implemented the CJK filter since they have a large Chinese collection.  You might want to see how Blacklight is handling this.
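
For context, Lucene's CJK analysis indexes overlapping character bigrams rather than whitespace-delimited words, since CJK text is usually written without spaces. A rough Python illustration of the idea follows; the `cjk_bigrams` helper is hypothetical and is not the Lucene implementation, and it assumes its input is already a run of CJK characters.

```python
def cjk_bigrams(run):
    """Split a run of CJK characters into overlapping two-character tokens."""
    if len(run) < 2:
        # A lone character (or an empty run) cannot form a bigram.
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]

# Five characters yield four overlapping bigrams.
tokens = cjk_bigrams("\u4e2d\u56fd\u56fe\u4e66\u9986")
print(tokens)
```

Because the query is split the same way, any two-character word inside a longer run can be matched without a dictionary.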

Andrew

> -----Original Message-----
> From: [hidden email] [mailto:vufind-tech-
> [hidden email]] On Behalf Of Naomi Dushay
> Sent: Tuesday, August 26, 2008 8:20 PM
> To: [hidden email]
> Subject: [VuFind-Tech] searching and non-latin diacritics