SOLR synonym file

Naomi Dushay
I had a stray thought that *might* have some legs:

Can we do anything with the SOLR synonyms.txt  for authority  
information?  I don't know how well it might scale (I just learned we  
have 1.8 million authority records), but it's something to think about.
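
To make the idea concrete, here is a rough sketch of how authority records could be turned into synonyms.txt lines. It assumes each record has already been reduced to a preferred heading plus a list of variant headings (the record layout and function names are made up for illustration), and that commas and "=>" inside a heading need a backslash escape, as in the sample synonyms.txt that ships with Solr. Treat it as a starting point for experimentation, not a tested loader.

```python
def escape_term(term):
    """Backslash-escape characters that act as separators in synonyms.txt."""
    term = term.replace("\\", "\\\\")
    term = term.replace(",", "\\,")
    return term.replace("=>", "\\=>")


def synonym_line(preferred, variants):
    """Build one comma-separated synonym group, preferred heading first.

    Returns None when there is nothing to equate the preferred heading with.
    """
    seen = []
    for term in [preferred] + list(variants):
        term = term.strip()
        if term and term not in seen:
            seen.append(term)
    if len(seen) < 2:
        return None
    return ", ".join(escape_term(term) for term in seen)


# Hypothetical authority data: (preferred heading, variant headings).
records = [
    ("Ivan Czar of Russia, 1530-1584", ["Ivan the Terrible"]),
]

lines = [line for line in (synonym_line(p, v) for p, v in records) if line]
print("\n".join(lines))
```

Putting the preferred heading first means that a filter configured with expand="false" would reduce every variant to that heading.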

Naomi Dushay
[hidden email]




-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: SOLR synonym file

Andrew Nagy-2
Wow - I never thought of this.  This actually might be a perfect solution.  Let's think out loud about this a bit more.

Andrew

> -----Original Message-----
> From: [hidden email] [mailto:vufind-tech-
> [hidden email]] On Behalf Of Naomi Dushay
> Sent: Monday, July 28, 2008 1:03 PM
> To: [hidden email]
> Subject: [VuFind-Tech] SOLR synonym file
>
> I had a stray thought that *might* have some legs:
>
> Can we do anything with the SOLR synonyms.txt  for authority
> information?  I don't know how well it might scale (I just learned we
> have 1.8 million authority records), but it's something to think about.
>
> Naomi Dushay
> [hidden email]


Re: SOLR synonym file

Naomi Dushay

Reading recent postings on the solr-user list:

1.  we need to use the synonym file at index time, not query time (this is configured in the filters/analyzers in schema.xml).

2.  someone used 7000 synonyms at index time without trouble.

3.  this may not behave exactly as we expect, but the solr-user post forwarded below may offer a solution.

4.  there is also a possibility of having a separate index for "synonyms" or authority information and querying the synonym index to form the query.


All in all, I suspect experimentation is indicated.
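
For option 4, the lookup-then-query flow might look roughly like the sketch below. The authority "index" is just an in-memory dict standing in for a real second index, and the field name "author" and the function names are assumptions for illustration only.

```python
# Hypothetical stand-in for a separate authority index:
# normalized variant heading -> preferred heading.
AUTHORITY_INDEX = {
    "ivan the terrible": "Ivan Czar of Russia, 1530-1584",
}


def quote_phrase(text):
    """Wrap text in double quotes, escaping backslashes and embedded quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def expand_query(user_query, authority_index, field="author"):
    """Look the query up in the authority index and OR in the preferred heading."""
    terms = [user_query]
    preferred = authority_index.get(user_query.strip().lower())
    if preferred and preferred != user_query:
        terms.append(preferred)
    return " OR ".join(f"{field}:{quote_phrase(term)}" for term in terms)


print(expand_query("Ivan the Terrible", AUTHORITY_INDEX))
```

The bib index stays untouched in this variant; the cost is one extra lookup per search.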

Begin forwarded message:
From: "Laurent Gilles" <[hidden email]>
Date: July 28, 2008 9:02:19 AM PDT
Subject: RE: solr synonyms behaviour
Reply-To: [hidden email]

Hi,

I ran into the same issue regarding multi-word synonyms.
Let's say we have a synonyms list like:

club, bar, night cabaret

Now if we have a document containing "club", with the default synonym
filter behaviour (expand=true) we will end up in the Lucene index with a
document containing "club|bar|night cabaret".
So if the user searches for "night", the query will look for "night" in
the index and will match our document, since it was "enriched" at
index time and really does contain the token "night".
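
The effect described above can be reproduced with a toy model. This is not Solr's implementation: it ignores token positions and only triggers on single-word terms, and the names are invented for the illustration. It only shows why the word "night" ends up in the indexed document.

```python
# One synonym group, as in the synonyms list above.
SYNONYM_GROUPS = [["club", "bar", "night cabaret"]]


def expand_tokens(tokens, groups):
    """Toy model of index-time expansion: add every word of every synonym."""
    indexed = set()
    for token in tokens:
        indexed.add(token)
        for group in groups:
            if token in group:
                for term in group:
                    # A multi-word synonym is indexed as its individual words.
                    indexed.update(term.split())
    return indexed


indexed = expand_tokens("the club downtown".split(), SYNONYM_GROUPS)
print(sorted(indexed))
print("night" in indexed)
```

The document never mentioned "night", yet a search for that single word now finds it.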

The only valid solution I've found was to create a field type used
exclusively for synonym search, where:

At index time:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
At query time:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>

And with a customised synonyms file that looks like:

SYN_ID_1, club, bar, night cabaret

So for our document containing "club", the synonym filter at index time with
expand=false will replace every matching token or expression in the document
with SYN_ID_1.

At query time, when a user searches for "night", nothing matches, not even
through "normal" search: "night" does not stand alone in the synonym
definition, and every document containing "club" or "bar" was "enriched"
with SYN_ID_1 and NOT with "club|bar|night cabaret". The final indexed
document therefore contains no isolated tokens from a synonym expression
that could be matched later without notice.

In order to match our document containing "club", the user HAS TO type the
entire expression "night cabaret", and not only part of the expression.
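
Here is a small simulation of that reduction, again as a toy model and not Solr's code. It assumes whitespace tokenization, case-insensitive matching, and longest-match-first replacement, and the function names are made up. The same reduction is applied to the document and to the query.

```python
# First entry is the canonical ID that every other term is reduced to.
SYNONYMS = ["SYN_ID_1", "club", "bar", "night cabaret"]


def build_map(group):
    """Map each term (as a tuple of lowercase words) to the canonical ID."""
    canonical = group[0]
    return {tuple(term.lower().split()): canonical for term in group}


def reduce_tokens(text, mapping):
    """Replace whole synonym terms with the canonical ID, longest match first."""
    tokens = text.lower().split()
    longest = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in mapping:
                out.append(mapping[key])
                i += n
                break
        else:
            # No synonym term starts here, so keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def matches(document, query, mapping):
    """True when every reduced query token occurs in the reduced document."""
    doc_tokens = set(reduce_tokens(document, mapping))
    return all(token in doc_tokens for token in reduce_tokens(query, mapping))


mapping = build_map(SYNONYMS)
for query in ("bar", "night cabaret", "night"):
    print(query, matches("the club downtown", query, mapping))
```

In this model "bar" and "night cabaret" both find the "club" document, while "night" on its own does not.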


Of course, as I said before, this field is used exclusively for synonym
matching, so it requires another field for normal stemmed full-text search
to supply the normal results. This approach also gives us the opportunity to
set up boosting separately for stemmed full-text search vs. synonym search,
let's say:

title_stem:"club"^100 OR title_syns:"club"^10

I hope I have been clear, even if I don't believe so. The fact is that this
approach fixed our problem, since we didn't want synonym matching if
the user only types part of a synonym expression.
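
For what it's worth, the two-field boosted query mentioned above could be assembled by a small helper like this one. The field names title_stem and title_syns and the boosts 100 and 10 are taken from the example; the function itself is only an illustrative sketch.

```python
def boosted_query(term, stem_field="title_stem", syn_field="title_syns",
                  stem_boost=100, syn_boost=10):
    """Search the stemmed field and the synonym field with separate boosts."""
    # Quote the term as a phrase, escaping backslashes and embedded quotes.
    phrase = '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return (f"{stem_field}:{phrase}^{stem_boost}"
            f" OR {syn_field}:{phrase}^{syn_boost}")


print(boosted_query("club"))
```

Keeping the boosts as parameters makes it easy to experiment with how strongly synonym hits should rank against direct hits.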

Regards,
Laurent



-----Original Message-----
From: swarag [[hidden email]]
Sent: Friday, July 25, 2008 11:48 PM
To: [hidden email]
Subject: Re: solr synonyms behaviour



swarag wrote:


Yonik Seeley wrote:

On Tue, Jul 15, 2008 at 2:27 PM, swarag <[hidden email]> wrote:
> To my understanding, this means I am using synonyms at index time and NOT
> query time. And yet, I am still having these problems with synonyms.

Can you give a specific example?  Use debugQuery=true to see what the
resulting query is.
You can also use the admin analysis page to see the output of the
index and query analyzers.
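
A request with debugQuery=true can be put together as in the sketch below. The base URL is an assumption (a local Solr on its usual port with the standard select handler); adjust it to the installation at hand.

```python
from urllib.parse import urlencode


def debug_query_url(query, base_url="http://localhost:8983/solr/select"):
    """Build a select URL that asks Solr to include query debugging output."""
    params = {"q": query, "debugQuery": "true"}
    return base_url + "?" + urlencode(params)


print(debug_query_url("name:night"))
```

The debug section of the response includes the parsed query, which shows whether any synonym was applied at query time.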

-Yonik



So it sounds like using the '=>' operator for synonyms that may or may not
contain multiple words causes problems.  So I changed my synonyms.txt to
the following:

club,bar,night cabaret

In schema.xml, I now have the following:
   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldType>

As you can see, 'night cabaret' is my only multi-word synonym term.
Searches for 'bar' and 'club' now behave as expected.  However, if I
search for JUST 'night' or JUST 'cabaret', it looks like it is still using
the synonyms 'bar' and 'club', which is not what is desired.  I only want
'bar' and 'club' to be returned if a search for the complete 'night
cabaret' is submitted.

Since query-time synonyms are turned "off", the resulting
parsedquery_toString is simply "name:night", "name:cabaret", etc.

Thanks!


We are still having problems. Searches for single words that are part of a
multi-word synonym seem to be affected by the synonyms, when they should
not.  Anyone else experience this?  If not, would you mind explaining your
config and the format of your synonyms.txt file?
-- 
View this message in context:
http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18660135.html
Sent from the Solr - User mailing list archive at Nabble.com.

Naomi Dushay





Re: SOLR synonym file

Andrew Nagy-2

Hmm … as I think about this more, I think this is probably not the best approach.  We could potentially be implementing more than a hundred thousand synonyms, and this would limit the faceting and searching capabilities.  We would lose that more granular control.

I'm thinking the best approach is still to store the auth data in its own index and munge the auth data into the bib index at index time.
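
As a rough illustration of the "munge at index time" step: before a bib record is sent to the bib index, its headings are looked up in the authority data and the variant headings are copied into a separate field. The dict-based records, the field names "author" and "author_variants", and the function name are all assumptions made for this sketch.

```python
def enrich_bib_record(bib, authority_index,
                      source_field="author", target_field="author_variants"):
    """Return a copy of the bib record with variant headings added."""
    enriched = dict(bib)
    variants = []
    for heading in bib.get(source_field, []):
        for variant in authority_index.get(heading, []):
            if variant not in variants:
                variants.append(variant)
    enriched[target_field] = variants
    return enriched


# Hypothetical authority data: preferred heading -> variant headings.
authority_index = {
    "Ivan Czar of Russia, 1530-1584": ["Ivan the Terrible"],
}
bib = {"id": "1", "author": ["Ivan Czar of Russia, 1530-1584"]}

print(enrich_bib_record(bib, authority_index))
```

Because the variants sit in their own field, they can be searched, boosted, or left out of facets independently of the original heading.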

 

Andrew

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Naomi Dushay
Sent: Monday, July 28, 2008 2:29 PM
To: [hidden email]
Subject: Re: [VuFind-Tech] SOLR synonym file

 

 


Re: SOLR synonym file

Mark Triggs-2
Hi all,

We're just starting to think about this too, so I'll post our thoughts
on the topic once we... er... have some.

I suspect we'll end up merging authority data into our browse in some
fashion, but I'm not yet sure about how we'll fold it into search or the
record view.  I'd personally like it if you could search for, say, "Ivan
the Terrible" and have it indicate in some fashion that "Ivan Czar of
Russia, 1530-1584" was the preferred term, but maybe a simple extension
to the "Did you mean" would do the job here.
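
Something along these lines is what I have in mind for the "Did you mean" extension. It is only a sketch: the record layout and function names are invented, and matching is by exact heading after lowercasing and whitespace cleanup.

```python
def normalize(heading):
    """Lowercase and collapse whitespace so lookups are forgiving."""
    return " ".join(heading.lower().split())


def build_variant_lookup(records):
    """Map each normalized variant heading to its preferred heading."""
    lookup = {}
    for preferred, variants in records:
        for variant in variants:
            lookup[normalize(variant)] = preferred
    return lookup


def did_you_mean(query, lookup):
    """Return a suggestion when the query is a known variant, else None."""
    preferred = lookup.get(normalize(query))
    if preferred is None or normalize(preferred) == normalize(query):
        return None
    return f'Did you mean "{preferred}"?'


# Hypothetical authority data: (preferred heading, variant headings).
lookup = build_variant_lookup([
    ("Ivan Czar of Russia, 1530-1584", ["Ivan the Terrible"]),
])
print(did_you_mean("ivan  the terrible", lookup))
```

The search itself would run unchanged; the suggestion is just shown alongside the results.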

So far all I've done is load our auth data into a Lucene index to verify
that I can actually search it at a reasonable speed.  Our indexes only
come out to about 80 megs, so it looks like I can pretty much just load
them into memory.  Always nice :o)

Cheers,

Mark


Andrew Nagy <[hidden email]> writes:

> Hmmm … as I think about this more - I think this is probably not the
> best approach.  We could be potentially implementing more than a
> hundred thousand synonyms and this would limit the faceting and
> searching capabilities.  We would lose that more granular control.
>
> I'm thinking the best approach is to still store the auth data in its
> own index and munge the auth data into the bib index on index time.

--
Mark Triggs
<[hidden email]>

