|
Weeble Starscape Jedi


Joined: 25 Apr 2003 Posts: 1143 Location: Glasgow, Scotland

|
Posted: Mon Jul 14, 2008 10:26 pm Post subject: Scrabble lookup coding challenge! |
|
|
I think we derailed this thread over here with all the talk of algorithms for finding possible all the possible words that can be built with a given set of letters. Let's have a thread specifically for it.
I don't know if it's exactly the problem that Poo Bear had to solve, but the challenge seems to be this:
- Given a list of all allowed words, and a set of letters (possibly containing duplicates) find all arrangements of a subset of the letters that form valid words.
I'm working in Python right now, because it's easy, but if we're getting really competitive I might consider rewriting in C++. Here's what I did:
Code: |
alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
def bitcodeword(word):
"""Return a 26-bit bitfield where each 1 bit indicates
the corresponding letter of the alphabet is present
in the word. lsb is 'A', msb is 'Z'.
"""
bit=1
code=0
for letter in alphabet:
if letter in word:
code|=bit
bit<<=1
return code
def canmakeword(allowedletters,word):
"""Given a dictionary mapping letters to the number of
each of those letters available, determine if the
given word can be made by rearranging some or all of
the letters."""
remainingletters=dict(allowedletters)
for ch in word:
x = remainingletters[ch]-1
if x<0:
return False
remainingletters[ch]=x
return True
def lettercounters(letters):
"""Create a dictionary mapping letters to the number of
each of those letters found in the provided
string or sequence."""
counters=dict((ch,0) for ch in alphabet)
for l in letters:
counters[l]+=1
return counters
class WordTable(object):
def __init__(self,filename):
self.wordtable=[(word,bitcodeword(word))
for word in
(line.strip() for line in file(filename))
]
def wordlist(self):
"""Get all the words in the table."""
return (word for (word, code) in self.wordtable)
def naivefind(self,letters):
"""Given a set of letters, find all the words that
can be made by rearranging some or all of them.
This is pretty slow."""
allowedletters=lettercounters(letters)
return (word for word in
self.wordlist()
if canmakeword(allowedletters,word))
def filterwords(self,letters):
"""Given a set of letters, find all the words
that can be made by rearranging and multiplying
some or all of them."""
wordcode=bitcodeword(letters)
return (word for (word, code) in
self.wordtable
if code&wordcode==code)
def betterfind(self,letters):
"""Given a set of letters, find all the words that
can be made by rearranging some or all of them.
This is much faster, because it performs a fast
test on each candidate word that will eliminate
all those that need letters we don't have."""
allowedletters=lettercounters(letters)
return (word for word in
self.filterwords(letters)
if canmakeword(allowedletters,word))
|
Usage: Code: |
wt=WordTable("sowpods.txt")
list(wt.betterfind("STRAIN"))
['AI', 'AIN', 'AINS', 'AIR', 'AIRN', 'AIRNS', 'AIRS', 'AIRT', 'AIRTS', 'AIS', 'AIT', 'AITS', 'AN', 'ANI', 'ANIS', 'ANT', 'ANTI', 'ANTIS', 'ANTS', 'AR', 'ARIS', 'ARS', 'ART', 'ARTI', 'ARTIS', 'ARTS', 'AS', 'ASTIR', 'AT', 'IN', 'INS', 'INSTAR', 'INTRA', 'IS', 'ISNA', 'IT', 'ITA', 'ITAS', 'ITS', 'NA', 'NARIS', 'NAS', 'NAT', 'NATIS', 'NATS', 'NIS', 'NIT', 'NITS', 'RAI', 'RAIN', 'RAINS', 'RAIS', 'RAIT', 'RAITS', 'RAN', 'RANI', 'RANIS', 'RANT', 'RANTS', 'RAS', 'RAST', 'RAT', 'RATS', 'RIA', 'RIANT', 'RIAS', 'RIN', 'RINS', 'RIT', 'RITS', 'SAI', 'SAIN', 'SAINT', 'SAIR', 'SAN', 'SANT', 'SANTIR', 'SAR', 'SARI', 'SARIN', 'SAT', 'SATI', 'SATIN', 'SI', 'SIN', 'SIR', 'SIT', 'SITAR', 'SNAR', 'SNIRT', 'SNIT', 'SRI', 'ST', 'STAIN', 'STAIR', 'STAR', 'STARN', 'STIR', 'STRAIN', 'STRIA', 'TA', 'TAI', 'TAIN', 'TAINS', 'TAIS', 'TAN', 'TANS', 'TAR', 'TARN', 'TARNS', 'TARS', 'TARSI', 'TAS', 'TI', 'TIAR', 'TIARS', 'TIN', 'TINS', 'TIS', 'TRAIN', 'TRAINS', 'TRANS', 'TRIN', 'TRINS', 'TSAR']
|
I can't really tackle the slow loading problem. It takes a few seconds to load on my laptop. But what I'm interested in here is optimising the search time. The naive search takes a second or so for most searches. (My favourite test data is "RESTRAINTS".) The optimised search takes about a tenth of that. I imagine it could be substantially faster again in C++. The only trick for optimisation is that I make a 26-bit bitfield for each word in the dictionary, where each bit indicates whether the corresponding letter appears in the word. It doesn't indicate how many times the letter appears, just that it appears at least once. This lets me discard most words very quickly, but I'm fundamentally still ploughing through the whole dictionary.
I'm curious if there's a better algorithm for the problem. Colinvella suggests a trie combined with permutations of the input letters, but as Slyh points out, that will suffer from huge numbers of permutations for large numbers of letters, far more than there are words in the dictionary. I wonder if there's some way to be smarter, perhaps by preprocessing or cross-indexing the word table in some interesting way. Or maybe we could have the permutation generator give up on a branch of the permutation every time it can see that it's not going to find a word in the dictionary. Actually, that sounds quite promising.
PS - I've not really Googled the problem yet, because I'm enjoying it for the challenge, but I would not be surprised if there are well-known solutions for this problem.
EDIT - I fixed the WordTable constructor. It was ignoring the filename argument and always loading "sowpods.txt". |
|
|