Improve unicode to ascii ...

for punctuation, normalization (merging similar-looking characters into
the most common one), accents, and full conversion to ascii.
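The accent and full-conversion steps can be sketched roughly as below. This is an illustration only, not the code from this commit: the names `sketch_unaccent`, `sketch_replace_non_ascii` and the `_EXTRA` table are hypothetical, and the sketch assumes that decomposing with `unicodedata.normalize("NFD", ...)` and dropping combining marks is an acceptable way to strip accents.

```python
import unicodedata

# Hypothetical table for letters that have no canonical decomposition.
# The tables actually used by the commit are not shown on this page.
_EXTRA = {
    u"\u00f8": u"o",   # o with stroke
    u"\u00d8": u"O",   # O with stroke
    u"\u00e6": u"ae",  # ae ligature
    u"\u00c6": u"AE",  # AE ligature
}


def sketch_unaccent(text):
    """Decompose, drop combining marks, then map the leftovers."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = u"".join(c for c in decomposed if not unicodedata.combining(c))
    return u"".join(_EXTRA.get(c, c) for c in stripped)


def sketch_replace_non_ascii(text, repl=u"_"):
    """Replace every character outside the ASCII range with repl."""
    return u"".join(c if ord(c) < 128 else repl for c in text)
```

Running `sketch_unaccent` first and `sketch_replace_non_ascii` second would leave only ASCII characters, with anything that has no mapping (for example CJK text) replaced by the placeholder.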

Functions and tests moved into separate files.

Although more comprehensive, the code should run faster because it
eliminates several loops (including a loop with two unicodedata
references).
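One way a per-character Python loop can be avoided is a precomputed translation table passed to `translate`, which does the whole substitution in a single call. The sketch below assumes that approach; whether the commit does exactly this is not shown here, and `_PUNCT` and `sketch_asciify_punctuation` are hypothetical names with a deliberately tiny table.

```python
# Hypothetical mapping from code point to ASCII replacement.
_PUNCT = {
    0x2018: u"'",    # left single quotation mark
    0x2019: u"'",    # right single quotation mark
    0x201C: u'"',    # left double quotation mark
    0x201D: u'"',    # right double quotation mark
    0x2013: u"-",    # en dash
    0x2014: u"-",    # em dash
    0x2026: u"...",  # horizontal ellipsis
}


def sketch_asciify_punctuation(text):
    """Map typographic punctuation to ASCII with one translate() call."""
    return text.translate(_PUNCT)
```

Characters that are not in the table pass through unchanged, so the function can be chained with other conversion steps.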

This is intended to form the basis of future PRs to:
a. Clean up (simplify) the file naming code
b. Provide script function(s) for cleaning individual tags / file name
parts
c. Add support for translation / transliteration plugins (which I think
makes more sense than including them in Picard itself)
d. Support converting tags to ISO-8859-1 rather than ascii (since
that is what ID3, at least, supports)
e. Possibly add options for allowing / preventing normalization, and
possibly reorganise the options to centralise all encoding settings
on one page rather than spreading them across the metadata, tags and
file naming pages as at present (to be discussed).
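For item d, a conversion to ISO-8859-1 could look something like the sketch below. It is not part of this commit: `sketch_to_latin1` is a hypothetical name, and the sketch relies on the fact that ISO-8859-1 covers exactly the code points below 256, so anything above that range is replaced.

```python
def sketch_to_latin1(text, repl=u"_"):
    """Keep characters that ISO-8859-1 can encode; replace the rest."""
    return u"".join(c if ord(c) < 256 else repl for c in text)
```

The result can always be encoded with the `latin-1` codec, which is the property a tag writer limited to ISO-8859-1 would need.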
Sophist
2014-03-27 19:57:27 +00:00
parent 445bf56bad
commit 323d12892c
7 changed files with 682 additions and 82 deletions

@@ -5,34 +5,6 @@ import unittest
from picard import util

class UnaccentTest(unittest.TestCase):

    def test_correct(self):
        self.assertEqual(util.unaccent(u"Lukáš"), u"Lukas")
        self.assertEqual(util.unaccent(u"Björk"), u"Bjork")
        self.assertEqual(util.unaccent(u"Trentemøller"), u"Trentemoller")
        self.assertEqual(util.unaccent(u"小室哲哉"), u"小室哲哉")
        self.assertEqual(util.unaccent(u"Ænima"), u"AEnima")
        self.assertEqual(util.unaccent(u"ænima"), u"aenima")

    def test_incorrect(self):
        self.assertNotEqual(util.unaccent(u"Björk"), u"Björk")
        self.assertNotEqual(util.unaccent(u"小室哲哉"), u"Tetsuya Komuro")

class ReplaceNonAsciiTest(unittest.TestCase):

    def test_correct(self):
        self.assertEqual(util.replace_non_ascii(u"Lukáš"), u"Luk__")
        self.assertEqual(util.replace_non_ascii(u"Björk"), u"Bj_rk")
        self.assertEqual(util.replace_non_ascii(u"Trentemøller"), u"Trentem_ller")
        self.assertEqual(util.replace_non_ascii(u"小室哲哉"), u"____")

    def test_incorrect(self):
        self.assertNotEqual(util.replace_non_ascii(u"Lukáš"), u"Lukáš")
        self.assertNotEqual(util.replace_non_ascii(u"Lukáš"), u"Luk____")

class ReplaceWin32IncompatTest(unittest.TestCase):

    def test_correct(self):