regex - Tokenizing text with a regular expression in Python
I have the following code and want to tokenize the text in a file using a regular expression:
```python
def tokenize():
    infile = codecs.open('test_test.txt', 'r', encoding='utf-8')
    text = infile.read()
    infile.close()
    words = []
    with io.open('test_test.txt', 'r', encoding='utf-8') as csvfile:
        text = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
        for item in text:
            for word in item:
                words.append(word)
    tregex = re.compile(ur'[?&/\'\r\n]', re.IGNORECASE)
    newtext1 = tregex.sub(' ', text)
    newtext = re.sub(' +', ' ', newtext1)
    words = re.split(r' ', newtext)
    print words
```
But I get this error:
```
Traceback (most recent call last):
  File "d:\kksc\kksc.py", line 150, in oncheckspell
    tokenize()
  File "d:\kksc\kksc.py", line 32, in tokenize
    newtext1 = tregex.sub(' ', text)
TypeError: expected string or buffer
```
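For context, `re.sub` raises this `TypeError` whenever the thing passed to it is not a string. A minimal reproduction (sketched in Python 3, where the message reads "expected string or bytes-like object" instead of "expected string or buffer"), with a hypothetical list of rows standing in for what a CSV reader yields:

```python
import re

pat = re.compile(r"[?&/'\r\n]")
rows = [['hello', 'world'], ['foo']]  # a csv reader yields lists of strings, not one string

try:
    pat.sub(' ', rows)  # passing a list instead of a string triggers the TypeError
except TypeError as err:
    print('TypeError:', err)
```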
In

```python
newtext1 = tregex.sub(' ', text)
```

`text` is a two-dimensional array of strings (the rows yielded by `unicode_csv_reader`), while `sub` expects a string. Did you mean:

```python
newtext1 = tregex.sub(' ', word)
```
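One way to restructure the function is to apply the regex to the file contents as a single string, rather than to the CSV reader object. A minimal sketch in Python 3 (the question's code is Python 2, hence the `ur''` prefix and `print` statement there); `tokenize_text` is a hypothetical helper operating on an already-read string:

```python
import re

def tokenize_text(text):
    """Split a string into words, treating the listed punctuation as spaces."""
    # Same character class as in the question, minus the Python-2-only ur'' prefix
    tregex = re.compile(r"[?&/'\r\n]", re.IGNORECASE)
    cleaned = tregex.sub(' ', text)       # replace unwanted characters with spaces
    cleaned = re.sub(' +', ' ', cleaned)  # collapse runs of spaces
    return cleaned.strip().split(' ')

print(tokenize_text("hello?world&foo/bar"))  # -> ['hello', 'world', 'foo', 'bar']
```

To use it with a file, read the contents once (`text = infile.read()`) and pass that string in, instead of reassigning `text` to the CSV reader.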