Case tagging and python

Fred Mangusta
Posted: Thu Jul 31, 2008 9:00 am Post subject: Case tagging and python
Hi,
I'm relatively new to programming in general, and totally new to python, and I've been told that this language is particularly good for what I need to do. Let me explain. I have a large corpus of English text, in the form of several files.
First of all I would like to scan each file. Then, for each word I find, I'd like to examine its case status, and write the (lower case) word back to another text file - with, appended, a tag stating the case it had in the original file.
An example. Suppose we have three possible "case conditions":
- all lowercase
- all uppercase
- initial uppercase only
Three corresponding tags for each of these might be, respectively:
- nocap
- allcaps
- cap
Therefore, given the string
"The Chairman of BP was asleep"
I would like to produce
"the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap"
and write this into a file.
I have the following algorithm in mind:
- open input file
- open output file
- get line of text
- split line into words
- for each word
  - tag = checkCase(word)
  - newword = lowercase(word) + append(tag)
- rejoin words into line
- write line into output file
Now, I managed to write the following initial code
lines = 0
for s in file:
    lines += 1
    if lines % 1000 == 0:
        print('%d lines' % lines)  # we print the total lines
    sent = s.split()  # split string by spaces
    #...
But then I don't quite know what would be the fastest/best way to do this. Could I use the join function to reform the string? And, regarding the casetest() function, what do you suggest to do? Should I test each character of each word, or are there faster methods?
Thanks very much,
F.

Guest
Posted: Thu Jul 31, 2008 11:13 am Post subject: Re: Case tagging and python
Fred Mangusta:
Quote: Could I use the join function to reform the string?
You can write a function to split the words, for example one that also takes full stops and other punctuation into account.
Quote: And, regarding the casetest() function, what do you suggest to do?
Python strings have isupper, islower, and istitle methods; they may be enough for your purposes.
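As a rough illustration of those three methods, the sketch below shows what they return for a few words. The sample words and the helper name case_flags are made up for this example; isupper, islower and istitle themselves are built-in str methods. It is written with Python 3 in mind.

```python
# Sample words are illustrative only; the three methods are built-in str methods.
samples = ["the", "BP", "Chairman", "iPhone", "42", "asleep,"]

def case_flags(word):
    """Return the tuple (islower, isupper, istitle) for one word."""
    return (word.islower(), word.isupper(), word.istitle())

for word in samples:
    print(word, case_flags(word))
```

Two things are worth noticing. Tokens such as "iPhone" or "42" satisfy none of the three tests, so a tagger needs some fallback. A one-letter capital such as "A" satisfies both isupper and istitle, so the order in which the tests are made decides its tag.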
Quote:
- open input file
- open output file
- get line of text
- split line into words
- for each word
  - tag = checkCase(word)
  - newword = lowercase(word) + append(tag)
- rejoin words into line
- write line into output file
It seems good. To join the words of a line there's str.join. Now you can write a function that splits lines, and another to check the case, then you can show them to us.
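For a single line, the outline could be sketched roughly as follows. The names check_case and tag_line, and the fallback tag "other", are assumptions made for this example and do not come from the thread; the other tag names follow the original question.

```python
def check_case(word):
    """Return a case tag for one word (tag names follow the original question)."""
    if word.islower():
        return "nocap"
    if word.isupper():
        return "allcaps"
    if word.istitle():
        return "cap"
    return "other"  # assumption: fallback for mixed case, digits, etc.

def tag_line(line):
    """Split on whitespace, tag each lower-cased word, rejoin with single spaces."""
    tagged = [word.lower() + "/" + check_case(word) for word in line.split()]
    return " ".join(tagged)

print(tag_line("The Chairman of BP was asleep"))
# the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap
```

Collecting the tagged words in a list and calling str.join once avoids rebuilding the output string for every word.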
Still, I don't see how much use your output file can have :-)
Bye, bearophile

Fred Mangusta
Posted: Thu Jul 31, 2008 6:11 pm Post subject: Re: Case tagging and python
Hi, I came up with the following procedure
import re

ALLCAPS = "|ALLCAPS"
NOCAPS = "|NOCAPS"
MIDCAPS = "|MIDCAPS"
CAPS = "|CAPS"
DIGIT = "|DIGIT"

def test_case(w):
    w_out = ''
    if w.isalpha():  # a word with a comma attached does not get in here
        if w.isupper():
            w_out = w.lower() + ALLCAPS
            return w_out
        elif w.islower():
            w_out = w + NOCAPS
            return w_out
        else:
            m = re.match("^[A-Z]", w)
            if m:
                w_out = w.lower() + CAPS  # not sure about this..
                return w_out
            else:
                w_out = w.lower() + MIDCAPS
                return w_out
    elif w.isdigit():
        w_out = w + DIGIT
        return w_out
Called in here:
#=========================
lines = 0
for s in file:
    lines += 1
    if lines % 1000 == 0:
        print('%d lines' % lines)
    #s = s.replace(",", "")
    sent = s.split()  # split string by spaces
    for w in sent:
        wout = test_case(w)
#==========================
But I don't know if I'm doing something sensible? Moreover:
- test_case has problems, because whenever it finds a punctuation character attached to a word, it doesn't tag it. I was thinking of cleaning the punctuation out of the line before using split on it (see the commented-out row), but I don't know whether I have to call replace() once for every punctuation character?
- Is there a way to reprint the tagged text in a file including punctuation?
- Is my test_case a good start? Would you use regular expressions?
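One possible way to approach the punctuation questions, offered only as a sketch: instead of calling replace() once per punctuation character, a regular expression can split each line into word tokens and punctuation tokens, so the punctuation is kept and written back untagged. The pattern, the names check_case and tag_line_keep_punct, and the "other" fallback tag are hypothetical choices for this example, and the tag names are those from the first post.

```python
import re

# Assumption: a "word" is a run of letters, digits or underscores (\w+);
# any other non-space character becomes its own punctuation token.
TOKEN_RE = re.compile(r"(\w+)|([^\w\s])")

def check_case(word):
    """Return a case tag for one word (tag names follow the first post)."""
    if word.islower():
        return "nocap"
    if word.isupper():
        return "allcaps"
    if word.istitle():
        return "cap"
    return "other"  # assumption: fallback for mixed case, digits, etc.

def tag_line_keep_punct(line):
    """Tag the words of a line and pass punctuation through untagged."""
    out = []
    for match in TOKEN_RE.finditer(line):
        word, punct = match.group(1), match.group(2)
        if word is not None:
            out.append(word.lower() + "/" + check_case(word))
        else:
            out.append(punct)
    return " ".join(out)

print(tag_line_keep_punct("The Chairman, of BP, was asleep."))
# the/cap chairman/cap , of/nocap bp/allcaps , was/nocap asleep/nocap .
```

A limitation of this simple pattern is that a word containing an apostrophe, such as "don't", is split into three tokens, so a real corpus would probably need a more careful tokenizer.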
Thanks very much! F.

Guest
Posted: Thu Jul 31, 2008 8:21 pm Post subject: Re: Case tagging and python
I second the idea of just using the islower(), isupper(), and istitle() methods. So, you could have a function - let's call it checkCase() - that returns a string with the tag you want...
def checkCase(word):
    if word.islower():
        tag = 'nocap'
    elif word.isupper():
        tag = 'allcaps'
    elif word.istitle():
        tag = 'cap'
    else:
        tag = 'other'  # fallback added here: mixed case (e.g. iPhone) or no letters (e.g. 42)
    return tag
Then let's take an input file and pass every word through the function...
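f = open('path:to:file', 'r')  # placeholder for the input file path
corpus_text = f.read()
f.close()

tagged_corpus = ''
all_words = corpus_text.split()

for w in all_words:
    tagtext = checkCase(w)
    # lower-case the word, as the original question asks
    tagged_corpus = tagged_corpus + ' ' + w.lower() + '/' + tagtext

output_file = open('path:to:file', 'w')  # placeholder for the output file path
output_file.write(tagged_corpus)
output_file.close()
print('All Done!')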
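For a large corpus, reading the whole file at once and growing one string by repeated concatenation may become slow, and it also drops the line breaks. A hedged alternative is to work line by line, as in the outline from the first post. The function name tag_file, its two parameters, the helper check_case and the "other" fallback tag are assumptions for this sketch; the caller supplies the actual paths.

```python
def check_case(word):
    """Return a case tag for one word (tag names follow the first post)."""
    if word.islower():
        return "nocap"
    if word.isupper():
        return "allcaps"
    if word.istitle():
        return "cap"
    return "other"  # assumption: fallback for mixed case, digits, etc.

def tag_file(in_path, out_path):
    """Tag a text file line by line, writing one output line per input line.

    Returns the number of lines processed.
    """
    count = 0
    # The with statement closes both files even if an error occurs.
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            tagged = [w.lower() + "/" + check_case(w) for w in line.split()]
            dst.write(" ".join(tagged) + "\n")
            count += 1
    return count
```

Calling tag_file once per corpus file keeps only one line in memory at a time, and the returned count can be used for progress reporting.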
Also, if you're doing natural language processing in Python, you should get NLTK.