|  | Trying to fix Invalid CSV File |  | |
| | | Ryan Rosario |  |
| Posted: Mon Aug 04, 2008 3:43 am Post subject: Trying to fix Invalid CSV File |  |
I have a very large CSV file that contains double quoted fields (since they contain commas). Unfortunately, some of these fields also contain other double quotes and I made the painful mistake of forgetting to escape or double the quotes inside the field:
123,"Here is some, text "and some quoted text" where the quotes should have been doubled",321
Has anyone dealt with this problem before? Any ideas of an algorithm I can use for a Python script to create a new, repaired CSV file?
TIA, Ryan |
| |
| | | Emile van Sebille |  |
| Posted: Mon Aug 04, 2008 3:43 am Post subject: Re: Trying to fix Invalid CSV File |  |
Ryan Rosario wrote:
| Quote: | I have a very large CSV file that contains double quoted fields (since they contain commas). Unfortunately, some of these fields also contain other double quotes and I made the painful mistake of forgetting to escape or double the quotes inside the field:
123,"Here is some, text "and some quoted text" where the quotes should have been doubled",321
|
rec = '''123,"Here is some, text "and some quoted text" where the quotes should have been doubled",321'''
import csv
csv.reader([rec.replace(',"',',"""') .replace('",','""",') .replace('"""',"'''") .replace('"','""') .replace("'''",'"')]).next()
['123', 'Here is some, text "and some quoted text" where the quotes should have been doubled', '321']
:))
Emile
| Quote: | Has anyone dealt with this problem before? Any ideas of an algorithm I can use for a Python script to create a new, repaired CSV file?
TIA, Ryan -- LINK
|
|
| |
| | | Ryan Rosario |  |
| Posted: Mon Aug 04, 2008 7:49 am Post subject: Re: Trying to fix Invalid CSV File |  |
| |  | |
On Aug 3, 10:38 pm, Emile van Sebille <em...@fenx.com> wrote:
| Quote: | Ryan Rosario wrote: I have a very large CSV file that contains double quoted fields (since they contain commas). Unfortunately, some of these fields also contain other double quotes and I made the painful mistake of forgetting to escape or double the quotes inside the field:
123,"Here is some, text "and some quoted text" where the quotes should have been doubled",321
rec = '''123,"Here is some, text "and some quoted text" where the quotes should have been doubled",321'''
import csv
csv.reader([rec.replace(',"',',"""') .replace('",','""",') .replace('"""',"'''") .replace('"','""') .replace("'''",'"')]).next()
['123', 'Here is some, text "and some quoted text" where the quotes should have been doubled', '321']
:))
Emile
Has anyone dealt with this problem before? Any ideas of an algorithm I can use for a Python script to create a new, repaired CSV file?
TIA, Ryan -- LINK
|
Thanks Emile! Works almost perfectly, but is there some way I can adapt this to quote fields that contain a comma in them?
TIA, Ryan |
| |
| | | John Machin |  |
| Posted: Mon Aug 04, 2008 8:01 am Post subject: Re: Trying to fix Invalid CSV File |  |
On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.com> wrote:
| Quote: | Thanks Emile! Works almost perfectly, but is there some way I can adapt this to quote fields that contain a comma in them?
|
You originally said "I have a very large CSV file that contains double quoted fields (since they contain commas)". Are you now saying that if a field contained a comma, you didn't wrap the field in quotes? Or is this a separate question unrelated to your original problem? |
| |
| | | Ryan Rosario |  |
| Posted: Mon Aug 04, 2008 8:15 am Post subject: Re: Trying to fix Invalid CSV File |  |
| |  | |
On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.net> wrote:
| Quote: | On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.com> wrote:
Thanks Emile! Works almost perfectly, but is there some way I can adapt this to quote fields that contain a comma in them?
You originally said "I have a very large CSV file that contains double quoted fields (since they contain commas)". Are you now saying that if a field contained a comma, you didn't wrap the field in quotes? Or is this a separate question unrelated to your original problem?
|
I enclosed all text fields within quotes. The problem is that I have quotes embedded inside those text fields as well and I did not double/ escape them. Emile's snippet takes care of the escaping but it strips the outer quotes from the text fields and if there are commas inside the text field, the field is split into multiple fields. Of course, it is possible that I am not using the snippet correctly I suppose. |
| |
| | | John Machin |  |
| Posted: Mon Aug 04, 2008 9:34 am Post subject: Re: Trying to fix Invalid CSV File |  |
| |  | |
On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.com> wrote:
| Quote: | On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.net> wrote:
On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.com> wrote:
Thanks Emile! Works almost perfectly, but is there some way I can adapt this to quote fields that contain a comma in them?
You originally said "I have a very large CSV file that contains double quoted fields (since they contain commas)". Are you now saying that if a field contained a comma, you didn't wrap the field in quotes? Or is this a separate question unrelated to your original problem?
I enclosed all text fields within quotes. The problem is that I have quotes embedded inside those text fields as well and I did not double/ escape them. Emile's snippet takes care of the escaping but it strips the outer quotes from the text fields and if there are commas inside the text field, the field is split into multiple fields. Of course, it is possible that I am not using the snippet correctly I suppose.
|
Without you actually showing how you are using it, I can only surmise:
Emile's snippet is pushing it through the csv reading process, to demonstrate that his series of replaces works (on your *sole* example, at least). Note carefully his output for one line is a *list* of fields. The repr() of that list looks superficially like a line of csv input. It looks like you are csv-reading it a second time, using quotechar="'", after stripping off the enclosing []. If this guess is not correct, please show what you are actually doing.
If (as you said) you require a fixed csv file, you need to read the bad file line by line, use Emile's chain of replaces, and write each fixed line out to the new file. |
| |
| | | Emile van Sebille |  |
| Posted: Mon Aug 04, 2008 1:30 pm Post subject: Re: Trying to fix Invalid CSV File |  |
| |  | |
John Machin wrote:
| Quote: | On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.com> wrote: On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.net> wrote:
On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.com> wrote: Thanks Emile! Works almost perfectly, but is there some way I can adapt this to quote fields that contain a comma in them? snip |
| Quote: | Emile's snippet is pushing it through the csv reading process, to demonstrate that his series of replaces works (on your *sole* example, at least).
|
Exactly -- just print out the results of the passed argument:
| Quote: | rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"') |
'123,"Here is some, text ""and some quoted text"" where the quotes should have been doubled",321'
Where it won't work is if any of the field embedded quotes are next to commas.
I'd run it against the file. Presumably, you've got a consistent field count expectation per record. Any resulting record not matching is suspect and will identify records this approach won't address.
There's probably better ways, but sometimes it's fun to create executable line noise. :)
Emile |
| |
| | | Ryan Rosario |  |
| Posted: Mon Aug 04, 2008 4:01 pm Post subject: Re: Trying to fix Invalid CSV File |  |
| |  | |
On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.com> wrote:
| Quote: | John Machin wrote: On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.com> wrote: On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.net> wrote:
On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.com> wrote: Thanks Emile! Works almost perfectly, but is there some way I can adapt this to quote fields that contain a comma in them?
snip
Emile's snippet is pushing it through the csv reading process, to demonstrate that his series of replaces works (on your *sole* example, at least).
Exactly -- just print out the results of the passed argument:
rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"')
'123,"Here is some, text ""and some quoted text"" where the quotes should have been doubled",321'
Where it won't work is if any of the field embedded quotes are next to commas.
I'd run it against the file. Presumably, you've got a consistent field count expectation per record. Any resulting record not matching is suspect and will identify records this approach won't address.
There's probably better ways, but sometimes it's fun to create executable line noise. :)
Emile
|
Thanks for your responses. I think John may be right that I am reading it a second time. I will take a look at the CSV reader documentation and see if that helps. Then once I run it I can see if I need to worry about the comma-next-to-quote issue. |
| |
| | | Larry Bates |  |
| Posted: Mon Aug 04, 2008 6:56 pm Post subject: Re: Trying to fix Invalid CSV File |  |
| |  | |
Ryan Rosario wrote:
| Quote: | On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.com> wrote: John Machin wrote: On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.com> wrote: On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.net> wrote: On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.com> wrote: Thanks Emile! Works almost perfectly, but is there some way I can adapt this to quote fields that contain a comma in them? snip
Emile's snippet is pushing it through the csv reading process, to demonstrate that his series of replaces works (on your *sole* example, at least). Exactly -- just print out the results of the passed argument:
rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"')
'123,"Here is some, text ""and some quoted text"" where the quotes should have been doubled",321'
Where it won't work is if any of the field embedded quotes are next to commas.
I'd run it against the file. Presumably, you've got a consistent field count expectation per record. Any resulting record not matching is suspect and will identify records this approach won't address.
There's probably better ways, but sometimes it's fun to create executable line noise. :)
Emile
Thanks for your responses. I think John may be right that I am reading it a second time. I will take a look at the CSV reader documentation and see if that helps. Then once I run it I can see if I need to worry about the comma-next-to-quote issue.
|
This is a perfect demonstration of why tab delimited files are so much better than comma and quote delimited. Virtually all software can handle table delimited as well as comma and quote delimited, but you would have none of these problems if you had used tab delimited. The chances of tabs being embedded in most data is virtually nil.
-Larry |
| |
| | | John Machin |  |
| Posted: Tue Aug 05, 2008 11:30 pm Post subject: Re: Trying to fix Invalid CSV File |  |
| |  | |
On Aug 5, 6:56 am, Larry Bates <larry.ba...@websafe.com`> wrote:
| Quote: | Ryan Rosario wrote: On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.com> wrote: John Machin wrote: On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.com> wrote: On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.net> wrote: On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.com> wrote: Thanks Emile! Works almost perfectly, but is there some way I can adapt this to quote fields that contain a comma in them? snip
Emile's snippet is pushing it through the csv reading process, to demonstrate that his series of replaces works (on your *sole* example, at least). Exactly -- just print out the results of the passed argument:
rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"')
'123,"Here is some, text ""and some quoted text"" where the quotes should have been doubled",321'
Where it won't work is if any of the field embedded quotes are next to commas.
I'd run it against the file. Presumably, you've got a consistent field count expectation per record. Any resulting record not matching is suspect and will identify records this approach won't address.
There's probably better ways, but sometimes it's fun to create executable line noise. :)
Emile
Thanks for your responses. I think John may be right that I am reading it a second time. I will take a look at the CSV reader documentation and see if that helps. Then once I run it I can see if I need to worry about the comma-next-to-quote issue.
This is a perfect demonstration of why tab delimited files are so much better than comma and quote delimited.
|
No, it's a perfect demonstration of what happens when a protocol is not followed.
| Quote: | Virtually all software can handle table delimited as well as comma and quote delimited, but you would have none of these problems if you had used tab delimited. The chances of tabs being embedded in most data is virtually nil.
|
There may be no tabs in *your* data. There is no guarantee that there are no tabs in a VARCHAR(n) column in somebody else's database. I've seen all of \x00, \t, \n, \r and \x1a (Ctrl-Z (EOF in CP/M, *DOS and Windows text files)).
The possibilities include (1) Don't check (2) check if '\t' in field and raise an exception (3) silently remove tabs; what do you recommend? |
| |
| Page 1 of 2 .:. Goto page 1, 2 Next | |
|
|