How to Java: Unexpected whitespace in Python-generated strings

I am using Python to generate an ASCII file composed of very long lines. Here is one example line (say, line 100 of the file; the '[...]' markers are mine, added to shorten the line):

{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}

If I open the generated ASCII file in ipython:

f = open('myfile', 'r')
print repr(f.readlines()[99])

the expected line is printed correctly (again, the '[...]' markers are mine):

'{6 1,14 1,[...],264 1,270 2,274 2,[...],478 1,479 8,485 1,[...]}\n'
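To check whether the stray whitespace is really in the file, rather than introduced by a viewer or by copy and paste, each pair on a line can be validated programmatically. The following is only a minimal sketch: it assumes every pair has the form `<index> <value>` with a single space and no whitespace inside the value (quoted string values containing spaces would be flagged too), and the name `find_malformed_pairs` and the sample lines are hypothetical.

```python
import re

# One sparse pair: an integer index, exactly one space, then a value
# that contains no whitespace (an assumption made for this sketch).
PAIR_RE = re.compile(r'^\d+ \S+$')


def find_malformed_pairs(line):
    """Return (position, pair) for every pair that is not '<index> <value>'."""
    body = line.rstrip('\n')
    if not (body.startswith('{') and body.endswith('}')):
        raise ValueError('not a sparse data line: %r' % line)
    pairs = body[1:-1].split(',')
    return [(pos, pair) for pos, pair in enumerate(pairs)
            if not PAIR_RE.match(pair)]


# Shortened, hypothetical versions of the lines shown in this post.
good = '{6 1,14 1,264 1,270 2,274 2,478 1,479 8,485 1}\n'
bad = '{6 1,14 1,264 1,270      2,274 2,478 1,4     79 8,485 1}\n'

print(find_malformed_pairs(good))
print(find_malformed_pairs(bad))
```

An empty list for every data line would suggest the bytes on disk are fine and the extra spaces appear somewhere between the file and the program or editor displaying it.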

However, if I open this file with the program that is supposed to read it, the program raises an exception, complaining about an unexpected pair after 478 1.
So I tried opening the file with vim. Vim shows no problem either, but if I copy the line as displayed by vim and paste it into another text editor (in my case TextMate), this is the line I get (the '[...]' markers are mine):

{6 1,14 1,[...],264 1,270      2,274 2,[...],478 1,4     79 8,485 1,[...]}

This line does have a problem after the pair 478 1.
I tried generating the lines in different ways (string concatenation, cStringIO, ...), but I always get this result. With cStringIO, for example, the lines are generated as follows (I tried changing this as well, with no luck):

def _construct_arff(self,attributes,header,data_rows):
   """Create the string representation of a Weka ARFF file.
      *attributes* is a dictionary with attribute_name:attribute_type
        (e.g., 'num_of_days':'NUMERIC')
      *header* is a list of the attributes sorted
        (e.g., ['age','name','num_of_days'])
      *data_rows* is a list of lists with the values, sorted as in the header
        (e.g., [[88,'John',465],[77,'Bob',223]])"""
 
   arff_str = cStringIO.StringIO()
   arff_str.write('@relation %s\n' % self.relation_name)
 
   for idx,att_name in enumerate(header):
     try:
       name = att_name.replace("\\","\\\\").replace("'","\\'")
       arff_str.write("@attribute '%s' %s\n" % (name,attributes[att_name]))
     except UnicodeEncodeError:
       arff_str.write('@attribute unicode_err_%s %s\n' 
                      % (idx,attributes[att_name]))
 
   arff_str.write('@data\n')
   for data_row in data_rows:
     row = []
     for att_idx,att_name in enumerate(header):
       att_type = attributes[att_name]
       value = data_row[att_idx]
       # numeric attributes can be sparse: None and zeros are not written
       if (att_type != constants.ARRF_NUMERIC
           or not (value is None or value == 0)):
         row.append('%s %s' % (att_idx,value))
     arff_str.write('{' + (','.join(row)) + '}\n')
   return arff_str.getvalue()

UPDATE: As you can see from the code above, the function converts a given set of data to the ARFF file format. I noticed that one of the attributes I was creating contained numbers as strings (e.g., '1' instead of 1). By forcing these numbers into integers:

features[name] = int(value)

I recreated the ARFF file successfully. However, I don't see how this, which is a value, can affect the formatting of *att_idx*, which is always an integer, as @JohnMachin and @gnibbler also pointed out (thanks for your answers, by the way). So even though my code runs now, I still don't understand why this happens. How can the value, if not converted to an int, influence the formatting of something else?
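One thing that can be checked directly is how a string value and an integer value behave in the numeric branch of the loop. The sketch below isolates that branch in a hypothetical helper, `sparse_pair`, which is not part of my code. It does not explain the whitespace; it only shows that the index is formatted identically in both cases, while the zero filter is not applied to the string '0'.

```python
def sparse_pair(att_idx, value):
    """Mimic the filter and formatting applied to one numeric attribute.

    Hypothetical helper: returns the 'index value' pair, or None when
    the value is skipped (None or numeric zero).
    """
    if value is None or value == 0:
        return None
    return '%s %s' % (att_idx, value)


# The index is formatted the same way whether the value is an int or a str.
assert sparse_pair(478, 1) == '478 1'
assert sparse_pair(478, '1') == '478 1'

# The filter differs: the string '0' is not equal to 0, so it is written,
# whereas the integer 0 is dropped from the sparse row.
assert sparse_pair(479, 0) is None
assert sparse_pair(479, '0') == '479 0'
```

So with string values the rows can contain pairs that would otherwise be omitted, which changes the length and content of the line but, as far as this sketch shows, not the formatting of *att_idx* itself.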

Monday, April 23, 2012
