textgrid_convert package

Submodules

textgrid_convert.ArgParser module

Functionality to parse CLI arguments.

textgrid_convert.ArgParser.arg_parser = ArgumentParser(prog='sphinx-build', usage=None, description='convert srt and sbv files to Praat textgrid', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True)

Set up read, write, and convert arguments.

textgrid_convert.ParserABC module

Abstract base class for implementing transcription parsers

class textgrid_convert.ParserABC.ParserABC

Bases: object

Abstract base class for Parsers to feed textgrid conversion

from_file()

Read file from disk

parse_timestamp(timestamp)

Convert timestamp to datetime.time

Parameters:timestamp (str) –
Returns:timestamp in milliseconds
parse_transcription(transcription)

Convert transcription input to transcription dictionary

Parameters:transcription (str) –
to_file()

Write file to disk

to_textgrid(input_dict=None, output_file=None, speaker_name='Speaker1', adapt_endstamps=0.001)

FIXME: add output_file. Convert internal dict to Praat TextGrid format. Specs here: http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html. Time needs to be in seconds.milliseconds, rounded to 2 decimal places.

Parameters:
  • speaker_name (str) –
  • adapt_endstamps (float) – if given, will adapt end stamps so that each ends before the following start stamp
Returns:

TextGrid compatible string

transcription = None
transcription_dict = None
unique_id = None

textgrid_convert.iotools module

Collect read and write functions.

textgrid_convert.iotools.filewriter(filename, outstring, strict=True)
Parameters:
  • filename (str) –
  • outstring (str) –
  • strict (Bool) – if True, will not overwrite
Returns:

True, False

textgrid_convert.preproctools module

Data preprocessing tools.

textgrid_convert.preproctools.adapt_timestamps(input_dict, gap=0.1)

Adapt time end stamps to not overlap with following start time stamp.

Parameters:
  • input_dict (dict) – dictionary with timestamps, e.g. self.transcription_dict in a Parser
  • gap (float) – gap to introduce between end and start index after adapt
Returns:

dict
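
The adjustment described above can be illustrated with a minimal sketch. It is not the package's implementation: the function name adapt_end_stamps is hypothetical, and it assumes that chunk ids sort in time order and that the stamps and the gap share one unit.

```python
def adapt_end_stamps(chunks, gap=0.1):
    """Return a copy of chunks in which no end stamp overlaps the next start stamp.

    Hypothetical helper: assumes chunk ids sort in time order and that
    "start", "end" and gap are expressed in the same unit.
    """
    ordered = sorted(chunks)
    adapted = {chunk_id: dict(chunks[chunk_id]) for chunk_id in ordered}
    for current_id, next_id in zip(ordered, ordered[1:]):
        limit = adapted[next_id]["start"] - gap
        if adapted[current_id]["end"] > limit:
            # Pull the end back, but never before the chunk's own start.
            adapted[current_id]["end"] = max(adapted[current_id]["start"], limit)
    return adapted


chunks = {
    1: {"text": "hello", "start": 0.0, "end": 2.5},
    2: {"text": "world", "start": 2.0, "end": 4.0},
}
result = adapt_end_stamps(chunks, gap=0.5)
```

Here the first chunk's end is moved from 2.5 to 1.5, while the input dictionary is left unchanged.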

textgrid_convert.revParser module

textgrid_convert.revParser.parse_revstamp(timestamp)

Convert timestamp from rev format (00:00:20,000) to ms

Parameters:timestamp (str) –
Returns:int
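
As an illustration of the conversion, the following sketch turns a comma-separated clock string into integer milliseconds. The name comma_stamp_to_ms is hypothetical and the code assumes a three-digit millisecond field; the package's own parsing may differ.

```python
def comma_stamp_to_ms(timestamp):
    """Convert an 'HH:MM:SS,mmm' string to integer milliseconds.

    Hypothetical helper: assumes the part after the comma is already
    expressed in milliseconds (three digits).
    """
    clock, _, millis = timestamp.partition(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis or 0)


twenty_seconds = comma_stamp_to_ms("00:00:20,000")
```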
class textgrid_convert.revParser.revParser(transcription)

Bases: textgrid_convert.ParserABC.ParserABC

The transcription dict is formatted like so: {chunk_id (int): {"speaker_name": "", "text": "", "start": float, "end": float}}

parse_timestamp(timestamp)

Convert from rev timestamps to ms

parse_transcription(speaker=None)

Specs are here: https://www.rev.com/api/attachmentsgetcontent

speakers = ()
to_darla_textgrid(speaker_id=None, alias='sentence')

Change TextGrid to the format DARLA understands: only “sentence” grids

Parameters:speaker_id (int) – ID of the speaker to keep, will default to first found
Returns:str to be fed into DARLA

textgrid_convert.sbvParser module

class textgrid_convert.sbvParser.sbvParser(transcription)

Bases: textgrid_convert.ParserABC.ParserABC

Read and parse an sbv-formatted file. Unofficial specs here: GGL

file_name
Type:optional
sbv_text
Type:str
parse_timestamp(timestamp)

Convert timestamps from sbv format 0:00:00.599 to ms

parse_transcription(transcription, time_stamp_sep=', ')

Parse the sbv content into a dictionary of the format {chunk_id: {"speaker": str, "text": str, "start": int, "end": int}}

Parameters:
  • transcription (str) –
  • time_stamp_sep (str) –
Returns:

dict as described above

sbv_generator(filein, separator='')
Parameters:
  • filein (file read object or other iterable) –
  • separator (str) – separator between records
Returns:

generator over chunk_id, timestamp, text. FIXME: deque here

sbv_textparse(speaker_and_text, speaker='Speaker 1', speaker_regex=re.compile('[A-Z]+:'))
Parameters:speaker_and_text (str) –
Returns:tuple (SPEAKER(str), text(str))
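
A minimal sketch of how a speaker label could be split from the text with the default pattern shown in the signature. The function name split_speaker is hypothetical, and the choices to match only at the start of the string and to strip the trailing colon are assumptions, not confirmed behaviour.

```python
import re

SPEAKER_REGEX = re.compile("[A-Z]+:")


def split_speaker(speaker_and_text, speaker="Speaker 1", speaker_regex=SPEAKER_REGEX):
    """Return (speaker, text); fall back to the default speaker if no label is found."""
    match = speaker_regex.match(speaker_and_text)
    if match is None:
        return speaker, speaker_and_text.strip()
    # Assumption: the label is reported without its trailing colon.
    name = match.group(0).rstrip(":")
    return name, speaker_and_text[match.end():].strip()


labelled = split_speaker("ANNA: hello there")
unlabelled = split_speaker("no label here")
```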
to_darla_textgrid(speaker_id=None, speaker_name=None, alias='sentence')

Change TextGrid to the format DARLA understands: only “sentence” grids

Parameters:
  • speaker_id (int) – NA for sbvs
  • speaker_name (str) – name of the speaker to extract
  • alias – the name to use for texttier – DARLA wants ‘sentence’
Returns:

str to be fed into DARLA

textgrid_convert.srtParser module

class textgrid_convert.srtParser.srtParser(transcription)

Bases: textgrid_convert.ParserABC.ParserABC

Read and parse an srt-formatted file. Unofficial specs here: http://forum.doom9.org/showthread.php?p=470941#post470941

file_name
Type:optional
srt_text
Type:str
parse_timestamp(timestamp)

Convert from srt style timestamp 00:59:58,89 to ms

Parameters:timestamp (str) –
Returns:int
parse_transcription(srt_text=None, speaker_name='Speaker 1', time_stamp_sep=' --> ')

Parse the srt content into a dictionary of the format {chunk_id: {"text": "", "start": int, "end": int}}

Parameters:
  • srt_text (str) –
  • speaker_name (str) –
  • time_stamp_sep (str) – placeholder between start and end time stamp
Returns:

dict as described above

srt_generator(filein, separator='\n')
Parameters:
  • filein (file read object or other iterable) –
  • separator (str) – separator between records
Returns:

generator over chunk_id, timestamp, text
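
The record-splitting idea can be sketched as a plain generator over lines. This is an illustration only: iter_srt_records and to_record are hypothetical names, and the sketch assumes every record consists of an id line, a timestamp line, and one or more text lines, with records divided by a line equal to the separator.

```python
def to_record(record_lines):
    """Split the collected lines of one record into (chunk_id, timestamp, text)."""
    chunk_id, timestamp, *text_lines = record_lines
    return int(chunk_id), timestamp, " ".join(text_lines)


def iter_srt_records(lines, separator="\n"):
    """Yield (chunk_id, timestamp, text) for each separator-delimited record."""
    record = []
    for line in lines:
        if line == separator:
            if record:
                yield to_record(record)
                record = []
        else:
            record.append(line.rstrip("\n"))
    if record:  # the last record may not be followed by a separator
        yield to_record(record)


sample = [
    "1\n", "00:00:01,000 --> 00:00:04,000\n", "Hello\n", "there\n", "\n",
    "2\n", "00:00:05,000 --> 00:00:06,000\n", "Bye\n",
]
records = list(iter_srt_records(sample))
```

Because the function only iterates over its input, an open file object can be passed in place of the list.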

to_darla_textgrid(speaker_id=None, speaker_name=None, alias='sentence')

Change TextGrid to the format DARLA understands: only “sentence” grids

Parameters:
  • speaker_id (int) – NA for sbvs
  • speaker_name (str) – name of the speaker to extract
  • alias – the name to use for texttier – DARLA wants ‘sentence’
Returns:

str to be fed into DARLA

textgrid_convert.textgridtools module

Collect TextGrid related functionality here

textgrid_convert.textgridtools.collect_chunk_values(input_dict, key, strict=True)

Collect all values associated with chunks in input_dict.

Parameters:
  • input_dict (dict) – {chunk_id: {key: value}}
  • key (str) –
  • strict (Bool) – if True, will error out if key not present
Returns:

list of results
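
A short sketch of the collection step, under stated assumptions: the name collect_values is hypothetical, chunks are visited in sorted id order, and a missing key raises KeyError in strict mode. The exception type actually raised by the package is not documented here.

```python
def collect_values(input_dict, key, strict=True):
    """Collect input_dict[chunk_id][key] for every chunk, in sorted id order."""
    values = []
    for chunk_id in sorted(input_dict):
        chunk = input_dict[chunk_id]
        if key in chunk:
            values.append(chunk[key])
        elif strict:
            # Assumption: strict mode signals a missing key with KeyError.
            raise KeyError(f"chunk {chunk_id!r} has no key {key!r}")
    return values


chunks = {2: {"text": "world"}, 1: {"text": "hello", "start": 0}}
texts = collect_values(chunks, "text")
starts = collect_values(chunks, "start", strict=False)
```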

textgrid_convert.textgridtools.ms_to_textgrid(milliseconds, strict=True)

Convert milliseconds to a TextGrid-appropriate format, e.g. 12.88

Parameters:
  • milliseconds (int) –
  • strict (Bool) – if True, will error out if no int given
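
A minimal sketch of the conversion, assuming the target is seconds rounded to two decimal places. The name ms_to_seconds, the float return type, and the use of TypeError in strict mode are assumptions made for illustration.

```python
def ms_to_seconds(milliseconds, strict=True):
    """Convert integer milliseconds to seconds rounded to two decimals."""
    if not isinstance(milliseconds, int):
        if strict:
            # Assumption: strict mode rejects non-integer input with TypeError.
            raise TypeError(f"expected int, got {type(milliseconds).__name__}")
        milliseconds = int(milliseconds)
    return round(milliseconds / 1000, 2)


seconds = ms_to_seconds(12880)
```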
textgrid_convert.textgridtools.to_long_textgrid(tier_dict, tier_key='speaker_name', tier_class='IntervalTier')

Create long-form TextGrid, cf. specs here: http://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html

Parameters:
  • tier_dict (dict) –
  • tier_key (str) –
  • tier_class (str) – tier class to use for TextGrid
Returns:

str of TextGrid
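
For orientation, the sketch below writes a single-tier TextGrid in the long text format described in the Praat manual. It is not the package's implementation: the function name long_textgrid and the list-of-tuples input are hypothetical, times are assumed to be in seconds, and gaps between intervals are not filled with empty intervals.

```python
def long_textgrid(intervals, tier_name="Speaker 1", tier_class="IntervalTier"):
    """Build a long-format TextGrid string from (start, end, text) tuples.

    Hypothetical helper: expects a non-empty list sorted by start time.
    """
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        f'        class = "{tier_class}"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for index, (start, end, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # quotes inside strings are doubled
        lines += [
            f"        intervals [{index}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


grid = long_textgrid([(0.0, 1.5, "hello"), (1.5, 3.0, 'say "hi"')])
```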

textgrid_convert.textgridtools.to_short_textgrid(tier_dict)
textgrid_convert.textgridtools.to_textgrid_time(timestamp, split_char='.')

FIXME: deprecate. Output needs to be in milliseconds, round to 2

Parameters:timestamp (str) –

Returns

textgrid_convert.ttextgrid_convert module

textgrid_convert.ttextgrid_convert.convert_to_darla(input_file, source_format, speaker_name='Speaker 1')

Convert from source_format in input_file to DARLA-compatible TextGrid

Parameters:
  • input_file (str) – path to input srt or sbv file to read from
  • source_format (str) – either sbv or srt
  • speaker_name (str) – optional speaker name
Returns:

TextGrid formatted string

textgrid_convert.ttextgrid_convert.convert_to_txtgrid(input_file, source_format, speaker_name='Speaker 1')

Convert from source_format in input_file to TextGrid.

Parameters:
  • input_file (str) – path to input srt or sbv file to read from
  • source_format (str) – either sbv or srt
  • speaker_name (str) – optional speaker name
Returns:

TextGrid formatted string

textgrid_convert.ttextgrid_convert.folder_source_format(input_folder, file_types=['.srt', '.sbv', '.json', '.rev'])

Check whether files in input_folder have sbv or srt endings

Parameters:
  • input_folder (str) –
  • file_types (iterable of str) – file endings to consider
Returns:

str: srt or sbv

Raises:

ValueError if mix of extensions
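
The check can be sketched over a list of file names rather than a folder, which keeps the example self-contained. The name common_source_format is hypothetical, and returning None when no known extension is present is an assumption.

```python
from pathlib import PurePath


def common_source_format(file_names, file_types=(".srt", ".sbv", ".json", ".rev")):
    """Return the single known extension shared by file_names, without the dot."""
    found = {PurePath(name).suffix.lower() for name in file_names}
    found &= set(file_types)  # ignore files with other endings
    if len(found) > 1:
        raise ValueError(f"mix of extensions: {sorted(found)}")
    # Assumption: None signals that no known extension was found.
    return found.pop().lstrip(".") if found else None


source_format = common_source_format(["a.srt", "b.srt", "notes.txt"])
```

For a real folder, the names could come from os.listdir(input_folder).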

textgrid_convert.ttextgrid_convert.guess_source_format(input_path, extension_map={'json': 'rev', 'sbv': 'sbv', 'srt': 'srt'})

Based on file extension of input_path, guess the format of transcription file.

Parameters:
  • input_path (str) – file name
  • extension_map (dict) – dictionary {file_ext: format}, e.g. {"srt": "srt"}
Returns:

format string, None if not found in extension_map
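
A minimal sketch of an extension lookup using the default map from the signature. The name guess_format is hypothetical, and lower-casing the extension before the lookup is an assumption.

```python
from pathlib import PurePath

EXTENSION_MAP = {"json": "rev", "sbv": "sbv", "srt": "srt"}


def guess_format(input_path, extension_map=EXTENSION_MAP):
    """Map the file extension of input_path to a format name, or None."""
    extension = PurePath(input_path).suffix.lstrip(".").lower()
    return extension_map.get(extension)


guessed = guess_format("interview.json")
```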

textgrid_convert.ttextgrid_convert.main(source_format, to, input_path, output_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textgrid-convert/checkouts/stable/docs'), suffix='_TEXTGRID.txt', strict=True)

Convert file(s) in input_path from source_format to TextGrid. Optionally, write to output_path. Example: convert from=sbv to=TextGrid and write to output_path="home/patrick/output"

Parameters:
  • source_format (str) – file ending, currently accepts sbv and srt
  • to (str) – file ending, only accepts TextGrid atm
  • input_path (str) –
  • output_path (str) –
  • suffix (str) – string to append to file name for writing out TextGrid
  • strict (Bool) – if True, will not overwrite files

Module contents