textgrid_convert package¶

Submodules¶

textgrid_convert.ArgParser module¶

Functionality to parse CLI arguments.

textgrid_convert.ArgParser.arg_parser = ArgumentParser(prog='sphinx-build', usage=None, description='convert srt and sbv files to Praat textgrid', formatter_class=<class 'argparse.HelpFormatter'>, conflict_handler='error', add_help=True)¶: Set up the read, write, and convert arguments.

textgrid_convert.ParserABC module¶

Abstract base class for implementing transcription parsers.

class textgrid_convert.ParserABC.ParserABC¶

Bases: object

Abstract base class for Parsers to feed textgrid conversion

from_file()¶: Read file from disk

parse_timestamp(timestamp)¶

Convert a timestamp to milliseconds.

Parameters:	timestamp (str) –
Returns:	timestamp in milliseconds

parse_transcription(transcription)¶

Convert transcription input to transcription dictionary

Parameters:	transcription (str) –

to_file()¶: Write file to disk

to_textgrid(input_dict=None, output_file=None, speaker_name='Speaker1', adapt_endstamps=0.001)¶

FIXME: add output_file.

Convert the internal dict to Praat TextGrid format. "Specs" are here: http://www.fon.hum.uva.nl/praat/manual/Intro_7__Annotation.html

Times need to be given as seconds.milliseconds, rounded to 2 decimal places.

Parameters:	speaker_name (str) –
	adapt_endstamps (float) – if given, end stamps are adapted so that they fall before the following start stamp
Returns:	TextGrid compatible string

transcription = None¶

transcription_dict = None¶

unique_id = None¶

textgrid_convert.iotools module¶

Collect read and write functions.

textgrid_convert.iotools.filewriter(filename, outstring, strict=True)¶

Parameters:	filename (str) –
	outstring (str) –
	strict (bool) – if True, will not overwrite an existing file
Returns:	True or False
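
The strict flag can be illustrated with a minimal standalone sketch. The function name write_text_file is hypothetical, and the meaning of the return value (True when the file was written, False when an existing file was left untouched) is an assumption; the reference above only states that the result is True or False.

```python
from pathlib import Path


def write_text_file(filename, outstring, strict=True):
    """Write outstring to filename (hypothetical helper, not the library's code).

    Assumption: True means the file was written, False means an existing
    file was left untouched because strict was set.
    """
    path = Path(filename)
    if strict and path.exists():
        # Refuse to overwrite in strict mode.
        return False
    path.write_text(outstring, encoding="utf-8")
    return True
```

With strict=False the same call replaces the existing file, which is convenient when re-running a batch conversion.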

textgrid_convert.preproctools module¶

Data preprocessing tools.

textgrid_convert.preproctools.adapt_timestamps(input_dict, gap=0.1)¶

Adapt time end stamps to not overlap with following start time stamp.

Parameters:	input_dict (dict) – dictionary with timestamps, e.g. self.transcription_dict in a Parser
	gap (float) – gap to introduce between an end stamp and the following start stamp after adapting
Returns:	dict
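
One way to picture this adjustment is the sketch below. It is not the library's implementation: the name close_overlaps is hypothetical, and it assumes that each chunk holds numeric "start" and "end" values in the same unit as gap, and that an end stamp is only changed when it comes closer than gap to the next start.

```python
import copy


def close_overlaps(chunks, gap=0.1):
    """Return a copy of chunks whose end stamps leave `gap` before the next start.

    Hypothetical sketch; `chunks` maps chunk_id -> {"start": ..., "end": ...}.
    """
    adapted = copy.deepcopy(chunks)  # leave the caller's dict untouched
    ordered = sorted(adapted, key=lambda chunk_id: adapted[chunk_id]["start"])
    for current, following in zip(ordered, ordered[1:]):
        limit = adapted[following]["start"] - gap
        if adapted[current]["end"] > limit:
            # Never move an end stamp before its own start stamp.
            adapted[current]["end"] = max(adapted[current]["start"], limit)
    return adapted
```

Working on a deep copy keeps the original transcription dict available for comparison after the adjustment.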

textgrid_convert.revParser module¶

textgrid_convert.revParser.parse_revstamp(timestamp)¶

Convert timestamp from rev format (00:00:20,000) to ms

Parameters:	timestamp (str) –
Returns:	int
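
As a rough illustration of the conversion described above, the following standalone sketch turns a timestamp of the shape HH:MM:SS,mmm into integer milliseconds. The name timestamp_to_ms is hypothetical, and the handling of a missing millisecond part is an assumption rather than documented behaviour.

```python
def timestamp_to_ms(timestamp):
    """Convert 'HH:MM:SS,mmm' to integer milliseconds (hypothetical sketch)."""
    clock, _, millis = timestamp.strip().partition(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    total_seconds = (hours * 60 + minutes) * 60 + seconds
    # Assumption: a missing fractional part counts as 0 ms.
    return total_seconds * 1000 + int(millis or 0)


print(timestamp_to_ms("00:00:20,000"))  # prints 20000
```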

class textgrid_convert.revParser.revParser(transcription)¶

Bases: textgrid_convert.ParserABC.ParserABC

The transcription dict is formatted like so: {chunk_id (int): {"speaker_name": "", "text": "", "start": float, "end": float}}

parse_timestamp(timestamp)¶: Convert from rev timestamps to ms

parse_transcription(speaker=None)¶: Specs are here: https://www.rev.com/api/attachmentsgetcontent

speakers = ()¶

to_darla_textgrid(speaker_id=None, alias='sentence')¶

Change TextGrid to the format DARLA understands: only “sentence” grids

Parameters:	speaker_id (int) – ID of the speaker to keep, will default to first found
Returns:	str to be fed into DARLA

textgrid_convert.sbvParser module¶

class textgrid_convert.sbvParser.sbvParser(transcription)¶

Bases: textgrid_convert.ParserABC.ParserABC

Read and parse an sbv-formatted file. Unofficial specs here: GGL

file_name¶

Type:	optional

sbv_text¶

Type:	str

parse_timestamp(timestamp)¶: Convert timestamps from sbv format 0:00:00.599 to ms

parse_transcription(transcription, time_stamp_sep=', ')¶

Parse the sbv content into a dictionary of the format {chunk_id: {"speaker": str, "text": str, "start": int, "end": int}}

Parameters:	transcription (str) –
	time_stamp_sep (str) –
Returns:	dict as described above

sbv_generator(filein, separator='')¶

Parameters:	filein (file read object or other iterable) –
	separator (str) – separator between records
Returns:	generator over chunk_id, timestamp, text

FIXME: deque here

sbv_textparse(speaker_and_text, speaker='Speaker 1', speaker_regex=re.compile('[A-Z]+:'))¶

Parameters:	speaker_and_text (str) –
Returns:	tuple (SPEAKER(str), text(str))
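
The default speaker_regex in the signature suggests that an upper-case prefix such as "ANNA:" marks the speaker. The sketch below shows one plausible reading of that behaviour. The name split_speaker is hypothetical, and both stripping the trailing colon and falling back to the default speaker when nothing matches are assumptions.

```python
import re

SPEAKER_REGEX = re.compile(r"[A-Z]+:")  # same pattern as the documented default


def split_speaker(speaker_and_text, speaker="Speaker 1", speaker_regex=SPEAKER_REGEX):
    """Split 'NAME: text' into (speaker, text); hypothetical sketch."""
    match = speaker_regex.match(speaker_and_text)
    if match is None:
        # No speaker prefix found: keep the default speaker name.
        return speaker, speaker_and_text.strip()
    name = match.group(0).rstrip(":")
    return name, speaker_and_text[match.end():].strip()
```

Using match rather than search means only a prefix at the very start of the text is treated as a speaker label, so a capitalised word followed by a colon later in the line stays part of the text.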

to_darla_textgrid(speaker_id=None, speaker_name=None, alias='sentence')¶

Change TextGrid to the format DARLA understands: only “sentence” grids

Parameters:	speaker_id (int) – NA for sbvs
	speaker_name (str) – name of the speaker to extract
	alias – the name to use for the text tier; DARLA wants 'sentence'
Returns:	str to be fed into DARLA

textgrid_convert.srtParser module¶

class textgrid_convert.srtParser.srtParser(transcription)¶

Bases: textgrid_convert.ParserABC.ParserABC

Read and parse an srt-formatted file. Unofficial specs here: http://forum.doom9.org/showthread.php?p=470941#post470941

file_name¶

Type:	optional

srt_text¶

Type:	str

parse_timestamp(timestamp)¶

Convert from srt style timestamp 00:59:58,89 to ms

Parameters:	timestamp (str) –
Returns:	int

parse_transcription(srt_text=None, speaker_name='Speaker 1', time_stamp_sep=' --> ')¶

Parse the srt content into a dictionary of the format {chunk_id: {"text": "", "start": int, "end": int}}

Parameters:	srt_text (str) –
	speaker_name (str) –
	time_stamp_sep (str) – placeholder between start and end time stamp
Returns:	dict as described above
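
To make the target dictionary concrete, here is a minimal standalone sketch that parses srt text into that shape. It is an illustration under stated assumptions, not the library's code: parse_srt_sketch and srt_stamp_to_ms are hypothetical names, records are assumed to be separated by blank lines, and multi-line captions are joined with a single space.

```python
import re


def srt_stamp_to_ms(stamp):
    """Convert 'HH:MM:SS,mmm' to integer milliseconds."""
    clock, _, millis = stamp.strip().partition(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis or 0)


def parse_srt_sketch(srt_text, time_stamp_sep=" --> "):
    """Return {chunk_id: {"text": str, "start": int, "end": int}}."""
    chunks = {}
    normalized = srt_text.replace("\r\n", "\n").strip()
    for record in re.split(r"\n\s*\n", normalized):
        lines = record.splitlines()
        if len(lines) < 2:
            continue  # skip empty or truncated records
        start, end = lines[1].split(time_stamp_sep)
        chunks[int(lines[0])] = {
            "text": " ".join(lines[2:]),
            "start": srt_stamp_to_ms(start),
            "end": srt_stamp_to_ms(end),
        }
    return chunks
```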

srt_generator(filein, separator='\n')¶

Parameters:	filein (file read object or other iterable) –
	separator (str) – separator between records
Returns:	generator over chunk_id, timestamp, text

to_darla_textgrid(speaker_id=None, speaker_name=None, alias='sentence')¶

Change TextGrid to the format DARLA understands: only “sentence” grids

Parameters:	speaker_id (int) – NA for sbvs
	speaker_name (str) – name of the speaker to extract
	alias – the name to use for the text tier; DARLA wants 'sentence'
Returns:	str to be fed into DARLA

textgrid_convert.textgridtools module¶

Collect TextGrid related functionality here

textgrid_convert.textgridtools.collect_chunk_values(input_dict, key, strict=True)¶

Collect all values associated with chunks in input_dict.

Parameters:	input_dict (dict) – {chunk_id: {key: value}}
	key (str) –
	strict (bool) – if True, will error out if key not present
Returns:	list of results
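
The collection step can be sketched as follows. The name collect_values is hypothetical, and raising KeyError in strict mode is an assumption, since the reference only says the function "will error out".

```python
def collect_values(input_dict, key, strict=True):
    """Collect chunk[key] for every chunk in input_dict (hypothetical sketch)."""
    values = []
    for chunk_id, chunk in input_dict.items():
        if key in chunk:
            values.append(chunk[key])
        elif strict:
            # Assumption: the concrete exception type is not documented.
            raise KeyError(f"chunk {chunk_id!r} has no key {key!r}")
    return values
```

In non-strict mode, chunks without the key are simply skipped, so the result can be shorter than the input.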

textgrid_convert.textgridtools.ms_to_textgrid(milliseconds, strict=True)¶

Convert milliseconds to a TextGrid-appropriate format, e.g. 12.88

Parameters:	milliseconds (int) –
	strict (bool) – if True, will error out if no int given
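
A minimal sketch of this conversion is shown below. The name ms_to_seconds_string is hypothetical; returning a string with two decimal places and raising TypeError in strict mode are both assumptions, as the reference does not state the return type or the exception.

```python
def ms_to_seconds_string(milliseconds, strict=True):
    """Format integer milliseconds as seconds with two decimals (sketch)."""
    is_int = isinstance(milliseconds, int) and not isinstance(milliseconds, bool)
    if strict and not is_int:
        # Assumption: the concrete exception type is not documented.
        raise TypeError(f"expected int, got {type(milliseconds).__name__}")
    return f"{milliseconds / 1000:.2f}"


print(ms_to_seconds_string(12880))  # prints 12.88
```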

textgrid_convert.textgridtools.to_long_textgrid(tier_dict, tier_key='speaker_name', tier_class='IntervalTier')¶

Create a long-form TextGrid, cf. the specs here: http://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html.

Parameters:	tier_dict (dict) –
	tier_key (str) –
	tier_class (str) – tier class to use for TextGrid
Returns:	str of TextGrid
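
For orientation, the sketch below renders a single tier in the long TextGrid text format described in the Praat manual. It is not the library's implementation: the name long_textgrid_sketch is hypothetical, and it takes a list of (start_seconds, end_seconds, text) tuples instead of the tier_dict used above. It also does not insert empty intervals for gaps between chunks.

```python
def long_textgrid_sketch(intervals, tier_name="Speaker1", tier_class="IntervalTier"):
    """Render (start_s, end_s, text) tuples as one long-form TextGrid tier."""
    if not intervals:
        raise ValueError("at least one interval is required")
    xmin = min(start for start, _, _ in intervals)
    xmax = max(end for _, end, _ in intervals)
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin:.2f}",
        f"xmax = {xmax:.2f}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        f'        class = "{tier_class}"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin:.2f}",
        f"        xmax = {xmax:.2f}",
        f"        intervals: size = {len(intervals)}",
    ]
    for index, (start, end, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # the format doubles embedded quotes
        lines += [
            f"        intervals [{index}]:",
            f"            xmin = {start:.2f}",
            f"            xmax = {end:.2f}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"
```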

textgrid_convert.textgridtools.to_short_textgrid(tier_dict)¶

textgrid_convert.textgridtools.to_textgrid_time(timestamp, split_char='.')¶

FIXME: deprecate. Output needs to be in milliseconds; round to 2.

Parameters:	timestamp (str) –

Returns

textgrid_convert.ttextgrid_convert module¶

textgrid_convert.ttextgrid_convert.convert_to_darla(input_file, source_format, speaker_name='Speaker 1')¶

Convert from source_format in input_file to DARLA-compatible TextGrid

Parameters:	input_file (str) – path to input srt or sbv file to read from
	source_format (str) – either sbv or srt
	speaker_name (str) – optional speaker name
Returns:	TextGrid formatted string

textgrid_convert.ttextgrid_convert.convert_to_txtgrid(input_file, source_format, speaker_name='Speaker 1')¶

Convert from source_format in input_file to TextGrid.

Parameters:	input_file (str) – path to input srt or sbv file to read from
	source_format (str) – either sbv or srt
	speaker_name (str) – optional speaker name
Returns:	TextGrid formatted string

textgrid_convert.ttextgrid_convert.folder_source_format(input_folder, file_types=['.srt', '.sbv', '.json', '.rev'])¶

Check whether files in input_folder have sbv or srt endings.

Parameters:	input_folder (str) –
	file_types (iterable of str) – file endings to consider
Returns:	str srt or sbv
Raises:	ValueError if mix of extensions
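
The folder check can be pictured with the standalone sketch below. The name folder_format_sketch is hypothetical. Returning None for a folder without matching files, comparing extensions case-insensitively, and returning the ending without its leading dot are all assumptions.

```python
from pathlib import Path


def folder_format_sketch(input_folder, file_types=(".srt", ".sbv", ".json", ".rev")):
    """Return the single file ending found in input_folder (hypothetical sketch)."""
    found = {
        path.suffix.lower()
        for path in Path(input_folder).iterdir()
        if path.is_file() and path.suffix.lower() in file_types
    }
    if len(found) > 1:
        raise ValueError(f"mixed file endings in folder: {sorted(found)}")
    if not found:
        return None  # assumption: behaviour for an empty folder is not documented
    return found.pop().lstrip(".")
```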

textgrid_convert.ttextgrid_convert.guess_source_format(input_path, extension_map={'json': 'rev', 'sbv': 'sbv', 'srt': 'srt'})¶

Based on file extension of input_path, guess the format of transcription file.

Parameters:	input_path (str) – file name
	extension_map (dict) – dictionary {file_ext: format}, e.g. {"srt": "srt"}
Returns:	format string, None if not found in extension_map
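
The lookup described above amounts to mapping a file extension to a format name. A minimal sketch follows, using the default extension_map from the signature. The name guess_format_sketch is hypothetical, and lower-casing the extension is an assumption.

```python
from pathlib import Path

DEFAULT_EXTENSION_MAP = {"json": "rev", "sbv": "sbv", "srt": "srt"}


def guess_format_sketch(input_path, extension_map=None):
    """Return the format for input_path's extension, or None if unknown."""
    if extension_map is None:
        extension_map = DEFAULT_EXTENSION_MAP
    # Assumption: extensions are compared without the dot and in lower case.
    extension = Path(input_path).suffix.lstrip(".").lower()
    return extension_map.get(extension)
```

Passing None as the default and filling in the map inside the function avoids sharing one mutable dict between calls.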

textgrid_convert.ttextgrid_convert.main(source_format, to, input_path, output_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textgrid-convert/checkouts/stable/docs'), suffix='_TEXTGRID.txt', strict=True)¶

Convert file(s) at input_path from source_format to TextGrid. Optionally, write to output_path. Example: convert from=sbv to=TextGrid and write to output_path="home/patrick/output"

Parameters:	source_format (str) – file ending, currently accepts sbv and srt
	to (str) – file ending, only accepts TextGrid atm
	input_path (str) –
	output_path (str) –
	suffix (str) – string to append to file name for writing out TextGrid
	strict (bool) – if True, will not overwrite files
